[00:06:49] (PS3) Krinkle: clientError: Investigate when mw.util is compromised by third-party script [extensions/WikimediaEvents] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/947912 (https://phabricator.wikimedia.org/T343944)
[00:09:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:10:17] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:11:39] (CR) CI reject: [V: -1] mediawiki.util: Investigate when mw.util is compromised by third-party script [core] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/947913 (https://phabricator.wikimedia.org/T343944) (owner: Krinkle)
[00:13:54] PROBLEM - Check unit status of purge_vm_backup on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit purge_vm_backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:14:42] PROBLEM - Check unit status of purge_vm_backup on cloudbackup2002 is CRITICAL: CRITICAL: Status of the systemd unit purge_vm_backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:19:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:22:40] (CR) Ssingh: [C: +1] Depool esams for duration of esams -> knams migration [dns] - https://gerrit.wikimedia.org/r/947945 (https://phabricator.wikimedia.org/T329219) (owner: Cathal Mooney)
[00:23:45] (CR) Krinkle: "recheck" [core] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/947913 (https://phabricator.wikimedia.org/T343944) (owner: Krinkle)
[00:23:54] (CR) Ssingh: Release 1.9-4 to target bullseye (1 comment) [software/varnish/libvmod-netmapper] (debian) - https://gerrit.wikimedia.org/r/946604 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[00:24:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:24:58] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - No response from remote host 185.15.58.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:26:13] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (Eevans) Resolved→Open @darthmon_wmde this seems to be the same key used to access Wikimedia Cloud Services. Could you please generate a separate SSH key for accessing...
[00:29:32] (ConfdResourceFailed) firing: (223) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:30:18] (ConfdResourceFailed) firing: (223) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:31:07] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/947388 (owner: TrainBranchBot)
[00:31:58] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[00:32:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[00:32:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T342617)', diff saved to https://phabricator.wikimedia.org/P50433 and previous config saved to /var/cache/conftool/dbconfig/20230811-003243-ladsgroup.json
[00:32:47] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[00:34:36] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:35:17] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:35:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:35:50] (CR) Ssingh: "Let's discuss this because we sadly cannot upgrade to 9.2.1 yet, till we resolve the issue with the plugin execution time, independent of " [debs/trafficserver] - https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[00:38:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/947394
[00:38:41] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/947394 (owner: TrainBranchBot)
[00:39:10] (CR) Ssingh: "Irrespective of that, the error we are seeing here is:" [debs/trafficserver] - https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[00:39:44] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:39:46] (CR) Ssingh: Release 9.2.1-1wm2 to target Bookworm (1 comment) [debs/trafficserver] - https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[00:40:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[00:40:20] (ConfdResourceFailed) firing: (223) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:40:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[00:40:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2112 (T342617)', diff saved to https://phabricator.wikimedia.org/P50434 and previous config saved to /var/cache/conftool/dbconfig/20230811-004036-ladsgroup.json
[00:40:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:40:40] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[00:44:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:45:17] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:54:18] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:54:20] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:54:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:55:36] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/947394 (owner: TrainBranchBot)
[00:55:40] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:55:44] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:56:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:59:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:04:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:05:02] (PS2) Krinkle: webperf: Remove Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945782 (owner: Muehlenhoff)
[01:05:19] (CR) Krinkle: [C: +1] webperf: Remove Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945782 (owner: Muehlenhoff)
[01:07:01] (CR) Krinkle: wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/930798 (owner: D3r1ck01)
[01:09:39] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:18:30] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:21:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T342617)', diff saved to https://phabricator.wikimedia.org/P50435 and previous config saved to /var/cache/conftool/dbconfig/20230811-012144-ladsgroup.json
[01:21:49] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[01:24:28] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:29:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:34:37] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:36:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P50436 and previous config saved to /var/cache/conftool/dbconfig/20230811-013650-ladsgroup.json
[01:43:09] !log [WDQS] `ryankemper@wdqs2007:~$ sudo pool` (Caught up on lag)
[01:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:44] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:49:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:51:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P50437 and previous config saved to /var/cache/conftool/dbconfig/20230811-015156-ladsgroup.json
[01:52:46] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:59:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[02:00:30] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:17] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[02:06:40] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T342617)', diff saved to https://phabricator.wikimedia.org/P50438 and previous config saved to /var/cache/conftool/dbconfig/20230811-020703-ladsgroup.json
[02:07:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[02:07:07] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[02:07:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[02:07:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T342617)', diff saved to https://phabricator.wikimedia.org/P50439 and previous config saved to /var/cache/conftool/dbconfig/20230811-020724-ladsgroup.json
[02:11:20] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:34] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[02:19:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T342617)', diff saved to https://phabricator.wikimedia.org/P50440 and previous config saved to /var/cache/conftool/dbconfig/20230811-021914-ladsgroup.json
[02:19:19] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[02:24:00] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:24:39] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[02:28:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T342617)', diff saved to https://phabricator.wikimedia.org/P50441 and previous config saved to /var/cache/conftool/dbconfig/20230811-022820-ladsgroup.json
[02:28:25] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[02:31:34] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P50442 and previous config saved to /var/cache/conftool/dbconfig/20230811-023420-ladsgroup.json
[02:38:10] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:40:24] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[02:43:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P50443 and previous config saved to /var/cache/conftool/dbconfig/20230811-024327-ladsgroup.json
[02:44:33] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[02:49:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P50444 and previous config saved to /var/cache/conftool/dbconfig/20230811-024927-ladsgroup.json
[02:53:36] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:58:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P50445 and previous config saved to /var/cache/conftool/dbconfig/20230811-025833-ladsgroup.json
[03:04:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[03:04:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T342617)', diff saved to https://phabricator.wikimedia.org/P50446 and previous config saved to /var/cache/conftool/dbconfig/20230811-030433-ladsgroup.json
[03:04:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[03:04:38] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[03:04:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[03:04:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T342617)', diff saved to https://phabricator.wikimedia.org/P50447 and previous config saved to /var/cache/conftool/dbconfig/20230811-030454-ladsgroup.json
[03:10:18] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:13:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T342617)', diff saved to https://phabricator.wikimedia.org/P50448 and previous config saved to /var/cache/conftool/dbconfig/20230811-031339-ladsgroup.json
[03:13:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[03:13:44] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[03:13:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[03:14:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T342617)', diff saved to https://phabricator.wikimedia.org/P50449 and previous config saved to /var/cache/conftool/dbconfig/20230811-031400-ladsgroup.json
[03:20:17] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[03:26:02] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:18] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[03:40:04] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:44:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[03:46:20] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:54:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[03:59:40] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:02:41] (PS1) Tim Starling: Set wikidiff2 maxSplitSize = 10 on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/947986 (https://phabricator.wikimedia.org/T341754)
[04:04:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:09:32] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:10:17] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:17:06] (PS1) Tim Starling: ResourceLoader: Forwards-compatible mw.loader.impl() [core] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/947914 (https://phabricator.wikimedia.org/T343407)
[04:18:42] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,fstrim.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:19:32] (ConfdResourceFailed) resolved: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:20:14] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:49] (ConfdResourceFailed) firing: (223) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:31:36] (CR) CI reject: [V: -1] ResourceLoader: Forwards-compatible mw.loader.impl() [core] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/947914 (https://phabricator.wikimedia.org/T343407) (owner: Tim Starling)
[04:32:24] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:49] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:35:32] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:38:55] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:39:06] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:53:04] (PS1) Tim Starling: Downgrade Parsoid in wmf.20 [core] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/947989 (https://phabricator.wikimedia.org/T344032)
[04:53:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T342617)', diff saved to https://phabricator.wikimedia.org/P50450 and previous config saved to /var/cache/conftool/dbconfig/20230811-045307-ladsgroup.json
[04:53:19] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [04:55:05] <_joe_> jbond: please silence config-master servers [04:56:00] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:58:49] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [04:59:02] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [05:01:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T342617)', diff saved to https://phabricator.wikimedia.org/P50451 and previous config saved to /var/cache/conftool/dbconfig/20230811-050110-ladsgroup.json [05:01:15] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [05:03:49] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [05:08:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P50452 and previous config saved to 
/var/cache/conftool/dbconfig/20230811-050814-ladsgroup.json [05:08:26] (03CR) 10CI reject: [V: 04-1] Downgrade Parsoid in wmf.20 [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947989 (https://phabricator.wikimedia.org/T344032) (owner: 10Tim Starling) [05:11:34] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:16:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P50453 and previous config saved to /var/cache/conftool/dbconfig/20230811-051616-ladsgroup.json [05:18:45] (03CR) 10Tim Starling: "recheck" [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947989 (https://phabricator.wikimedia.org/T344032) (owner: 10Tim Starling) [05:18:49] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [05:23:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P50454 and previous config saved to /var/cache/conftool/dbconfig/20230811-052320-ladsgroup.json [05:23:48] !log oblivian@deploy1002 Synchronized private/PrivateSettings.php: Adding proxy vendors (duration: 07m 33s) [05:23:49] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - 
https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [05:27:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T342617)', diff saved to https://phabricator.wikimedia.org/P50455 and previous config saved to /var/cache/conftool/dbconfig/20230811-052731-ladsgroup.json [05:27:35] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [05:31:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P50456 and previous config saved to /var/cache/conftool/dbconfig/20230811-053122-ladsgroup.json [05:33:20] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:38:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T342617)', diff saved to https://phabricator.wikimedia.org/P50457 and previous config saved to /var/cache/conftool/dbconfig/20230811-053826-ladsgroup.json [05:38:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [05:38:31] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [05:38:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [05:38:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T342617)', diff saved to https://phabricator.wikimedia.org/P50458 and previous config saved to /var/cache/conftool/dbconfig/20230811-053847-ladsgroup.json [05:38:58] (ConfdResourceFailed) 
firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [05:42:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P50459 and previous config saved to /var/cache/conftool/dbconfig/20230811-054238-ladsgroup.json [05:43:16] (03CR) 10Tim Starling: "Same nodejs assertion failure the second time" [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947989 (https://phabricator.wikimedia.org/T344032) (owner: 10Tim Starling) [05:43:49] (ConfdResourceFailed) firing: (223) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [05:44:02] (ConfdResourceFailed) firing: (223) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [05:46:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T342617)', diff saved to https://phabricator.wikimedia.org/P50460 and previous config saved to /var/cache/conftool/dbconfig/20230811-054628-ladsgroup.json [05:46:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [05:46:33] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [05:46:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [05:46:50] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Depooling db2130 (T342617)', diff saved to https://phabricator.wikimedia.org/P50461 and previous config saved to /var/cache/conftool/dbconfig/20230811-054649-ladsgroup.json [05:48:49] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [05:53:56] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [05:57:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P50462 and previous config saved to /var/cache/conftool/dbconfig/20230811-055744-ladsgroup.json [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230811T0600) [06:03:22] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:04:08] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:05:54] (03PS1) 10Marostegui: install_server: Do not reimage db2189 [puppet] - 10https://gerrit.wikimedia.org/r/947990 [06:07:10] (03CR) 
10Marostegui: [C: 03+2] install_server: Do not reimage db2189 [puppet] - 10https://gerrit.wikimedia.org/r/947990 (owner: 10Marostegui) [06:09:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Marostegui) [06:12:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T342617)', diff saved to https://phabricator.wikimedia.org/P50463 and previous config saved to /var/cache/conftool/dbconfig/20230811-061250-ladsgroup.json [06:12:55] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [06:18:58] (03CR) 10Ayounsi: [C: 03+1] Depool esams for duration of esams -> knams migration [dns] - 10https://gerrit.wikimedia.org/r/947945 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [06:18:59] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:23:22] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:23:54] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:28:04] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: 
dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:55] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:31:34] (JobUnavailable) firing: (2) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:33:49] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:34:02] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:38:56] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:43:05] (03PS1) 10Ayounsi: Only advertise local customers to external peers [homer/public] - 10https://gerrit.wikimedia.org/r/947993 (https://phabricator.wikimedia.org/T334530) [06:45:30] PROBLEM - Check systemd state on config-master2001 is CRITICAL: 
CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:58] (03PS2) 10Ayounsi: Only advertise local customers to external peers [homer/public] - 10https://gerrit.wikimedia.org/r/947993 (https://phabricator.wikimedia.org/T334530) [06:51:30] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:52:29] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10good first task: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (10ayounsi) [06:53:24] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:53:49] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:55:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [06:58:00] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230811T0700) [07:08:56] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [07:16:20] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:16:37] (03CR) 10Jelto: [C: 03+2] gitlab_runner: add sonar-scanner-cli image to allowed_images [puppet] - 10https://gerrit.wikimedia.org/r/947832 (https://phabricator.wikimedia.org/T343975) (owner: 10Jelto) [07:18:17] (03PS1) 10Ayounsi: Routinator: use tmpfs for cache directory [puppet] - 10https://gerrit.wikimedia.org/r/948079 (https://phabricator.wikimedia.org/T300955) [07:18:41] (03CR) 10CI reject: [V: 04-1] Routinator: use tmpfs for cache directory [puppet] - 10https://gerrit.wikimedia.org/r/948079 (https://phabricator.wikimedia.org/T300955) (owner: 10Ayounsi) [07:20:24] (03PS2) 10Ayounsi: Routinator: use tmpfs for cache directory [puppet] - 10https://gerrit.wikimedia.org/r/948079 (https://phabricator.wikimedia.org/T300955) [07:23:50] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/948079 (https://phabricator.wikimedia.org/T300955) (owner: 10Ayounsi) [07:23:55] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - 
https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [07:26:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T342617)', diff saved to https://phabricator.wikimedia.org/P50464 and previous config saved to /var/cache/conftool/dbconfig/20230811-072559-ladsgroup.json [07:26:03] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [07:28:59] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, and 2 others: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (10ayounsi) a:03ayounsi @MoritzMuehlenhoff is it ok to bump the RAM from 4G to 6G on the rpki* VMs? https://netbox.wikimedia.org/virtualization/virtual-machines/?q=rpki [07:31:00] (03PS1) 10Giuseppe Lavagetto: wikifunctions: add / to the route for wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948080 [07:32:14] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:32:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T342617)', diff saved to https://phabricator.wikimedia.org/P50465 and previous config saved to /var/cache/conftool/dbconfig/20230811-073257-ladsgroup.json [07:33:02] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [07:33:55] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - 
https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [07:34:56] (03CR) 10Legoktm: [C: 03+2] admin: Temporarily disable legoktm's access [puppet] - 10https://gerrit.wikimedia.org/r/947905 (owner: 10Legoktm) [07:35:00] (03PS2) 10Legoktm: admin: Temporarily disable legoktm's access [puppet] - 10https://gerrit.wikimedia.org/r/947905 [07:37:38] (03PS1) 10Ayounsi: Don't advertise small nets to customers [homer/public] - 10https://gerrit.wikimedia.org/r/948081 (https://phabricator.wikimedia.org/T340448) [07:37:48] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, and 2 others: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (10MoritzMuehlenhoff) >>! In T300955#9086089, @ayounsi wrote: > @MoritzMuehlenhoff is it ok to bump the RAM from 4G to 6G on the rpki* VMs? https://netbox.wikimedia.org/v... [07:37:59] (03PS1) 10David Caro: prometheus: fix typo in job name [puppet] - 10https://gerrit.wikimedia.org/r/948083 (https://phabricator.wikimedia.org/T343885) [07:38:23] (03CR) 10CI reject: [V: 04-1] prometheus: fix typo in job name [puppet] - 10https://gerrit.wikimedia.org/r/948083 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [07:38:49] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [07:39:26] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Implement better filter on BGP_Customer_out - https://phabricator.wikimedia.org/T340448 (10ayounsi) a:03ayounsi [07:41:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P50466 and previous config saved to /var/cache/conftool/dbconfig/20230811-074105-ladsgroup.json [07:42:31] (03PS1) 10David Caro: cloudlb: allow access to haproxy stats from prometheus 
[puppet] - 10https://gerrit.wikimedia.org/r/948084 (https://phabricator.wikimedia.org/T343885) [07:42:51] (03PS4) 10Muehlenhoff: Add base nftables sets which are equivalent to the main Ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) [07:46:14] (03PS2) 10David Caro: cloudlb: allow access to haproxy stats from prometheus [puppet] - 10https://gerrit.wikimedia.org/r/948084 (https://phabricator.wikimedia.org/T343885) [07:46:47] (03CR) 10CI reject: [V: 04-1] cloudlb: allow access to haproxy stats from prometheus [puppet] - 10https://gerrit.wikimedia.org/r/948084 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [07:47:55] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM rpki2002.codfw.wmnet [07:48:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P50467 and previous config saved to /var/cache/conftool/dbconfig/20230811-074803-ladsgroup.json [07:48:12] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@23938.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:48:24] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, and 2 others: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (10ops-monitoring-bot) VM rpki2002.codfw.wmnet rebooted by ayounsi@cumin1001 with reason: None [07:48:55] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - 
https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [07:49:20] RECOVERY - Disk space on config-master1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=config-master1001&var-datasource=eqiad+prometheus/ops [07:50:08] (03PS3) 10David Caro: cloudlb: allow access to haproxy stats from prometheus [puppet] - 10https://gerrit.wikimedia.org/r/948084 (https://phabricator.wikimedia.org/T343885) [07:51:26] PROBLEM - confd service on config-master2001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:51:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM rpki2002.codfw.wmnet [07:51:55] (03PS4) 10David Caro: cloudlb: allow access to haproxy stats from prometheus [puppet] - 10https://gerrit.wikimedia.org/r/948084 (https://phabricator.wikimedia.org/T343885) [07:52:26] PROBLEM - confd service on config-master1001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:53:14] (03CR) 10Muehlenhoff: [C: 03+2] Add base nftables sets which are equivalent to the main Ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [07:54:08] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM rpki1001.eqiad.wmnet [07:54:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, and 2 others: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (10ops-monitoring-bot) VM rpki1001.eqiad.wmnet rebooted by ayounsi@cumin1001 with reason: bump ram to 6g [07:55:15] (03CR) 10Ayounsi: "RAM bumped to 6G on both hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/948079 (https://phabricator.wikimedia.org/T300955) (owner: 10Ayounsi) [07:55:26] (03PS5) 10David Caro: cloudlb: allow access to haproxy stats from prometheus [puppet] - 10https://gerrit.wikimedia.org/r/948084 (https://phabricator.wikimedia.org/T343885) [07:55:36] (03PS3) 10Muehlenhoff: firewall: Make more Ferm-specific setup conditional to the ferm provider [puppet] - 10https://gerrit.wikimedia.org/r/945557 (https://phabricator.wikimedia.org/T336497) [07:56:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P50468 and previous config saved to /var/cache/conftool/dbconfig/20230811-075612-ladsgroup.json [07:56:35] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:57:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM rpki1001.eqiad.wmnet [07:58:49] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:01:35] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:02:56] 10SRE-swift-storage, 10collaboration-services: Puppet run fails on gitlab-prod-1002.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T344042 (10Jelto) [08:03:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P50469 and previous config saved to /var/cache/conftool/dbconfig/20230811-080309-ladsgroup.json [08:04:04] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on ml-serve2001.codfw.wmnet with reason: Expand the kubelet disk partition [08:04:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ml-serve2001.codfw.wmnet with reason: Expand the kubelet disk partition [08:04:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945557 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:05:14] RECOVERY - Disk space on config-master2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=config-master2001&var-datasource=codfw+prometheus/ops [08:11:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T342617)', diff saved to https://phabricator.wikimedia.org/P50470 and previous config saved to /var/cache/conftool/dbconfig/20230811-081118-ladsgroup.json [08:11:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [08:11:23] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [08:11:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [08:11:35] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:11:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T342617)', diff saved to 
https://phabricator.wikimedia.org/P50471 and previous config saved to /var/cache/conftool/dbconfig/20230811-081139-ladsgroup.json [08:16:35] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:18:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T342617)', diff saved to https://phabricator.wikimedia.org/P50472 and previous config saved to /var/cache/conftool/dbconfig/20230811-081815-ladsgroup.json [08:18:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [08:18:20] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [08:18:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [08:27:04] (03CR) 10David Caro: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/948083 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [08:28:45] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42836/console" [puppet] - 10https://gerrit.wikimedia.org/r/948084 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [08:31:48] !log restart kubelet on ml-serve1001 - T343900 [08:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:53] T343900: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 [08:32:09] !log expand kubelet partition on ml-serve2001 - T339231 [08:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:32:12] T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 [08:34:39] !log installing intel-microcode security updates on bookworm/bullseye [08:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:27] 10sre-alert-triage, 10Machine-Learning-Team: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (10elukey) 05Open→03Resolved a:03elukey After the kubelet restart the metric cleared! [08:46:07] (03PS1) 10David Caro: dns::dotls: expose and gather haproxy internal metrics [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) [08:47:56] (03PS2) 10David Caro: dns::dotls: expose and gather haproxy internal metrics [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) [08:49:26] (03CR) 10David Caro: dns::dotls: expose and gather haproxy internal metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [08:50:13] (03PS3) 10David Caro: dns::dotls: expose and gather haproxy internal metrics [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) [08:50:21] (03CR) 10David Caro: dns::dotls: expose and gather haproxy internal metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [08:51:02] (03PS1) 10Elukey: admin_ng: bump cpu limits for calico to avoid throttling in ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/948088 [08:53:54] (03CR) 10Klausman: [C: 03+1] admin_ng: bump cpu limits for calico to avoid throttling in ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/948088 (owner: 10Elukey) [08:56:04] (03CR) 10Elukey: [C: 03+2] admin_ng: bump cpu limits for calico to avoid 
throttling in ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/948088 (owner: 10Elukey) [08:57:22] (03CR) 10Cathal Mooney: [C: 03+2] Depool esams for duration of esams -> knams migration [dns] - 10https://gerrit.wikimedia.org/r/947945 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [08:59:21] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:59:43] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:00:27] (03CR) 10David Caro: dns::dotls: expose and gather haproxy internal metrics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [09:00:38] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [09:00:39] !log depool esams site until next week for knams POP migration / rebuild [09:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:57] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [09:03:49] (03PS1) 10Btullis: Add the lockfile-progs package require by hdfs-balancer [puppet] - 10https://gerrit.wikimedia.org/r/948089 (https://phabricator.wikimedia.org/T344045) [09:04:47] (03CR) 10Jbond: [C: 03+1] admin: add user tsev to group restricted [puppet] - 10https://gerrit.wikimedia.org/r/947957 (https://phabricator.wikimedia.org/T343596) (owner: 10Eevans) [09:05:08] (03PS1) 10EoghanGaffney: gitlab: Add default value for thanos swift key [puppet] - 10https://gerrit.wikimedia.org/r/948090 (https://phabricator.wikimedia.org/T344042) [09:05:53] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[09:06:10] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42839/console" [puppet] - 10https://gerrit.wikimedia.org/r/948089 (https://phabricator.wikimedia.org/T344045) (owner: 10Btullis) [09:06:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [09:06:53] (03PS2) 10Btullis: Add the lockfile-progs package required by hdfs-balancer [puppet] - 10https://gerrit.wikimedia.org/r/948089 (https://phabricator.wikimedia.org/T344045) [09:07:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/948079 (https://phabricator.wikimedia.org/T300955) (owner: 10Ayounsi) [09:08:29] (03CR) 10Muehlenhoff: [C: 03+1] "I wonder how this ever worked before; on an-laucher1002 there's no other package with a dependency on lockfile-progs, so it's also not a c" [puppet] - 10https://gerrit.wikimedia.org/r/948089 (https://phabricator.wikimedia.org/T344045) (owner: 10Btullis) [09:10:37] (03CR) 10Btullis: [C: 03+2] Add the lockfile-progs package required by hdfs-balancer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948089 (https://phabricator.wikimedia.org/T344045) (owner: 10Btullis) [09:11:50] (03PS1) 10Elukey: admin_ng: increase resources for calico on wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/948091 (https://phabricator.wikimedia.org/T343900) [09:14:24] (03PS1) 10David Caro: thumbor: expose and fetch metrics from haproxy internal endpoint [puppet] - 10https://gerrit.wikimedia.org/r/948092 (https://phabricator.wikimedia.org/T343885) [09:16:03] (03CR) 10David Caro: thumbor: expose and fetch metrics from haproxy internal endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948092 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [09:17:09] (03CR) 10CI reject: [V: 04-1] thumbor: expose and fetch metrics from haproxy internal endpoint [puppet] - 
10https://gerrit.wikimedia.org/r/948092 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [09:17:21] (03PS2) 10David Caro: thumbor: expose and fetch metrics from haproxy internal endpoint [puppet] - 10https://gerrit.wikimedia.org/r/948092 (https://phabricator.wikimedia.org/T343885) [09:17:23] (03CR) 10David Caro: thumbor: expose and fetch metrics from haproxy internal endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948092 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [09:19:58] (03CR) 10CI reject: [V: 04-1] thumbor: expose and fetch metrics from haproxy internal endpoint [puppet] - 10https://gerrit.wikimedia.org/r/948092 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [09:25:18] (03CR) 10Cathal Mooney: [C: 03+1] "makes sense!" [homer/public] - 10https://gerrit.wikimedia.org/r/948081 (https://phabricator.wikimedia.org/T340448) (owner: 10Ayounsi) [09:27:10] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, as long as the filter works with the confed AS but I think the as-path test is valid." [homer/public] - 10https://gerrit.wikimedia.org/r/947993 (https://phabricator.wikimedia.org/T334530) (owner: 10Ayounsi) [09:27:25] (03CR) 10Cathal Mooney: [C: 03+1] esams/knams: stop anycast advertisments [homer/public] - 10https://gerrit.wikimedia.org/r/947856 (owner: 10Ayounsi) [09:27:27] (03CR) 10Cathal Mooney: [C: 03+2] esams/knams: stop anycast advertisments [homer/public] - 10https://gerrit.wikimedia.org/r/947856 (owner: 10Ayounsi) [09:28:01] (03Merged) 10jenkins-bot: esams/knams: stop anycast advertisments [homer/public] - 10https://gerrit.wikimedia.org/r/947856 (owner: 10Ayounsi) [09:28:43] (03CR) 10JMeybohm: [C: 03+1] admin_ng: increase resources for calico on wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/948091 (https://phabricator.wikimedia.org/T343900) (owner: 10Elukey) [09:29:13] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! 
I like the approach overall with the various modules looks easy to work with." [homer/public] - 10https://gerrit.wikimedia.org/r/940867 (https://phabricator.wikimedia.org/T320638) (owner: 10Ayounsi) [09:31:06] !log Withdrawing anycast prefixes 198.35.27.0/24 (authdns), 185.71.138.0/24 & 2001:67c:930::/48 (wikidough) from esams/knams in BGP [09:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:08] (03CR) 10Ayounsi: [C: 03+2] Routinator: use tmpfs for cache directory [puppet] - 10https://gerrit.wikimedia.org/r/948079 (https://phabricator.wikimedia.org/T300955) (owner: 10Ayounsi) [09:33:34] (03PS4) 10David Caro: dns::dotls: expose and gather haproxy internal metrics [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) [09:33:36] (03CR) 10David Caro: dns::dotls: expose and gather haproxy internal metrics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [09:34:11] (03CR) 10David Caro: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/948092 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [09:34:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T342617)', diff saved to https://phabricator.wikimedia.org/P50473 and previous config saved to /var/cache/conftool/dbconfig/20230811-093412-ladsgroup.json [09:34:16] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [09:34:27] (03CR) 10Muehlenhoff: [C: 03+2] firewall: Make more Ferm-specific setup conditional to the ferm provider [puppet] - 10https://gerrit.wikimedia.org/r/945557 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:35:34] (03CR) 10Hnowlan: "Thanks for the work on this - unfortunately it'll probably be replaced by this change which is being merged next week https://gerrit.wikim" [puppet] - 10https://gerrit.wikimedia.org/r/948092 
(https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [09:37:18] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS38930/IPv4: Idle - Fiberring https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:39:11] (03PS1) 10Jbond: configmaster: add python3-conftool [puppet] - 10https://gerrit.wikimedia.org/r/948093 (https://phabricator.wikimedia.org/T341717) [09:40:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42843/console" [puppet] - 10https://gerrit.wikimedia.org/r/948093 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [09:40:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [09:41:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [09:41:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T342617)', diff saved to https://phabricator.wikimedia.org/P50474 and previous config saved to /var/cache/conftool/dbconfig/20230811-094118-ladsgroup.json [09:41:28] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [09:41:35] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:42:02] (03CR) 10Jbond: [V: 03+1 C: 03+2] configmaster: add python3-conftool [puppet] - 10https://gerrit.wikimedia.org/r/948093 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [09:43:15] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable add a link in 11th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948094 
(https://phabricator.wikimedia.org/T308136) [09:44:17] (03CR) 10Sergio Gimeno: [C: 04-1] "Scheduled for Wednesday August 16, T308136#9084142." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948094 (https://phabricator.wikimedia.org/T308136) (owner: 10Sergio Gimeno) [09:45:36] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 527 probes of 779 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:48:20] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,fstrim.service,logrotate.service,man-db.service,systemd-tmpfiles-clean.service,user-runtime-dir@23938.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:42] * jbond you should be silent ... [09:49:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P50475 and previous config saved to /var/cache/conftool/dbconfig/20230811-094918-ladsgroup.json [09:49:38] maybe the alert changed, in which case it may alert again. Have you considered using the hiera key instead? 
[09:49:55] (03CR) 10David Caro: thumbor: expose and fetch metrics from haproxy internal endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948092 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [09:50:16] (03Abandoned) 10David Caro: thumbor: expose and fetch metrics from haproxy internal endpoint [puppet] - 10https://gerrit.wikimedia.org/r/948092 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [09:50:52] (03CR) 10David Caro: [V: 03+1 C: 03+2] prometheus: gather stats from haproxy for openstack and cloudlb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [09:50:55] jynus: i had forgotten to do it in icinga as well, should all be done now. fyi i don't think the key works for alertmanager but could be wrong [09:51:05] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T342806) (owner: 10JHathaway) [09:51:06] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 9 probes of 779 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:51:32] jbond: it should- but if it doesn't we should ask it to obs [09:51:35] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:52:23] * jynus wondering the differences between harpoxy and haproxy [09:52:58] harpoxy sounds like a terrible disease [09:56:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T342617)', diff saved to https://phabricator.wikimedia.org/P50476 and previous config saved to
/var/cache/conftool/dbconfig/20230811-095651-ladsgroup.json [09:56:56] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [09:58:47] (03PS3) 10Muehlenhoff: firewall: Ship a base profile for the nftables provider [puppet] - 10https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) [09:59:03] (03CR) 10CI reject: [V: 04-1] firewall: Ship a base profile for the nftables provider [puppet] - 10https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:59:06] RECOVERY - confd service on config-master1001 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:01:35] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:04:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P50477 and previous config saved to /var/cache/conftool/dbconfig/20230811-100424-ladsgroup.json [10:05:28] (03CR) 10Jcrespo: [C: 03+1] prometheus: fix typo in job name [puppet] - 10https://gerrit.wikimedia.org/r/948083 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [10:05:55] (03CR) 10David Caro: [C: 03+2] prometheus: fix typo in job name [puppet] - 10https://gerrit.wikimedia.org/r/948083 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [10:05:57] (03PS1) 10Jbond: confd: use a timer to clean old files instead of tidy [puppet] - 10https://gerrit.wikimedia.org/r/948096 
(https://phabricator.wikimedia.org/T341717) [10:07:10] (03PS6) 10Muehlenhoff: Extend the firewall::service shim with checks for legacy syntax [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) [10:08:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] CI: Bail out if admin_ng build fails completely [deployment-charts] - 10https://gerrit.wikimedia.org/r/947865 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [10:09:24] (03PS2) 10Jbond: confd: use a timer to clean old files instead of tidy [puppet] - 10https://gerrit.wikimedia.org/r/948096 (https://phabricator.wikimedia.org/T341717) [10:10:31] (03PS1) 10Muehlenhoff: firewall: Ship a base profile for the nftables provider [puppet] - 10https://gerrit.wikimedia.org/r/948097 (https://phabricator.wikimedia.org/T336497) [10:11:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42845/console" [puppet] - 10https://gerrit.wikimedia.org/r/948096 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [10:11:35] (JobUnavailable) resolved: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:11:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P50478 and previous config saved to /var/cache/conftool/dbconfig/20230811-101157-ladsgroup.json [10:14:38] (03CR) 10Muehlenhoff: "Looks good, two nits/questions inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/948096 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [10:15:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:28] there is some prometheus puppet syntax error ongoing, I think [10:16:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/948097 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:16:54] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10darthmon_wmde) >>! In T342968#9085732, @Eevans wrote: > @darthmon_wmde this seems to be the same key used to access Wikimedia Cloud Services. Could you please generate a separ... [10:19:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T342617)', diff saved to https://phabricator.wikimedia.org/P50479 and previous config saved to /var/cache/conftool/dbconfig/20230811-101930-ladsgroup.json [10:19:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [10:19:35] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [10:19:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [10:19:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:20:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:20:08] (03PS3) 10Jbond: confd: use a timer to clean old files instead of tidy [puppet] - 
10https://gerrit.wikimedia.org/r/948096 (https://phabricator.wikimedia.org/T341717) [10:20:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1221 (T342617)', diff saved to https://phabricator.wikimedia.org/P50480 and previous config saved to /var/cache/conftool/dbconfig/20230811-102009-ladsgroup.json [10:20:11] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/948096 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [10:22:28] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10darthmon_wmde) [10:24:14] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10darthmon_wmde) I copied it on my wiki profile https://www.mediawiki.org/wiki/User:Monica_Pinedo_Bajo_(WMDE) [10:24:19] (03PS1) 10David Caro: openstack: use the haproxy internal stat for alerts [alerts] - 10https://gerrit.wikimedia.org/r/948098 (https://phabricator.wikimedia.org/T343885) [10:26:01] (03CR) 10CI reject: [V: 04-1] openstack: use the haproxy internal stat for alerts [alerts] - 10https://gerrit.wikimedia.org/r/948098 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [10:27:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P50481 and previous config saved to /var/cache/conftool/dbconfig/20230811-102704-ladsgroup.json [10:37:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/948096 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [10:40:40] (03CR) 10Umherirrender: Downgrade Parsoid in wmf.20 (031 comment) [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947989 (https://phabricator.wikimedia.org/T344032) (owner: 10Tim Starling) [10:42:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 
(T342617)', diff saved to https://phabricator.wikimedia.org/P50482 and previous config saved to /var/cache/conftool/dbconfig/20230811-104210-ladsgroup.json [10:42:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:42:14] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [10:42:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:42:35] (03CR) 10Jbond: [C: 03+2] confd: use a timer to clean old files instead of tidy [puppet] - 10https://gerrit.wikimedia.org/r/948096 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [10:43:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2005.codfw.wmnet [10:49:06] (03PS2) 10Stevemunene: idp_test: add datahub_staging as a OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/944231 (https://phabricator.wikimedia.org/T305874) [10:49:28] (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:49:46] * jbond looking [10:50:40] (03PS1) 10Jbond: Revert "confd: use a timer to clean old files instead of tidy" [puppet] - 10https://gerrit.wikimedia.org/r/948106 [10:50:46] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "confd: use a timer to clean old files instead of tidy" [puppet] - 10https://gerrit.wikimedia.org/r/948106 (owner: 10Jbond) [10:52:07] (03PS1) 10Jbond: confd: use a timer to clean old files instead of tidy [puppet] - 10https://gerrit.wikimedia.org/r/948107 (https://phabricator.wikimedia.org/T341717) [10:52:30] (03CR) 10CI reject: [V: 04-1] confd: use a timer to clean old files instead of tidy [puppet] - 10https://gerrit.wikimedia.org/r/948107 
(https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [10:53:20] (03PS1) 10David Caro: labweb: use a valid host for the probes [puppet] - 10https://gerrit.wikimedia.org/r/948102 [10:53:26] (03PS2) 10Jbond: confd: use a timer to clean old files instead of tidy [puppet] - 10https://gerrit.wikimedia.org/r/948107 (https://phabricator.wikimedia.org/T341717) [10:53:47] (03CR) 10CI reject: [V: 04-1] confd: use a timer to clean old files instead of tidy [puppet] - 10https://gerrit.wikimedia.org/r/948107 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [10:54:39] (03PS3) 10Jbond: confd: use a timer to clean old files instead of tidy [puppet] - 10https://gerrit.wikimedia.org/r/948107 (https://phabricator.wikimedia.org/T341717) [10:55:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42846/console" [puppet] - 10https://gerrit.wikimedia.org/r/948107 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [10:55:57] (03PS2) 10David Caro: labweb: use a valid host for the probes [puppet] - 10https://gerrit.wikimedia.org/r/948102 [10:56:55] (03CR) 10David Caro: [V: 03+1 C: 03+2] cloudlb: allow access to haproxy stats from prometheus [puppet] - 10https://gerrit.wikimedia.org/r/948084 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [10:57:00] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:58:32] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:59:18] (03CR) 10Jbond: [C: 03+2] netbox: expand device schema to include optional platform [puppet] - 10https://gerrit.wikimedia.org/r/931930 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [11:01:55] (03CR) 10Jcrespo: [C: 03+1] netbox: expand device schema to include optional platform [puppet] - 10https://gerrit.wikimedia.org/r/931930 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [11:02:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2005.codfw.wmnet [11:03:51] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:04:28] (WidespreadPuppetFailure) resolved: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:07:56] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42847/console" [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [11:11:35] (JobUnavailable) firing: Reduced availability for job cloudlb_haproxy in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:13:51] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:13:56] (03CR) 10Jbond: [C: 03+2] confd: use a timer to clean old files instead of tidy [puppet] - 10https://gerrit.wikimedia.org/r/948107 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [11:16:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T342617)', diff saved to https://phabricator.wikimedia.org/P50484 and previous config saved to /var/cache/conftool/dbconfig/20230811-111631-ladsgroup.json [11:16:35] (JobUnavailable) resolved: (2) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:16:36] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [11:18:29] (03CR) 10Jbond: [C: 03+1] "LGTM you will need to also update the private repo to add the secret" [puppet] - 10https://gerrit.wikimedia.org/r/944231 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [11:21:33] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS bullseye [11:24:14] (03PS1) 10AikoChou: ml-services: update revert-risk images and model binary [deployment-charts] - 10https://gerrit.wikimedia.org/r/948103 (https://phabricator.wikimedia.org/T340813) [11:26:23] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on 16 hosts with reason: Downtime esams network kit prior to migration week. 
[11:26:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:26:47] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on 16 hosts with reason: Downtime esams network kit prior to migration week. [11:26:54] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=38da099c-b11e-4c93-9a6a-c2e187e2ce56) set by cmooney@cumin1001 for 10 days, 0:00:00 on 16 host(s) and their services wi... [11:31:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P50485 and previous config saved to /var/cache/conftool/dbconfig/20230811-113138-ladsgroup.json [11:35:31] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on 29 hosts with reason: Downtime esams hosts prior to migration week. [11:35:59] (03CR) 10Klausman: [C: 03+1] ml-services: update revert-risk images and model binary (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948103 (https://phabricator.wikimedia.org/T340813) (owner: 10AikoChou) [11:36:04] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on 29 hosts with reason: Downtime esams hosts prior to migration week. [11:36:10] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9ceb85f0-e62a-4650-9f82-d05546b417db) set by cmooney@cumin1001 for 10 days, 0:00:00 on 29 host(s) and their services wi... 
[11:39:24] (03PS1) 10David Caro: cloudlb: move to wmcs prometheus [puppet] - 10https://gerrit.wikimedia.org/r/948104 (https://phabricator.wikimedia.org/T343885) [11:41:07] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4045.ulsfo.wmnet with reason: host reimage [12:04:04] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10good first task: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (10ayounsi) 05Open→03Resolved All done! [12:04:09] PROBLEM - Check systemd state on mw2367 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [12:05:02] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4045.ulsfo.wmnet with OS bullseye [12:05:03] (03CR) 10JMeybohm: [C: 03+1] Update cxserver to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947805 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [12:05:42] (03CR) 10JMeybohm: [C: 03+1] Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [12:06:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:06:47] (SystemdUnitFailed) firing: (2) clean-confd-rundir.service Failed on elastic2077:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:08:13] 
RECOVERY - config-master.wikimedia.org tls expiry on config-master1001 is OK: OK - Certificate config-master.wikimedia.org will expire on Tue 29 Aug 2023 05:24:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:10:21] RECOVERY - config-master.wikimedia.org tls expiry on config-master2001 is OK: OK - Certificate config-master.wikimedia.org will expire on Tue 29 Aug 2023 05:28:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:10:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [12:12:13] PROBLEM - Check systemd state on install5002 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:05] PROBLEM - Check systemd state on wdqs2016 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:14:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:17:45] (03CR) 10Muehlenhoff: [C: 03+2] "(Was reviewed in 945573)" [puppet] - 10https://gerrit.wikimedia.org/r/948097 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:22:17] (03Abandoned) 10Muehlenhoff: firewall: Ship a base profile for the nftables provider [puppet] - 10https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:23:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/948105 (owner: 10Jbond) [12:25:17] (03PS1) 10Jbond: pybal: update script to use python3 [puppet] - 10https://gerrit.wikimedia.org/r/948127 
(https://phabricator.wikimedia.org/T341717) [12:26:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42850/console" [puppet] - 10https://gerrit.wikimedia.org/r/948127 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [12:26:36] (03PS2) 10JMeybohm: CI: Bail out if admin_ng build fails completely [deployment-charts] - 10https://gerrit.wikimedia.org/r/947865 (https://phabricator.wikimedia.org/T343978) [12:26:38] (03PS4) 10JMeybohm: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) [12:26:40] (03PS1) 10JMeybohm: Remove limits in ResourceQuota and container limitanges for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) [12:30:07] RECOVERY - Check systemd state on elastic2077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:19] RECOVERY - Check systemd state on backup1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:29] RECOVERY - Check systemd state on db2108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:29] RECOVERY - Check systemd state on install5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:35] RECOVERY - Check systemd state on ganeti4008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:49] RECOVERY - Check systemd state on wdqs2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:01] RECOVERY - Check systemd state 
on mw2367 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:31] RECOVERY - Check systemd state on dispatch-be1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:42] (SystemdUnitFailed) firing: (2) clean-confd-rundir.service Failed on elastic2077:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:32:58] (03PS2) 10Jbond: pybal: update script to use python3 [puppet] - 10https://gerrit.wikimedia.org/r/948127 (https://phabricator.wikimedia.org/T341717) [12:33:09] PROBLEM - Check systemd state on ganeti2029 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:23] PROBLEM - Check systemd state on aux-k8s-worker1002 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:43] PROBLEM - Check systemd state on kubemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:07] PROBLEM - Check systemd state on kafka-jumbo1007 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:41] (03PS3) 10Jbond: pybal: update script to use python3 [puppet] - 10https://gerrit.wikimedia.org/r/948127 (https://phabricator.wikimedia.org/T341717) [12:36:42] (SystemdUnitFailed) resolved: (2) clean-confd-rundir.service Failed on elastic2077:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:36:54] * jbond looking [12:41:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:45:09] RECOVERY - Check systemd state on ganeti2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:23] RECOVERY - Check systemd state on aux-k8s-worker1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:41] RECOVERY - Check systemd state on kubemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:53] (03PS1) 10Muehlenhoff: Add ganeti config for knams [puppet] - 10https://gerrit.wikimedia.org/r/948129 [12:46:03] RECOVERY - Check systemd state on kafka-jumbo1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:14] (03PS2) 10Muehlenhoff: Add ganeti config for knams [puppet] - 10https://gerrit.wikimedia.org/r/948129 [12:46:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:46:30] (03CR) 10Elukey: [C: 03+1] "Checked 
images on the registry and binary on swift, all good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948103 (https://phabricator.wikimedia.org/T340813) (owner: 10AikoChou) [12:47:23] (03CR) 10JMeybohm: "I left mw-misc alone for now. Do you think we should remove the limits there as well for consistency?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [12:51:47] (03PS1) 10Muehlenhoff: Extend ganeti Netbox sync for new knams hosts [puppet] - 10https://gerrit.wikimedia.org/r/948130 [12:54:44] (03CR) 10JMeybohm: "This increases the limits for mw namespaces in staging as well. But as we don't deploy there I'd say we don't need to bother" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [12:54:54] (03PS5) 10JMeybohm: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) [12:54:56] (03PS2) 10JMeybohm: Remove limits in ResourceQuota and container limitanges for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) [12:58:26] (03CR) 10Jaime Nuche: "This change is ready for review." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947814 (https://phabricator.wikimedia.org/T343447) (owner: 10Jaime Nuche) [12:58:48] (03CR) 10Ssingh: [C: 03+1] nginx: ensure we manage the nginx buffer directory before mountint [puppet] - 10https://gerrit.wikimedia.org/r/948105 (owner: 10Jbond) [13:01:07] !log fabfur@cumin1001 conftool action : set/pooled=yes; selector: name=cp4045.ulsfo.wmnet [13:01:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:03:07] (03PS1) 10Jbond: config-master: add conftool::client [puppet] - 10https://gerrit.wikimedia.org/r/948133 (https://phabricator.wikimedia.org/T341717) [13:03:44] (03PS1) 10Krinkle: selenium: Migrate wdio tests away from deprecated `@wdio/sync` mode [extensions/ProofreadPage] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948108 [13:05:18] (03CR) 10Jbond: [C: 03+2] config-master: add conftool::client [puppet] - 10https://gerrit.wikimedia.org/r/948133 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [13:05:27] (03PS2) 10Jbond: config-master: add conftool::client [puppet] - 10https://gerrit.wikimedia.org/r/948133 (https://phabricator.wikimedia.org/T341717) [13:06:18] (03CR) 10Jbond: [V: 03+1 C: 03+2] nginx: ensure we manage the nginx buffer directory before mountint [puppet] - 10https://gerrit.wikimedia.org/r/948105 (owner: 10Jbond) [13:06:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:08:53] (03PS4) 10Stevemunene: airflow-wmde: create analytics-wmde user for airflow [puppet] - 
10https://gerrit.wikimedia.org/r/947714 [13:10:22] (03CR) 10CDanis: [C: 03+1] aux: add grpc/http ports for jaeger collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/946551 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [13:11:22] (03CR) 10CDanis: [C: 03+1] "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/946518 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [13:11:44] 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10Jhancock.wm) 05Open→03Resolved [13:11:55] (03CR) 10Krinkle: [C: 03+2] selenium: Migrate wdio tests away from deprecated `@wdio/sync` mode [extensions/ProofreadPage] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948108 (owner: 10Krinkle) [13:12:22] yay, the WikimediaDebug version with wikifunctions.org support got released apparently \o/ [13:12:31] (03CR) 10AikoChou: [C: 03+2] "Thanks! :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948103 (https://phabricator.wikimedia.org/T340813) (owner: 10AikoChou) [13:12:35] (or made it through firefox review, or whatever else was blocking it, I’m not very familiar with the process ^^) [13:13:11] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:15] (03Merged) 10jenkins-bot: ml-services: update revert-risk images and model binary [deployment-charts] - 10https://gerrit.wikimedia.org/r/948103 (https://phabricator.wikimedia.org/T340813) (owner: 10AikoChou) [13:13:17] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Jhancock.wm) @Marostegui the part has arrived. Please let me know when it is safe to do the swap. 
[13:13:25] Lucas_WMDE: and chrome is trying to keep me safe, heh :) https://usercontent.irccloud-cdn.com/file/Snmfftq2/image.png [13:13:40] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/948129 (owner: 10Muehlenhoff) [13:14:43] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM - assuming having both 'esams' and 'esams01' defined and pointing to same url is no issue." [puppet] - 10https://gerrit.wikimedia.org/r/948130 (owner: 10Muehlenhoff) [13:14:48] urbanecm: firefox too, that’s how I noticed it ^^ [13:14:53] urbanecm: firefox does the same [13:14:55] although it didn’t disable the extension [13:14:58] (I think) [13:15:07] so that’s interesting that chrome does that o_O [13:15:20] I think firefox installed the new version but didn’t grant it more permissions until I confirmed [13:15:28] which sounds reasonable to me [13:15:30] makes sense [13:15:34] in my case, extension disappeared [13:18:37] (03CR) 10Elukey: [C: 03+1] pybal: update script to use python3 [puppet] - 10https://gerrit.wikimedia.org/r/948127 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [13:18:46] (03CR) 10CI reject: [V: 04-1] selenium: Migrate wdio tests away from deprecated `@wdio/sync` mode [extensions/ProofreadPage] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948108 (owner: 10Krinkle) [13:19:11] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Marostegui) @Jhancock.wm I just powered down the host. You can proceed whenever you want. Thank you! [13:21:06] (03PS1) 10Krinkle: tests: Temporarily disable automatic running of Wdio tests in CI [extensions/Wikibase] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948109 (https://phabricator.wikimedia.org/T344032) [13:21:36] (03CR) 10Krinkle: [V: 03+2 C: 03+2] "Forcing CI, given multiple backports required to unbreak CI." 
[extensions/Wikibase] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948109 (https://phabricator.wikimedia.org/T344032) (owner: 10Krinkle) [13:22:15] (03PS1) 10Krinkle: tests: Temporarily disable automatic running of Wdio tests in CI [extensions/WikibaseLexeme] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948110 [13:22:27] (03CR) 10Krinkle: [V: 03+2 C: 03+2] "Forcing CI, given multiple backports required to unbreak CI" [extensions/WikibaseLexeme] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948110 (owner: 10Krinkle) [13:22:38] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:22:54] (03CR) 10Krinkle: [C: 03+2] Downgrade Parsoid in wmf.20 [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947989 (https://phabricator.wikimedia.org/T344032) (owner: 10Tim Starling) [13:23:15] (03CR) 10Krinkle: [C: 03+2] selenium: Migrate wdio tests away from deprecated `@wdio/sync` mode [extensions/ProofreadPage] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948108 (owner: 10Krinkle) [13:23:23] 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) >>! In T341546#9083786, @Jhancock.wm wrote: > @MoritzMuehlenhoff you nailed it. Got that updated for you. Can you confirm that it's working as expected now? Thanks, I can confirm the NIC is now present. [13:23:38] 10SRE-swift-storage, 10collaboration-services: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10eoghan) We have successfully transferred the first CI artefacts from the test server over to remote storage! ` # gitlab-rake gitlab:artifacts:migrate I, [2023-08-11T13:20:20.670... 
[13:24:12] (03CR) 10Krinkle: Downgrade Parsoid in wmf.20 (031 comment) [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947989 (https://phabricator.wikimedia.org/T344032) (owner: 10Tim Starling) [13:24:36] (03CR) 10Jbond: [C: 03+2] pybal: update script to use python3 [puppet] - 10https://gerrit.wikimedia.org/r/948127 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [13:24:59] (03PS1) 10Ssingh: hiera: remove dns300[1-2] from authdns_servers and NTP pool [puppet] - 10https://gerrit.wikimedia.org/r/948134 (https://phabricator.wikimedia.org/T329219) [13:26:15] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42852/console" [puppet] - 10https://gerrit.wikimedia.org/r/948134 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:26:59] (03CR) 10CI reject: [V: 04-1] selenium: Migrate wdio tests away from deprecated `@wdio/sync` mode [extensions/ProofreadPage] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948108 (owner: 10Krinkle) [13:27:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:27:58] (03CR) 10Krinkle: [C: 03+2] "Wikibase patches landed meanwhile." 
[extensions/ProofreadPage] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948108 (owner: 10Krinkle) [13:32:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:33:37] (03CR) 10Ssingh: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/948134/42854/cp3050.esams.wmnet/change.cp3050.esams.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/948134 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:37:54] (03PS1) 10Ssingh: durum: add explicit require on acme_chief cert for nginx [puppet] - 10https://gerrit.wikimedia.org/r/948135 [13:38:54] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42855/console" [puppet] - 10https://gerrit.wikimedia.org/r/948135 (owner: 10Ssingh) [13:39:06] (03PS1) 10Elukey: changeprop: allow retries for liftwing streams with 502 responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/948136 [13:39:45] (03CR) 10CI reject: [V: 04-1] Downgrade Parsoid in wmf.20 [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947989 (https://phabricator.wikimedia.org/T344032) (owner: 10Tim Starling) [13:40:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T342617)', diff saved to https://phabricator.wikimedia.org/P50492 and previous config saved to /var/cache/conftool/dbconfig/20230811-134030-ladsgroup.json [13:40:37] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:42:48] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[13:43:40] (03PS2) 10Ssingh: hiera: remove dns300[1-2] from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/948134 (https://phabricator.wikimedia.org/T329219) [13:44:39] (03Merged) 10jenkins-bot: selenium: Migrate wdio tests away from deprecated `@wdio/sync` mode [extensions/ProofreadPage] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/948108 (owner: 10Krinkle) [13:44:48] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42856/console" [puppet] - 10https://gerrit.wikimedia.org/r/948134 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:44:50] (03CR) 10Krinkle: "recheck" [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947989 (https://phabricator.wikimedia.org/T344032) (owner: 10Tim Starling) [13:46:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:47:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [13:47:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [13:48:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T342617)', diff saved to https://phabricator.wikimedia.org/P50493 and previous config saved to /var/cache/conftool/dbconfig/20230811-134804-ladsgroup.json [13:48:08] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:49:15] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:49:37] PROBLEM 
- mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:50:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:51:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:51:41] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:52:05] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:52:27] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:54:46] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948094 (https://phabricator.wikimedia.org/T308136) (owner: 10Sergio Gimeno) [13:55:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P50494 and previous config saved to /var/cache/conftool/dbconfig/20230811-135537-ladsgroup.json [13:57:44] (03CR) 10JHathaway: [C: 03+2] site.pp: Drop top level domain names: .wmnet .org [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T342806) (owner: 10JHathaway) [13:59:30] (03PS1) 10Giuseppe Lavagetto: httpd-fcgi: de-quote unicode characters in logs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935) [14:01:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:03:47] RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:04:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:06:47] (03PS1) 10Ssingh: hiera: authdns_addrs: do not duplicate ns2-v4 IP [puppet] - 
10https://gerrit.wikimedia.org/r/948142 (https://phabricator.wikimedia.org/T329219) [14:07:44] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Jhancock.wm) 05Open→03Resolved @Marostegui I swapped out fan2. it's been powered up for 15+ minutes now and hasn't thrown an error. It looks like it's fixed so I'm gonna close the ticket. but pl... [14:09:10] (03PS2) 10Ssingh: hiera: authdns_addrs: do not duplicate ns2-v4 IP [puppet] - 10https://gerrit.wikimedia.org/r/948142 (https://phabricator.wikimedia.org/T329219) [14:10:18] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42857/console" [puppet] - 10https://gerrit.wikimedia.org/r/948142 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [14:10:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P50496 and previous config saved to /var/cache/conftool/dbconfig/20230811-141043-ladsgroup.json [14:11:11] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Marostegui) Thank you - I just started mariadb but will leave the host depooled until Monday, just in case. 
[14:11:35] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:35] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:58] (03CR) 10Muehlenhoff: Extend ganeti Netbox sync for new knams hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948130 (owner: 10Muehlenhoff) [14:17:22] (03CR) 10Ssingh: [V: 03+1] "cumin:O:dnsbox output for all DNS boxes:" [puppet] - 10https://gerrit.wikimedia.org/r/948142 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [14:17:50] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Jhancock.wm) [14:18:40] (03PS1) 10Elukey: ml-services: improve concurrency settings for drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/948143 (https://phabricator.wikimedia.org/T344058) [14:19:02] (03PS2) 10Sergio Gimeno: GrowthExperiments: enable add a link in 11th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948094 (https://phabricator.wikimedia.org/T308136) [14:19:49] (03CR) 10Sergio Gimeno: GrowthExperiments: enable add a link in 11th round of wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948094 (https://phabricator.wikimedia.org/T308136) (owner: 10Sergio Gimeno) [14:21:31] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2009.codfw.wmnet with OS bullseye 
[14:21:38] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2008.codfw.wmnet with OS bullseye [14:23:59] (03CR) 10Urbanecm: [C: 03+1] "thanks. lgtm 😊." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948094 (https://phabricator.wikimedia.org/T308136) (owner: 10Sergio Gimeno) [14:25:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T342617)', diff saved to https://phabricator.wikimedia.org/P50500 and previous config saved to /var/cache/conftool/dbconfig/20230811-142550-ladsgroup.json [14:25:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [14:25:54] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:26:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [14:26:06] (03CR) 10Elukey: [C: 03+2] ml-services: improve concurrency settings for drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/948143 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey) [14:26:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T342617)', diff saved to https://phabricator.wikimedia.org/P50501 and previous config saved to /var/cache/conftool/dbconfig/20230811-142611-ladsgroup.json [14:26:13] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Volunteer NDA for RhinosF1 - https://phabricator.wikimedia.org/T341272 (10RhinosF1) >>! In T341272#8998430, @KFrancis wrote: > Hi all, Let me do some research and get back to you! Thanks!!! Was this done @KFrancis ? 
[14:27:38] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/948135 (owner: 10Ssingh) [14:27:46] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable add a link in 12th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) [14:28:13] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: add explicit require on acme_chief cert for nginx [puppet] - 10https://gerrit.wikimedia.org/r/948135 (owner: 10Ssingh) [14:28:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum1001.eqiad.wmnet with OS bookworm [14:29:04] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:29:27] 10SRE, 10SRE-Access-Requests, 10Security: WMF Prod key used in WMCS - https://phabricator.wikimedia.org/T344059 (10RhinosF1) [14:29:36] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:31:10] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[14:31:44] (03CR) 10Klausman: [C: 03+1] ml-services: improve concurrency settings for drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/948143 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey) [14:32:08] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:33:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:34:08] BFD/BGP alerts in eqiad expected [14:34:15] well, alert [14:35:45] (03CR) 10Klausman: [C: 03+1] ml-services: update revert-risk images and model binary (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948103 (https://phabricator.wikimedia.org/T340813) (owner: 10AikoChou) [14:37:17] (03CR) 10JHathaway: [C: 03+2] Enforce using a node regex without the wmnet tld [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 (owner: 10JHathaway) [14:37:29] (03CR) 10Elukey: [C: 03+1] ml-services: update revert-risk images and model binary (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948103 (https://phabricator.wikimedia.org/T340813) (owner: 10AikoChou) [14:38:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:38:41] 10SRE, 10SRE-Access-Requests, 10Security: WMF Prod key used in WMCS - https://phabricator.wikimedia.org/T344059 (10pfischer) Sorry for that, here my new one: ` ssh-ed25519 
AAAAC3NzaC1lZDI1NTE5AAAAIFjit1wuGOgmJNp2JMHRcy6LyfJWXwjacns04JXyXyLA pfischer production ` [14:40:23] (03CR) 10Bking: [C: 03+2] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/948111 (https://phabricator.wikimedia.org/T344059) (owner: 10RhinosF1) [14:40:40] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10Security: WMF Prod key used in WMCS - https://phabricator.wikimedia.org/T344059 (10RhinosF1) p:05Triage→03High [14:40:56] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2008.codfw.wmnet with reason: host reimage [14:41:48] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2009.codfw.wmnet with reason: host reimage [14:42:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage [14:43:34] 10SRE, 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10Jhancock.wm) [14:44:13] 10SRE, 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10Jhancock.wm) servers have been physically relabeled. [14:44:27] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2008.codfw.wmnet with reason: host reimage [14:45:05] (03PS1) 10Jbond: README.release: update release guide [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/948147 [14:45:10] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10Security: WMF Prod key used in WMCS - https://phabricator.wikimedia.org/T344059 (10RhinosF1) @pfischer: replaced by @bking and will rollout within 30 minutes. Please test the new key and confirm working by marking the task as resolved. 
[14:45:30] 10SRE, 10ops-codfw, 10decommission-hardware, 10fundraising-tech-ops: decommission frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T342694 (10Jhancock.wm) 05Open→03Resolved [14:45:39] (03PS1) 10JHathaway: 1.1.3: prepare release [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/948148 [14:46:17] 10SRE, 10ops-codfw, 10decommission-hardware, 10fundraising-tech-ops: decommission frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T340156 (10Jhancock.wm) 05Open→03Resolved [14:47:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage [14:47:21] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10Eevans) >>! In T342968#9086433, @darthmon_wmde wrote: >>>! In T342968#9085732, @Eevans wrote: >> @darthmon_wmde this seems to be the same key used to access Wikimedia Cloud Ser... 
[14:48:06] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/948148 (owner: 10JHathaway) [14:48:15] (03CR) 10JHathaway: [C: 03+2] 1.1.3: prepare release [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/948148 (owner: 10JHathaway) [14:48:52] (03PS2) 10Jbond: README.release: update release guide [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/948147 [14:49:07] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2009.codfw.wmnet with reason: host reimage [14:53:31] (03PS3) 10Jbond: README.release: update release guide [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/948147 [14:53:48] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs[2008-2009].codfw.wmnet with reason: T343124 [14:53:51] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs[2008-2009].codfw.wmnet with reason: T343124 [14:53:55] T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 [14:57:50] (03PS6) 10Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - 10https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648) [14:58:56] (03PS4) 10Jbond: README.release: update release guide [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/948147 [15:01:04] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: f1a6177 [15:01:09] !log bking@deploy1002 deploy aborted: f1a6177 (duration: 00m 05s) [15:01:33] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: f1a6177 [15:02:24] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: f1a6177 (duration: 00m 50s) [15:03:32] 10SRE-OnFire, 10Discovery-Search (Current work), 10Sustainability: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater - 
https://phabricator.wikimedia.org/T337801 (10bking) [15:03:53] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2008.codfw.wmnet with OS bullseye [15:05:19] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [15:05:50] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:06:26] (03PS5) 10Jbond: README.release: update release guide [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/948147 [15:06:30] (03PS1) 10JHathaway: puppet-lint-wmf_styleguide-check: bump to 1.1.3 [puppet] - 10https://gerrit.wikimedia.org/r/948152 (https://phabricator.wikimedia.org/T342806) [15:07:07] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2009.codfw.wmnet with OS bullseye [15:07:20] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/948152 (https://phabricator.wikimedia.org/T342806) (owner: 10JHathaway) [15:07:27] (03CR) 10JHathaway: [C: 03+2] puppet-lint-wmf_styleguide-check: bump to 1.1.3 [puppet] - 10https://gerrit.wikimedia.org/r/948152 (https://phabricator.wikimedia.org/T342806) (owner: 10JHathaway) [15:08:02] (03PS6) 10Jbond: README.release: update release guide [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/948147 [15:08:13] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS additions esams move. - cmooney@cumin1001" [15:08:55] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS additions esams move. - cmooney@cumin1001" [15:08:55] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:15:32] 10SRE, 10SRE-Access-Requests, 10Security: WMF Prod key used in WMCS - https://phabricator.wikimedia.org/T344059 (10pfischer) I can confirm access with my new SSH key. Thank you! 
[15:15:41] (03PS7) 10Jbond: README.release: update release guide [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/948147 [15:16:05] 10SRE, 10SRE-Access-Requests, 10Security: WMF Prod key used in WMCS - https://phabricator.wikimedia.org/T344059 (10RhinosF1) 05Open→03Resolved [15:17:28] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: f1a6177 [15:18:10] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: f1a6177 (duration: 00m 42s) [15:18:17] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [15:23:30] !log bking@deploy1002 'deploying WDQS on newly-reimaged Bullseye hosts T343124' [15:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:34] T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 [15:24:11] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:24:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T342617)', diff saved to https://phabricator.wikimedia.org/P50502 and previous config saved to /var/cache/conftool/dbconfig/20230811-152433-ladsgroup.json [15:24:38] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:26:20] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS additions esams move. - cmooney@cumin1001" [15:27:06] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS additions esams move. 
- cmooney@cumin1001" [15:27:06] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:29:41] (03CR) 10Ahmon Dancy: [C: 03+1] releases jenkins: allow Scap to disable services on secondary hosts [puppet] - 10https://gerrit.wikimedia.org/r/947814 (https://phabricator.wikimedia.org/T343447) (owner: 10Jaime Nuche) [15:30:21] (03CR) 10Cathal Mooney: [C: 03+1] "Looks good to my untrained eyes! PCC changes seem fine overall." [puppet] - 10https://gerrit.wikimedia.org/r/948142 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:36:23] (03PS8) 10Jbond: README.release: update release guide [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/948147 [15:37:01] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 [15:37:05] T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 [15:37:23] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 22s) [15:39:28] 10SRE, 10ops-codfw, 10decommission-hardware, 10fundraising-tech-ops: decommission frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T340156 (10Papaul) interfaces are disable ` papaul@fasw-c-codfw# run show interfaces ge-[0-1]/0/15 descriptions Interface Admin Link Description ge-0/0/... [15:39:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P50503 and previous config saved to /var/cache/conftool/dbconfig/20230811-153941-ladsgroup.json [15:41:18] 10SRE, 10SRE-Access-Requests, 10Security: WMF Prod key used in WMCS - https://phabricator.wikimedia.org/T344059 (10bking) Confirmed @pfischer 's identity via Slack conversation. 
[15:45:00] 10SRE, 10ops-codfw, 10decommission-hardware, 10fundraising-tech-ops: decommission frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T342694 (10Papaul) interfaces disable ` papaul@fasw-c-codfw# run show interfaces descriptions ge-[0-1]/0/16 Interface Admin Link Description ge-0/0/16... [15:54:15] (03PS1) 10Jbond: bird: recursively manage the log file [puppet] - 10https://gerrit.wikimedia.org/r/948157 [15:54:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P50504 and previous config saved to /var/cache/conftool/dbconfig/20230811-155447-ladsgroup.json [15:54:50] (03PS2) 10Jbond: bird: recursively manage the log file [puppet] - 10https://gerrit.wikimedia.org/r/948157 [15:54:56] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [15:55:56] (03CR) 10Ssingh: [C: 03+1] "Thanks for the CR and for catching it! We will try with a fresh reimage to confirm." 
[puppet] - 10https://gerrit.wikimedia.org/r/948157 (owner: 10Jbond) [15:56:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42859/console" [puppet] - 10https://gerrit.wikimedia.org/r/948157 (owner: 10Jbond) [15:56:20] (03CR) 10Jbond: [C: 03+2] bird: recursively manage the log file [puppet] - 10https://gerrit.wikimedia.org/r/948157 (owner: 10Jbond) [15:58:31] PROBLEM - Thanos swift https on thanos-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [15:58:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:58:45] PROBLEM - SSH on thanos-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:58:49] PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:59:53] RECOVERY - Thanos swift https on thanos-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Thanos [16:00:07] RECOVERY - SSH on thanos-fe1003 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:00:11] RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:03:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:04:24] (03CR) 10JHathaway: [C: 03+1] "thanks!" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/948147 (owner: 10Jbond) [16:04:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T342617)', diff saved to https://phabricator.wikimedia.org/P50505 and previous config saved to /var/cache/conftool/dbconfig/20230811-160453-ladsgroup.json [16:04:59] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:06:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:09:32] 10SRE, 10serviceops, 10MediaWiki-Platform-Team (Radar): k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (10Krinkle) [16:09:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T342617)', diff saved to https://phabricator.wikimedia.org/P50506 and previous config saved to /var/cache/conftool/dbconfig/20230811-160953-ladsgroup.json [16:09:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:09:59] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:10:15] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:10:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:10:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T342617)', diff saved to https://phabricator.wikimedia.org/P50507 and previous config saved to /var/cache/conftool/dbconfig/20230811-161025-ladsgroup.json [16:11:15] (03PS1) 10Jbond: bird: create dummy anycast-prefix files [puppet] - 10https://gerrit.wikimedia.org/r/948158 [16:11:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:11:50] (03CR) 10CI reject: [V: 04-1] bird: create dummy anycast-prefix files [puppet] - 10https://gerrit.wikimedia.org/r/948158 (owner: 10Jbond) [16:15:01] (03CR) 10Krinkle: [C: 03+2] "Confirmed to match what's in mediawiki-vendor for this wmf branch." 
[core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947989 (https://phabricator.wikimedia.org/T344032) (owner: 10Tim Starling) [16:15:39] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum1001.eqiad.wmnet with OS bookworm [16:16:01] (03PS2) 10Jbond: bird: create dummy anycast-prefix files [puppet] - 10https://gerrit.wikimedia.org/r/948158 [16:16:34] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: remove dns300[1-2] from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/948134 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [16:16:41] (03CR) 10CI reject: [V: 04-1] bird: create dummy anycast-prefix files [puppet] - 10https://gerrit.wikimedia.org/r/948158 (owner: 10Jbond) [16:17:36] !log running agent on A:cumin or A:dns-rec or A:netbox to remove dns300x from authdns_servers: T329219 [16:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:40] T329219: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 [16:18:00] (03CR) 10Jbond: [C: 03+2] README.release: update release guide [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/948147 (owner: 10Jbond) [16:19:08] (03PS1) 10Jbond: test wmf_stylguide update [puppet] - 10https://gerrit.wikimedia.org/r/948159 [16:19:46] (03CR) 10CI reject: [V: 04-1] test wmf_stylguide update [puppet] - 10https://gerrit.wikimedia.org/r/948159 (owner: 10Jbond) [16:20:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P50508 and previous config saved to /var/cache/conftool/dbconfig/20230811-161959-ladsgroup.json [16:23:38] !log running dummy authdns-update [16:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:07] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: authdns_addrs: do not duplicate ns2-v4 IP [puppet] - 10https://gerrit.wikimedia.org/r/948142 
(https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [16:25:39] 10SRE, 10Infrastructure-Foundations, 10netops: Add per-output queue graphing for Juniper network devices in LibreNMS - https://phabricator.wikimedia.org/T326322 (10ayounsi) Next steps here: * Decide which hosts will run gnmic, I can think of 4 options: ** netflowXXXX (my preferred option, as already monitori... [16:26:39] (03PS3) 10JHathaway: dev env: hiera data [puppet] - 10https://gerrit.wikimedia.org/r/928672 (https://phabricator.wikimedia.org/T337972) [16:26:55] 10SRE, 10Infrastructure-Foundations, 10netops: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (10ayounsi) [16:27:03] !log running agent on A:dns-rec to remove ns2-v4 IP: T329219 [16:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:07] T329219: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 [16:30:16] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928672 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:30:32] (03Merged) 10jenkins-bot: Downgrade Parsoid in wmf.20 [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947989 (https://phabricator.wikimedia.org/T344032) (owner: 10Tim Starling) [16:32:28] !log running dummy authdns-update [16:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:22] PROBLEM - Auth DNS #page on ns2-v4 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [16:33:28] oh uh [16:33:32] this should have been removed [16:33:52] can someone ACK it [16:33:58] it's expected [16:34:17] ack [16:34:18] the second removal should have taken care of it [16:34:48] I'm downtiming until Monday [16:34:59] just a second, sorry [16:35:01] want to check something [16:35:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2153', diff saved to https://phabricator.wikimedia.org/P50510 and previous config saved to /var/cache/conftool/dbconfig/20230811-163506-ladsgroup.json [16:35:22] (03PS1) 10Ayounsi: Revert "prometheus::ops: add demo node exporter job for SONiC" [puppet] - 10https://gerrit.wikimedia.org/r/948112 (https://phabricator.wikimedia.org/T335027) [16:35:35] (03CR) 10CI reject: [V: 04-1] Revert "prometheus::ops: add demo node exporter job for SONiC" [puppet] - 10https://gerrit.wikimedia.org/r/948112 (https://phabricator.wikimedia.org/T335027) (owner: 10Ayounsi) [16:35:38] sukhe: ack waiting [16:35:42] thank you [16:35:45] checking [16:36:42] (03PS1) 10Ayounsi: Revert "mgmt: allow prometheus" [homer/public] - 10https://gerrit.wikimedia.org/r/948113 (https://phabricator.wikimedia.org/T326322) [16:36:50] (03CR) 10CI reject: [V: 04-1] Revert "mgmt: allow prometheus" [homer/public] - 10https://gerrit.wikimedia.org/r/948113 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [16:37:44] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, 10observability: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (10ayounsi) After more investigation, I'm going to roll out gNMIc for more real life testing. As it's multi-platform and should export the... [16:38:05] /Stage[main]/Icinga/Nagios_service[alert1001 ns2-v4]/ensure [16:38:06] removed [16:38:14] wondering why it didn't get removed though [16:38:57] I mean we can downtime it sure but it should not be even there [16:40:35] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:41:30] can someone restart apache on lists1001, it is renewed - long standing bug [16:41:32] cdanis: are you on-call? 
asking because topic says batphone but I didn't get the page [16:41:41] sukhe: don't trust the topic :( [16:41:45] oh sorry [16:41:47] I see that you are [16:41:54] umm that's interesting [16:41:55] ok, please forgive me if it pages again [16:41:57] klaxon isn't showing anyone oncall [16:42:04] sigh [16:42:06] nothing to worry at all [16:42:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:42:19] (03PS7) 10Andrew Bogott: wmcs-backup: move image_info into the ImageBackupsState class [puppet] - 10https://gerrit.wikimedia.org/r/946644 [16:42:21] (03PS9) 10Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - 10https://gerrit.wikimedia.org/r/946643 [16:42:23] (03PS4) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [16:42:25] (03PS1) 10Andrew Bogott: Remove backup settings from a bunch of cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/948160 [16:42:27] (03PS1) 10Andrew Bogott: backy2: make backup dir configurable [puppet] - 10https://gerrit.wikimedia.org/r/948161 (https://phabricator.wikimedia.org/T344065) [16:43:18] sukhe: did you run puppet on alert2001 [16:43:21] yep [16:43:29] 1001 as well [16:43:32] hmm [16:43:41] it even shows the resource removed [16:43:46] 12:38:05 < sukhe> /Stage[main]/Icinga/Nagios_service[alert1001 ns2-v4]/ensure [16:43:49] 12:38:06 < sukhe> removed [16:43:51] this [16:46:02] did icinga reload its configuration successfully [16:46:23] (03CR) 10CI reject: [V: 04-1] wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [16:46:51] (03CR) 10Andrew Bogott: [C: 03+2] Remove backup settings from a 
bunch of cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/948160 (owner: 10Andrew Bogott) [16:47:46] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-backup: move image_info into the ImageBackupsState class [puppet] - 10https://gerrit.wikimedia.org/r/946644 (owner: 10Andrew Bogott) [16:47:47] yeah, seems to be all fine there [16:47:54] just can't see why this persists [16:47:55] looking [16:48:55] (03CR) 10CI reject: [V: 04-1] backy2: make backup dir configurable [puppet] - 10https://gerrit.wikimedia.org/r/948161 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [16:50:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T342617)', diff saved to https://phabricator.wikimedia.org/P50511 and previous config saved to /var/cache/conftool/dbconfig/20230811-165013-ladsgroup.json [16:50:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [16:50:17] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:50:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [16:50:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T342617)', diff saved to https://phabricator.wikimedia.org/P50512 and previous config saved to /var/cache/conftool/dbconfig/20230811-165033-ladsgroup.json [16:51:22] (03PS1) 10Eevans: admin: new ssh key for user darthmon [puppet] - 10https://gerrit.wikimedia.org/r/948164 (https://phabricator.wikimedia.org/T342968) [16:52:14] (03CR) 10Cathal Mooney: [C: 03+2] Policy and definition updates for post-migration esams ranges [homer/public] - 10https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [16:52:28] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [16:52:32] (03CR) 10Eevans: [C: 03+2] 
admin: new ssh key for user darthmon [puppet] - 10https://gerrit.wikimedia.org/r/948164 (https://phabricator.wikimedia.org/T342968) (owner: 10Eevans) [16:52:33] we seem to have only two hosts in O:alerting_host anyway [16:52:42] * sukhe thoroughly confused now [16:52:50] (03Merged) 10jenkins-bot: Policy and definition updates for post-migration esams ranges [homer/public] - 10https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [16:53:00] (03PS10) 10Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - 10https://gerrit.wikimedia.org/r/946643 [16:53:02] (03PS2) 10Andrew Bogott: backy2: make backup dir configurable [puppet] - 10https://gerrit.wikimedia.org/r/948161 (https://phabricator.wikimedia.org/T344065) [16:53:04] (03PS5) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [16:54:06] (03CR) 10Andrew Bogott: [C: 03+1] labweb: use a valid host for the probes [puppet] - 10https://gerrit.wikimedia.org/r/948102 (owner: 10David Caro) [16:54:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10Eevans) 05Open→03Resolved Ok, this is done; Thanks! 
`lang=sh-session eevans@mwmaint1002:~$ cross-validate-accounts eevans@mwmaint1002:~$ ` [16:55:54] (03CR) 10Andrew Bogott: add volumes functionality to wmcs-backup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/946643 (owner: 10Andrew Bogott) [16:56:41] (03PS11) 10Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - 10https://gerrit.wikimedia.org/r/946643 [16:56:43] (03PS3) 10Andrew Bogott: backy2: make backup dir configurable [puppet] - 10https://gerrit.wikimedia.org/r/948161 (https://phabricator.wikimedia.org/T344065) [16:56:45] (03PS6) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [17:00:36] (03CR) 10CI reject: [V: 04-1] backy2: make backup dir configurable [puppet] - 10https://gerrit.wikimedia.org/r/948161 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [17:00:38] (03CR) 10CI reject: [V: 04-1] wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [17:01:16] (03PS1) 10Ssingh: dnsrecursor: update authdns address [puppet] - 10https://gerrit.wikimedia.org/r/948167 [17:04:22] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42860/console" [puppet] - 10https://gerrit.wikimedia.org/r/948167 (owner: 10Ssingh) [17:06:18] (03CR) 10Cathal Mooney: [C: 03+1] "Makes sense yep" [puppet] - 10https://gerrit.wikimedia.org/r/948167 (owner: 10Ssingh) [17:06:31] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsrecursor: update authdns address [puppet] - 10https://gerrit.wikimedia.org/r/948167 (owner: 10Ssingh) [17:07:58] !log running agent on dns-rec to remove old ns2 IP [17:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:15] PROBLEM - BGP status on 
cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:09:31] well that's not nice [17:13:52] !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough [17:15:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:16:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 9.298 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:17:19] (03PS4) 10Andrew Bogott: backy2: make backup dir configurable [puppet] - 10https://gerrit.wikimedia.org/r/948161 (https://phabricator.wikimedia.org/T344065) [17:17:21] (03PS7) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [17:17:24] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [17:17:35] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:17:45] this is expected [17:19:11] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:20:11] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=97) rolling restart_daemons on A:wikidough and A:wikidough [17:22:04] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/948159 (owner: 10Jbond) [17:26:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:31:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:32:44] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 [17:32:48] T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 [17:33:28] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 44s) [17:35:27] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:37:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:43:24] !log removing routing for former ns2.wikimedia.org IP 91.198.174.239 from esams CRs T343942 [17:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:09] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [17:48:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T342617)', diff saved to https://phabricator.wikimedia.org/P50513 and previous config saved to /var/cache/conftool/dbconfig/20230811-174851-ladsgroup.json [17:48:59] T342617: Make old columns of externallinks nullable - 
https://phabricator.wikimedia.org/T342617 [17:51:05] so ns2-v4 down is expected [17:51:14] the only question that remains though is how to get rid of this check [17:52:33] and we care about that for two reasons 1) so that it doesn't alert again as this will be down for a while [17:52:41] 2) if we are missing some other configuration somewhere [17:54:26] can we not run a downtime for that host? [17:54:32] (03CR) 10Jdlrobson: [C: 03+1] "(on the assumption the idea is to try this out, collect data and revert the approach makes sense to me!)" [extensions/WikimediaEvents] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947912 (https://phabricator.wikimedia.org/T343944) (owner: 10Krinkle) [17:54:41] topranks: already downtimed [17:54:48] I think with sre.hosts.downtime --force "ns2-v4" [17:54:54] oh ok [17:55:06] what I meant was that I think that downtiming is a temporary fix in that sense [17:55:28] yeah well we want it back again post-migration [17:55:40] but the question is why is it firing I guess?
[17:55:42] yes [17:56:05] If we've updated it to check 198.35.27.27 then the ping shouldn't fail [17:56:22] the check for 198.35 is already there as part of the anycast check [17:56:26] independent of this [17:56:29] that check is nsa-v4 [17:56:41] and right, ns2 should ping against the new IP too [17:56:49] and if we haven't we gotta work out how to do that before we change again to 185.15.59.231 [17:57:10] yeah, like we could leave it downtimed, cos we're checking nsa IP separately [17:57:22] but we need to know how to change the IP that check is pinging for when it changes again [17:57:46] yeah that's what is bothering me [17:57:47] can't see it [17:58:32] I am wondering if the ns2 IP is hardcoded somewhere else [17:58:35] which is another concern I have [18:02:02] !log reload icinga on alert1001 [18:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:47] ok [18:02:49] that cleared it up [18:02:50] topranks: ^ [18:02:59] nice! [18:03:10] i was going mad on this icinga server, didn't appear to be in the conf files at all [18:03:12] all good [18:03:12] phew [18:03:22] but I guess that explains it - for some reason service didn't restart following removal [18:03:23] yeah I even manually grepped :] [18:03:52] you smart people that have any other way - that was first thing I did :D [18:03:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P50514 and previous config saved to /var/cache/conftool/dbconfig/20230811-180358-ladsgroup.json [18:05:58] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:06:40] cdanis: all clear on the alerts. 
any alert from now on is a real alert fwiw :) [18:06:46] sorry for the noise, reloading icinga fixed it [18:08:48] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:09:22] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:10:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:10:34] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Volunteer NDA for RhinosF1 - https://phabricator.wikimedia.org/T341272 (10KFrancis) I'm still working on this, but in the meantime, I can process an NDA for you so you can get the access you need. Please email the following to kfrancis@wikimedia.org: -Full... [18:10:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:12:22] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS additions esams move. - cmooney@cumin1001" [18:13:03] (03CR) 10JHathaway: [C: 03+2] dev env: hiera data [puppet] - 10https://gerrit.wikimedia.org/r/928672 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [18:14:28] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS additions esams move. 
- cmooney@cumin1001" [18:14:28] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:16:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T342617)', diff saved to https://phabricator.wikimedia.org/P50515 and previous config saved to /var/cache/conftool/dbconfig/20230811-181649-ladsgroup.json [18:16:53] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:17:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum1002.eqiad.wmnet with OS bookworm [18:19:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P50516 and previous config saved to /var/cache/conftool/dbconfig/20230811-181904-ladsgroup.json [18:20:30] (03CR) 10Andrew Bogott: [C: 03+2] add volumes functionality to wmcs-backup [puppet] - 10https://gerrit.wikimedia.org/r/946643 (owner: 10Andrew Bogott) [18:20:39] (03CR) 10Andrew Bogott: [C: 03+2] backy2: make backup dir configurable [puppet] - 10https://gerrit.wikimedia.org/r/948161 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [18:21:46] (03PS8) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [18:22:16] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:23:10] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:24:27] (03CR) 10CI reject: [V: 04-1] wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [18:26:17] (03CR) 10Ssingh: Release 1.9-4 to 
target bullseye (031 comment) [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/946604 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:27:50] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Volunteer NDA for RhinosF1 - https://phabricator.wikimedia.org/T341272 (10RhinosF1) >>! In T341272#9087279, @KFrancis wrote: > I'm still working on this, but in the meantime, I can process an NDA for you so you can get the access you need. Please email the... [18:30:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T342617)', diff saved to https://phabricator.wikimedia.org/P50517 and previous config saved to /var/cache/conftool/dbconfig/20230811-183008-ladsgroup.json [18:30:22] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:31:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P50518 and previous config saved to /var/cache/conftool/dbconfig/20230811-183155-ladsgroup.json [18:31:56] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage [18:34:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T342617)', diff saved to https://phabricator.wikimedia.org/P50519 and previous config saved to /var/cache/conftool/dbconfig/20230811-183410-ladsgroup.json [18:34:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [18:34:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [18:34:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T342617)', diff saved to https://phabricator.wikimedia.org/P50520 and previous config saved to 
/var/cache/conftool/dbconfig/20230811-183431-ladsgroup.json [18:35:07] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage [18:37:05] (03PS9) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [18:37:07] (03PS1) 10Andrew Bogott: Clean up a few remants of role::wmcs::openstack::eqiad1::virt_ceph_and_backy [puppet] - 10https://gerrit.wikimedia.org/r/948178 [18:40:17] (03CR) 10CI reject: [V: 04-1] wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [18:41:38] (03PS2) 10Andrew Bogott: Clean up a few remants of role::wmcs::openstack::eqiad1::virt_ceph_and_backy [puppet] - 10https://gerrit.wikimedia.org/r/948178 [18:41:40] (03PS10) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [18:42:45] (03PS2) 10BCornwall: Release 1.9-4 to target Bookworm [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/946604 (https://phabricator.wikimedia.org/T342154) [18:42:55] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2010.codfw.wmnet with OS bullseye [18:43:07] (03CR) 10BCornwall: "How embarrassing" [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/946604 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:44:33] (03CR) 10CI reject: [V: 04-1] wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [18:45:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P50521 and previous config saved to /var/cache/conftool/dbconfig/20230811-184514-ladsgroup.json [18:46:27] (03CR) 10Andrew Bogott: [C: 03+2] Clean up a few remants of role::wmcs::openstack::eqiad1::virt_ceph_and_backy [puppet] - 10https://gerrit.wikimedia.org/r/948178 (owner: 10Andrew Bogott) [18:47:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P50522 and previous config saved to /var/cache/conftool/dbconfig/20230811-184701-ladsgroup.json [18:50:22] (03PS3) 10BCornwall: Release 9.1.4-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) [18:50:35] (03CR) 10CI reject: [V: 04-1] Release 9.1.4-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:52:07] (03PS4) 10BCornwall: Release 9.1.4-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) [18:52:20] (03CR) 10CI reject: [V: 04-1] Release 9.1.4-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:52:50] (03CR) 10BCornwall: Release 9.1.4-1wm2 to target Bookworm (032 comments) [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:54:25] (03CR) 10BCornwall: Release 9.1.4-1wm2 to target Bookworm (031 comment) [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:55:40] (03PS5) 10BCornwall: Release 9.1.4-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) [18:56:01] 
(03CR) 10CI reject: [V: 04-1] Release 9.1.4-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:59:25] (03CR) 10Ssingh: [C: 03+1] Release 1.9-4 to target Bookworm [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/946604 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [19:00:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P50523 and previous config saved to /var/cache/conftool/dbconfig/20230811-190021-ladsgroup.json [19:01:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:01:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:02:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T342617)', diff saved to https://phabricator.wikimedia.org/P50524 and previous config saved to /var/cache/conftool/dbconfig/20230811-190208-ladsgroup.json [19:02:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: 
Maintenance [19:02:12] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [19:02:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [19:03:12] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2010.codfw.wmnet with reason: host reimage [19:06:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:06:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum1002.eqiad.wmnet with OS bookworm [19:06:20] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2010.codfw.wmnet with reason: host reimage [19:14:53] PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [19:15:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T342617)', diff saved to https://phabricator.wikimedia.org/P50525 and previous config saved to /var/cache/conftool/dbconfig/20230811-191527-ladsgroup.json [19:15:29] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:15:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [19:15:31] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [19:15:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet 
with reason: Maintenance [19:15:44] (03CR) 10Jdlrobson: [C: 03+1] mediawiki.util: Investigate when mw.util is compromised by third-party script [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947913 (https://phabricator.wikimedia.org/T343944) (owner: 10Krinkle) [19:15:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T342617)', diff saved to https://phabricator.wikimedia.org/P50526 and previous config saved to /var/cache/conftool/dbconfig/20230811-191548-ladsgroup.json [19:16:29] (03CR) 10Eevans: [C: 03+2] admin: add user tsev to group restricted [puppet] - 10https://gerrit.wikimedia.org/r/947957 (https://phabricator.wikimedia.org/T343596) (owner: 10Eevans) [19:21:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10Eevans) 05Open→03Resolved Hi @Tsevener, this should now be done. I'm closing the ticket, but don't hesitate to reopen if you have any issues! [19:21:57] RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 25964 bytes in 0.331 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [19:33:14] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2010.codfw.wmnet with OS bullseye [19:34:43] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:37:15] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS additions esams move. - cmooney@cumin1001" [19:38:19] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS additions esams move. 
- cmooney@cumin1001" [19:38:19] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:44:25] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2011.codfw.wmnet with OS bullseye [20:01:51] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 [20:01:55] T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 [20:02:32] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:02:33] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 41s) [20:02:33] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:02:51] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:03:25] (03PS3) 10BCornwall: Release 1.9-4 to target Bookworm [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/946604 (https://phabricator.wikimedia.org/T342154) [20:04:19] (03CR) 10BCornwall: "Figured I'd fix the "extra" priority deprecation while I'm at it." 
[software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/946604 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [20:05:02] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2011.codfw.wmnet with reason: host reimage [20:08:12] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2011.codfw.wmnet with reason: host reimage [20:15:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T342617)', diff saved to https://phabricator.wikimedia.org/P50527 and previous config saved to /var/cache/conftool/dbconfig/20230811-201505-ladsgroup.json [20:15:18] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [20:18:23] 10SRE, 10Traffic, 10Incident Tooling: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (10BCornwall) Thank you for the clarification, @Vgutierrez. Do you have a suggestion on how to reconcile this? My instinct is to remove the abstraction entirely and ma... 
[20:18:36] 10SRE, 10Traffic, 10Incident Tooling: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (10BCornwall) 05Open→03Stalled [20:18:39] 10SRE, 10Incident Tooling: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10BCornwall) [20:20:52] (03PS3) 10Ssingh: bird: create dummy anycast-prefix files [puppet] - 10https://gerrit.wikimedia.org/r/948158 (owner: 10Jbond) [20:21:26] (03CR) 10CI reject: [V: 04-1] bird: create dummy anycast-prefix files [puppet] - 10https://gerrit.wikimedia.org/r/948158 (owner: 10Jbond) [20:22:07] (03PS1) 10Cwhite: profile: drop istio-proxy deprecated field warnings [puppet] - 10https://gerrit.wikimedia.org/r/947397 (https://phabricator.wikimedia.org/T344070) [20:22:33] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42864/console" [puppet] - 10https://gerrit.wikimedia.org/r/948158 (owner: 10Jbond) [20:24:27] (03PS4) 10Ssingh: bird: create dummy anycast-prefix files [puppet] - 10https://gerrit.wikimedia.org/r/948158 (owner: 10Jbond) [20:26:04] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42865/console" [puppet] - 10https://gerrit.wikimedia.org/r/948158 (owner: 10Jbond) [20:26:51] (03CR) 10Ssingh: [V: 03+1] "Did some cleanup, please check once again and let's merge on Monday and we can confirm by doing a reimage. Hopefully this it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/948158 (owner: 10Jbond) [20:27:49] (03CR) 10Cwhite: [V: 03+1 C: 03+2] "PCC OK: https://puppet-compiler.wmflabs.org/output/947397/42866/" [puppet] - 10https://gerrit.wikimedia.org/r/947397 (https://phabricator.wikimedia.org/T344070) (owner: 10Cwhite) [20:28:36] (03CR) 10Krinkle: "recheck" [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947914 (https://phabricator.wikimedia.org/T343407) (owner: 10Tim Starling) [20:30:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P50528 and previous config saved to /var/cache/conftool/dbconfig/20230811-203011-ladsgroup.json [20:31:17] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2011.codfw.wmnet with OS bullseye [20:40:09] (03CR) 10Dmaza: [C: 03+1] Set wikidiff2 maxSplitSize = 10 on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947986 (https://phabricator.wikimedia.org/T341754) (owner: 10Tim Starling) [20:43:22] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 [20:43:25] T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 [20:45:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P50529 and previous config saved to /var/cache/conftool/dbconfig/20230811-204517-ladsgroup.json [20:46:07] !log bking@deploy1002 deploy aborted: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 02m 44s) [20:46:08] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 [20:46:20] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 12s) [20:48:38] !log bking@cumin1001 START - Cookbook 
sre.wdqs.data-transfer [20:55:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T342617)', diff saved to https://phabricator.wikimedia.org/P50530 and previous config saved to /var/cache/conftool/dbconfig/20230811-205546-ladsgroup.json [20:55:51] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [21:00:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T342617)', diff saved to https://phabricator.wikimedia.org/P50531 and previous config saved to /var/cache/conftool/dbconfig/20230811-210024-ladsgroup.json [21:00:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [21:00:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [21:00:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [21:00:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [21:00:59] (03PS2) 10Cathal Mooney: Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) [21:01:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T342617)', diff saved to https://phabricator.wikimedia.org/P50532 and previous config saved to /var/cache/conftool/dbconfig/20230811-210102-ladsgroup.json [21:01:08] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [21:01:55] (03CR) 10CI reject: [V: 04-1] Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 
(https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [21:01:57] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:05:17] PROBLEM - Wikitech-static main page has content on cloudweb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [21:05:41] PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [21:06:35] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:06:37] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:08:56] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS additions esams move. - cmooney@cumin1001" [21:10:19] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS additions esams move. 
- cmooney@cumin1001" [21:10:19] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:10:31] (03PS3) 10Cathal Mooney: Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) [21:10:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P50533 and previous config saved to /var/cache/conftool/dbconfig/20230811-211053-ladsgroup.json [21:11:27] (03CR) 10CI reject: [V: 04-1] Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [21:15:47] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:16:00] (03PS4) 10Cathal Mooney: Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) [21:16:56] (03CR) 10CI reject: [V: 04-1] Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [21:17:04] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:20:37] (03PS5) 10Cathal Mooney: Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) [21:21:28] (03CR) 10CI reject: [V: 04-1] Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [21:26:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P50534 and previous config saved to /var/cache/conftool/dbconfig/20230811-212559-ladsgroup.json [21:37:00] (NodeTextfileStale) firing: Stale textfile for 
cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:39:33] (03PS6) 10Cathal Mooney: Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) [21:40:30] (03CR) 10CI reject: [V: 04-1] Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [21:41:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T342617)', diff saved to https://phabricator.wikimedia.org/P50535 and previous config saved to /var/cache/conftool/dbconfig/20230811-214105-ladsgroup.json [21:41:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [21:41:09] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [21:41:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [21:41:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [21:41:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [21:41:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T342617)', diff saved to https://phabricator.wikimedia.org/P50536 and previous config saved to /var/cache/conftool/dbconfig/20230811-214142-ladsgroup.json [21:45:18] (03PS7) 10Cathal Mooney: Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 
(https://phabricator.wikimedia.org/T343214) [21:46:14] (03CR) 10CI reject: [V: 04-1] Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [21:48:20] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:49:38] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:50:09] (03PS8) 10Cathal Mooney: Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) [21:51:02] (03CR) 10CI reject: [V: 04-1] Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [21:52:36] (03PS9) 10Cathal Mooney: Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) [21:53:29] (03CR) 10CI reject: [V: 04-1] Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [21:57:23] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [22:00:02] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [22:01:34] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [22:02:39] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:03:27] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS additions esams move. - cmooney@cumin1001" [22:04:36] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS additions esams move. 
- cmooney@cumin1001" [22:04:36] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:04:43] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [22:06:01] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:06:19] (PS10) Cathal Mooney: Reverse DNS includes for new ranges assigned to esams [dns] - https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) [22:30:31] RECOVERY - Wikitech-static main page has content on cloudweb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 25964 bytes in 9.994 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:35:01] PROBLEM - Wikitech-static main page has content on cloudweb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [22:39:47] RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 25982 bytes in 0.655 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:40:51] RECOVERY - Wikitech-static main page has content on cloudweb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 25982 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:45:33] PROBLEM - Wikitech-static main page has content on cloudweb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [22:45:59] PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [22:47:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T342617)', diff saved to https://phabricator.wikimedia.org/P50537 and previous config saved to /var/cache/conftool/dbconfig/20230811-224741-ladsgroup.json [22:47:45] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [22:49:48] !log bking@cumin1001 END 
(PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:57:47] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/project/view/2773/ [22:58:07] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/project/view/2773/ [22:58:55] RECOVERY - Wikitech-static main page has content on cloudweb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 25984 bytes in 0.239 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:59:17] RECOVERY - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is OK: SSL OK - Certificate status.wikimedia.org valid until 2023-11-07 09:20:34 +0000 (expires in 87 days) https://phabricator.wikimedia.org/project/view/2773/ [22:59:23] RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 25966 bytes in 0.311 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:59:37] RECOVERY - HTTPS-wikitech-static on wikitech-static.wikimedia.org is OK: SSL OK - Certificate status.wikimedia.org valid until 2023-11-07 09:20:34 +0000 (expires in 87 days) https://phabricator.wikimedia.org/project/view/2773/ [23:00:13] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:02:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P50538 and previous config saved to /var/cache/conftool/dbconfig/20230811-230247-ladsgroup.json [23:17:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P50539 and previous config saved to 
/var/cache/conftool/dbconfig/20230811-231753-ladsgroup.json [23:20:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T342617)', diff saved to https://phabricator.wikimedia.org/P50540 and previous config saved to /var/cache/conftool/dbconfig/20230811-232043-ladsgroup.json [23:20:47] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [23:33:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T342617)', diff saved to https://phabricator.wikimedia.org/P50541 and previous config saved to /var/cache/conftool/dbconfig/20230811-233259-ladsgroup.json [23:33:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [23:33:04] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [23:33:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [23:33:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1206 (T342617)', diff saved to https://phabricator.wikimedia.org/P50542 and previous config saved to /var/cache/conftool/dbconfig/20230811-233320-ladsgroup.json [23:35:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P50543 and previous config saved to /var/cache/conftool/dbconfig/20230811-233549-ladsgroup.json [23:36:30] (Traffic bill over quota) firing: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [23:50:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P50544 and previous config saved to /var/cache/conftool/dbconfig/20230811-235056-ladsgroup.json [23:56:30] (Traffic bill over 
quota) resolved: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota