[00:38:27] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4044 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:43:59] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4044 is OK: SSL OK - OCSP staple validity for wikipedia.org has 339361 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:53:13] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4042 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.org has 61607 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [00:55:03] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4042 is OK: SSL OK - OCSP staple validity for wikipedia.org has 338696 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:57:41] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4043 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:59:33] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4043 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 216027 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/HTTPS [01:08:35] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4044 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [01:10:27] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4044 is OK: SSL OK - OCSP staple validity for wikipedia.org has 
330572 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-05-24 08:07:08 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [01:18:45] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4038 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [01:20:37] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4038 is OK: SSL OK - OCSP staple validity for wikipedia.org has 337162 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [02:09:45] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:24:39] PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp4043 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [02:24:46] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:29] RECOVERY - HAProxy HTTPS wikiworkshop.org RSA on cp4043 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 380010 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (RSA) valid until 2023-03-30 14:08:36 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/HTTPS [02:44:01] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4043 is CRITICAL: SSL 
CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [02:45:53] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4043 is OK: SSL OK - OCSP staple validity for wikipedia.org has 332047 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [03:20:15] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4039 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [03:23:59] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4039 is OK: SSL OK - OCSP staple validity for wikipedia.org has 322560 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-05-24 08:07:08 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [03:33:15] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4039 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.org has 48405 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [03:35:05] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4039 is OK: SSL OK - OCSP staple validity for wikipedia.org has 321894 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-05-24 08:07:08 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [03:43:25] PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp4042 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [03:45:15] RECOVERY - HAProxy HTTPS wikiworkshop.org RSA on cp4042 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 375284 seconds left:Certificate wikiworkshop.org contains all required 
SANs:Certificate wikiworkshop.org (RSA) valid until 2023-03-30 14:08:36 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/HTTPS [03:47:51] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Krd) https://commons.wikimedia.org/wiki/... [03:50:07] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4042 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [03:51:57] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4042 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 205682 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/HTTPS [04:00:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [04:03:42] (CR) Santhosh: [C: +1] Content Translation: Adjust the global limit for unedited MT to 95% [mediawiki-config] - https://gerrit.wikimedia.org/r/891818 (https://phabricator.wikimedia.org/T330482) (owner: KartikMistry) [04:05:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [04:08:41] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4042 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [04:12:23] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4042 is OK: SSL OK - OCSP staple validity for wikipedia.org has 326856 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [04:32:25] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:36:31] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4044 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [04:38:21] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4044 is OK: SSL OK - OCSP staple validity for wikipedia.org has 325298 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [04:40:11] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:59:33] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4039 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [05:03:15] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4039 is OK: SSL OK - OCSP staple validity for wikipedia.org has 323805 seconds 
left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:05:01] PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp4042 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [05:06:51] RECOVERY - HAProxy HTTPS wikiworkshop.org RSA on cp4042 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 370388 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (RSA) valid until 2023-03-30 14:08:36 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:07:59] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4041 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [05:09:51] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4041 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 201008 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:24:02] . 
[05:28:07] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:30:59] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4040 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [05:32:51] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4040 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 199629 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:34:26] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:35:03] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.001e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [05:35:53] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4042 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [05:37:43] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4042 is OK: SSL OK - OCSP staple validity for wikipedia.org has 321736 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:40:19] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4040 is CRITICAL: SSL CRITICAL - failed to connect or SSL 
handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [05:42:11] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4040 is OK: SSL OK - OCSP staple validity for wikipedia.org has 314268 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-05-24 08:07:08 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:56:15] PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp4041 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [05:58:05] RECOVERY - HAProxy HTTPS wikiworkshop.org RSA on cp4041 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 367314 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (RSA) valid until 2023-03-30 14:08:36 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/HTTPS [06:03:37] !log rsync from dumpsdata1001 in ariel screen session of xmldatadumps/private to dumpsdata1007 (did this for 1006 about an hour ago, forgot to log), no bandwidth cap [06:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:29:07] !log rsync from dumpsdata1001 in ariel screen session of xmldatadumps/public to dumpsdata1007, no bandwidth cap [06:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:51] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4047 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.org has 40209 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [06:51:41] RECOVERY - HAProxy HTTPS 
wikipedia.org RSA on cp4047 is OK: SSL OK - OCSP staple validity for wikipedia.org has 317299 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [06:51:53] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4044 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [06:53:45] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4044 is OK: SSL OK - OCSP staple validity for wikipedia.org has 317175 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [06:54:24] (PS2) KartikMistry: Content Translation: Adjust the global limit for unedited MT to 95% [mediawiki-config] - https://gerrit.wikimedia.org/r/891818 (https://phabricator.wikimedia.org/T330482) [07:01:51] PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp4043 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [07:03:43] RECOVERY - HAProxy HTTPS wikiworkshop.org RSA on cp4043 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 363376 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (RSA) valid until 2023-03-30 14:08:36 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/HTTPS [07:06:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [07:07:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [07:07:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 
db1101.eqiad.wmnet with reason: Maintenance [07:08:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance [07:08:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T329203)', diff saved to https://phabricator.wikimedia.org/P44933 and previous config saved to /var/cache/conftool/dbconfig/20230306-070814-marostegui.json [07:08:21] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [07:11:15] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4040 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [07:13:05] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4040 is OK: SSL OK - OCSP staple validity for wikipedia.org has 316015 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [07:15:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2094.codfw.wmnet [07:15:05] (PS1) Marostegui: mariadb: Decommission db2094 [puppet] - https://gerrit.wikimedia.org/r/894358 (https://phabricator.wikimedia.org/T330828) [07:17:07] (CR) Marostegui: [C: +2] mariadb: Decommission db2094 [puppet] - https://gerrit.wikimedia.org/r/894358 (https://phabricator.wikimedia.org/T330828) (owner: Marostegui) [07:20:02] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [07:20:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1112.eqiad.wmnet with reason: Maintenance [07:21:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1112.eqiad.wmnet with reason: Maintenance [07:21:11] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:21:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:21:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T328817)', diff saved to https://phabricator.wikimedia.org/P44934 and previous config saved to /var/cache/conftool/dbconfig/20230306-072132-marostegui.json [07:21:39] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [07:22:10] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2094.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [07:23:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2094.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [07:23:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:23:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2094.codfw.wmnet [07:23:35] ops-codfw, DBA, decommission-hardware, Patch-For-Review: decommission db2094.codfw.wmnet - https://phabricator.wikimedia.org/T330828 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2094.codfw.wmnet` - db2094.codfw.wmnet (**WARN**) - Downtimed... 
[07:23:47] ops-codfw, DBA, decommission-hardware, Patch-For-Review: decommission db2094.codfw.wmnet - https://phabricator.wikimedia.org/T330828 (Marostegui) Ready for DCOps [07:24:00] ops-codfw, decommission-hardware: decommission db2094.codfw.wmnet - https://phabricator.wikimedia.org/T330828 (Marostegui) [07:24:10] ops-codfw, decommission-hardware: decommission db2094.codfw.wmnet - https://phabricator.wikimedia.org/T330828 (Marostegui) a:Marostegui→None [07:27:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T328817)', diff saved to https://phabricator.wikimedia.org/P44935 and previous config saved to /var/cache/conftool/dbconfig/20230306-072724-marostegui.json [07:27:31] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [07:31:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T329203)', diff saved to https://phabricator.wikimedia.org/P44936 and previous config saved to /var/cache/conftool/dbconfig/20230306-073119-marostegui.json [07:31:26] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [07:34:45] (JobUnavailable) resolved: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:36:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:37:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:37:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T329260)', diff saved to https://phabricator.wikimedia.org/P44937 
and previous config saved to /var/cache/conftool/dbconfig/20230306-073707-marostegui.json [07:37:14] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [07:41:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T329260)', diff saved to https://phabricator.wikimedia.org/P44938 and previous config saved to /var/cache/conftool/dbconfig/20230306-074125-marostegui.json [07:42:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P44939 and previous config saved to /var/cache/conftool/dbconfig/20230306-074231-marostegui.json [07:46:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P44940 and previous config saved to /var/cache/conftool/dbconfig/20230306-074626-marostegui.json [07:48:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2122', diff saved to https://phabricator.wikimedia.org/P44941 and previous config saved to /var/cache/conftool/dbconfig/20230306-074830-root.json [07:48:48] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4042 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.org has 33072 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [07:49:00] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 2984 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [07:50:18] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4042 is OK: SSL OK - OCSP staple validity for wikipedia.org has 306582 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-05-24 
08:07:08 +0000 (expires in 79 days) https://wikitech.wikimedia.org/wiki/HTTPS [07:56:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P44942 and previous config saved to /var/cache/conftool/dbconfig/20230306-075632-marostegui.json [07:57:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P44943 and previous config saved to /var/cache/conftool/dbconfig/20230306-075737-marostegui.json [07:58:11] (PS1) Nicolas Fraison: hadoop: decrease log retention from 40d to 14d [puppet] - https://gerrit.wikimedia.org/r/894481 [08:00:04] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:elastic-eqiad [08:00:05] Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230306T0800). [08:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:59] OK. 
I'm here jouncebot [08:01:14] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:01:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P44944 and previous config saved to /var/cache/conftool/dbconfig/20230306-080132-marostegui.json [08:01:39] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/891818 (https://phabricator.wikimedia.org/T330482) (owner: KartikMistry) [08:01:40] kart_: feel free to self-deploy if you want :) [08:01:48] urbanecm: yeah. Doing it :) [08:01:52] urbanecm: Thanks! [08:01:57] ack! :) [08:02:24] (Merged) jenkins-bot: Content Translation: Adjust the global limit for unedited MT to 95% [mediawiki-config] - https://gerrit.wikimedia.org/r/891818 (https://phabricator.wikimedia.org/T330482) (owner: KartikMistry) [08:02:48] !log kartik@deploy2002 Started scap: Backport for [[gerrit:891818|Content Translation: Adjust the global limit for unedited MT to 95% (T330482)]] [08:02:54] T330482: Adjust the global limit for unedited MT to 95% - https://phabricator.wikimedia.org/T330482 [08:04:54] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4037 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [08:06:42] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4037 is OK: SSL OK - OCSP staple validity for wikipedia.org has 312798 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:07:38] k8 images build/push step taking too long? [08:07:43] urbanecm: ^ [08:09:37] 7 min. Now, seems OK. 
[08:11:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P44945 and previous config saved to /var/cache/conftool/dbconfig/20230306-081138-marostegui.json [08:12:39] !log kartik@deploy2002 kartik: Backport for [[gerrit:891818|Content Translation: Adjust the global limit for unedited MT to 95% (T330482)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [08:12:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T328817)', diff saved to https://phabricator.wikimedia.org/P44946 and previous config saved to /var/cache/conftool/dbconfig/20230306-081244-marostegui.json [08:12:45] T330482: Adjust the global limit for unedited MT to 95% - https://phabricator.wikimedia.org/T330482 [08:12:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1123.eqiad.wmnet with reason: Maintenance [08:12:50] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [08:12:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1123.eqiad.wmnet with reason: Maintenance [08:13:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T328817)', diff saved to https://phabricator.wikimedia.org/P44947 and previous config saved to /var/cache/conftool/dbconfig/20230306-081305-marostegui.json [08:13:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44948 and previous config saved to /var/cache/conftool/dbconfig/20230306-081310-root.json [08:14:52] PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp4043 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS 
[08:16:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T329203)', diff saved to https://phabricator.wikimedia.org/P44949 and previous config saved to /var/cache/conftool/dbconfig/20230306-081639-marostegui.json [08:16:40] RECOVERY - HAProxy HTTPS wikiworkshop.org RSA on cp4043 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 358999 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (RSA) valid until 2023-03-30 14:08:36 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:16:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1104.eqiad.wmnet with reason: Maintenance [08:16:45] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [08:17:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1104.eqiad.wmnet with reason: Maintenance [08:17:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T329203)', diff saved to https://phabricator.wikimedia.org/P44950 and previous config saved to /var/cache/conftool/dbconfig/20230306-081711-marostegui.json [08:17:45] (CR) Filippo Giunchedi: [C: +1] data-persistence: alert on elevated sessions store error rate (5xx) [alerts] - https://gerrit.wikimedia.org/r/893538 (https://phabricator.wikimedia.org/T327960) (owner: Eevans) [08:18:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T328817)', diff saved to https://phabricator.wikimedia.org/P44951 and previous config saved to /var/cache/conftool/dbconfig/20230306-081857-marostegui.json [08:19:03] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [08:19:14] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - 
https://phabricator.wikimedia.org/T329073 (10fgiunchedi) [08:21:30] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4043 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [08:22:00] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:891818|Content Translation: Adjust the global limit for unedited MT to 95% (T330482)]] (duration: 19m 12s) [08:22:06] T330482: Adjust the global limit for unedited MT to 95% - https://phabricator.wikimedia.org/T330482 [08:22:26] 10SRE, 10Cloud-Services, 10Traffic, 10cloud-services-team: Horizon/lvs alerts the wrong people (and also is generally too sensitive) - https://phabricator.wikimedia.org/T331197 (10fgiunchedi) The easiest thing to do ATM I think is set `page: false` in `service::catalog` for the labweb service(s), this way... [08:22:37] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [08:23:18] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4043 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 189401 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:23:38] (03CR) 10Filippo Giunchedi: [C: 03+2] Add 'pint' integration [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [08:24:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:elastic-eqiad [08:25:15] urbanecm: done with deployment. [08:26:08] Ack. I don't have anything else :-). 
[08:26:20] PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp4042 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [08:26:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T329260)', diff saved to https://phabricator.wikimedia.org/P44952 and previous config saved to /var/cache/conftool/dbconfig/20230306-082645-marostegui.json [08:26:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [08:26:52] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [08:27:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [08:28:08] RECOVERY - HAProxy HTTPS wikiworkshop.org RSA on cp4042 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 358312 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (RSA) valid until 2023-03-30 14:08:36 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:28:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44953 and previous config saved to /var/cache/conftool/dbconfig/20230306-082815-root.json [08:28:41] !log rolling restart of Apache on mw* to pick up apr-util security updates [08:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [08:30:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [08:30:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 
0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:30:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:30:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T329260)', diff saved to https://phabricator.wikimedia.org/P44954 and previous config saved to /var/cache/conftool/dbconfig/20230306-083038-marostegui.json [08:30:48] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4039 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [08:31:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T329260)', diff saved to https://phabricator.wikimedia.org/P44955 and previous config saved to /var/cache/conftool/dbconfig/20230306-083147-marostegui.json [08:31:54] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [08:32:42] (03PS1) 10Nicolas Fraison: hive: Fix max metaspace size of hiveserver2 prod to 512m [puppet] - 10https://gerrit.wikimedia.org/r/894483 (https://phabricator.wikimedia.org/T303168) [08:34:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P44956 and previous config saved to /var/cache/conftool/dbconfig/20230306-083403-marostegui.json [08:34:22] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4039 is OK: SSL OK - OCSP staple validity for wikipedia.org has 311137 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:37:04] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4044 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did 
not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [08:37:43] (03PS2) 10Muehlenhoff: Failover urldownloader [dns] - 10https://gerrit.wikimedia.org/r/894047 (https://phabricator.wikimedia.org/T329073) [08:38:52] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4044 is OK: SSL OK - OCSP staple validity for wikipedia.org has 310867 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:39:46] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4039 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [08:40:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T329203)', diff saved to https://phabricator.wikimedia.org/P44957 and previous config saved to /var/cache/conftool/dbconfig/20230306-084017-marostegui.json [08:40:24] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [08:41:32] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4039 is OK: SSL OK - OCSP staple validity for wikipedia.org has 310707 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:42:28] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4037 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.org has 83851 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [08:43:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44958 and previous config saved to /var/cache/conftool/dbconfig/20230306-084320-root.json [08:44:18] (03CR) 10Muehlenhoff: [C: 
03+2] Failover urldownloader [dns] - 10https://gerrit.wikimedia.org/r/894047 (https://phabricator.wikimedia.org/T329073) (owner: 10Muehlenhoff) [08:45:53] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10MoritzMuehlenhoff) [08:46:04] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4037 is OK: SSL OK - OCSP staple validity for wikipedia.org has 310435 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:46:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P44959 and previous config saved to /var/cache/conftool/dbconfig/20230306-084653-marostegui.json [08:48:37] (03PS1) 10Vgutierrez: hiera: Disable HAProxy systemd hardening in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/894484 (https://phabricator.wikimedia.org/T323944) [08:49:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P44960 and previous config saved to /var/cache/conftool/dbconfig/20230306-084910-marostegui.json [08:53:50] (03PS1) 10Filippo Giunchedi: search-platform: split RDF streaming updater alerts for 'ops' [alerts] - 10https://gerrit.wikimedia.org/r/894485 (https://phabricator.wikimedia.org/T309182) [08:55:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P44961 and previous config saved to /var/cache/conftool/dbconfig/20230306-085523-marostegui.json [08:55:42] (03CR) 10Vgutierrez: [C: 03+1] sre: more readable varnish/haproxy frontend unavailable [alerts] - 10https://gerrit.wikimedia.org/r/892362 (https://phabricator.wikimedia.org/T330405) (owner: 10Filippo Giunchedi) [08:56:35] 
(03CR) 10Nicolas Fraison: [C: 03+1] Add forward and reverse entries for aqs.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/894024 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [08:57:05] (03CR) 10Nicolas Fraison: [C: 03+1] Add an entry in the service catalog for the aqs service running in codfw [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [08:57:29] (03CR) 10Hashar: [C: 03+2] Update Gerrit to v3.5.5 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/893743 (https://phabricator.wikimedia.org/T330663) (owner: 10Hashar) [08:57:52] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: more readable varnish/haproxy frontend unavailable [alerts] - 10https://gerrit.wikimedia.org/r/892362 (https://phabricator.wikimedia.org/T330405) (owner: 10Filippo Giunchedi) [08:57:56] (03PS2) 10Filippo Giunchedi: sre: more readable varnish/haproxy frontend unavailable [alerts] - 10https://gerrit.wikimedia.org/r/892362 (https://phabricator.wikimedia.org/T330405) [08:58:15] (03Merged) 10jenkins-bot: Update Gerrit to v3.5.5 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/893743 (https://phabricator.wikimedia.org/T330663) (owner: 10Hashar) [08:58:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44962 and previous config saved to /var/cache/conftool/dbconfig/20230306-085825-root.json [08:59:46] Good morning [09:00:01] I am going to upgrade Gerrit, it will be unavailable for some minute(s) [09:00:04] hashar: My dear minions, it's time we take the moon! Just kidding. Time for Gerrit upgrade deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230306T0900). 
[09:00:14] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39955/console" [puppet] - 10https://gerrit.wikimedia.org/r/894484 (https://phabricator.wikimedia.org/T323944) (owner: 10Vgutierrez) [09:00:29] !log hashar@deploy2002 Started deploy [gerrit/gerrit@b725ff6]: Gerrit to 3.5.5 on gerrit2002 [09:00:36] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@b725ff6]: Gerrit to 3.5.5 on gerrit2002 (duration: 00m 07s) [09:01:20] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Disable HAProxy systemd hardening in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/894484 (https://phabricator.wikimedia.org/T323944) (owner: 10Vgutierrez) [09:02:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P44963 and previous config saved to /var/cache/conftool/dbconfig/20230306-090200-marostegui.json [09:02:08] !log disabling haproxy systemd service unit hardening in ulsfo - T323944 [09:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:14] T323944: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 [09:04:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T328817)', diff saved to https://phabricator.wikimedia.org/P44964 and previous config saved to /var/cache/conftool/dbconfig/20230306-090416-marostegui.json [09:04:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:04:23] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [09:04:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:06:31] I am restarting Gerrit now [09:06:37] !log hashar@deploy2002 Started deploy 
[gerrit/gerrit@b725ff6]: Gerrit to 3.5.5 on gerrit1001 [09:06:49] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@b725ff6]: Gerrit to 3.5.5 on gerrit1001 (duration: 00m 12s) [09:10:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P44965 and previous config saved to /var/cache/conftool/dbconfig/20230306-091030-marostegui.json [09:10:33] (03CR) 10Klausman: [C: 03+1] kserve: upgrade to 0.10 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/894015 (https://phabricator.wikimedia.org/T331114) (owner: 10Elukey) [09:10:56] (03CR) 10Klausman: [C: 03+1] profile::service_proxy::envoy: add support for inference [puppet] - 10https://gerrit.wikimedia.org/r/894014 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [09:11:45] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) >>! In T330495#8663822, @MoritzMuehlenhoff wrote: > An agreement on the correct fix has been found, but a fix still needs to be made within e2fsprogs on the Debian si... [09:13:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44966 and previous config saved to /var/cache/conftool/dbconfig/20230306-091330-root.json [09:14:13] !log depooling & restarting blazegraph on wdqs1006 (stuck for 48+ hours) [09:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:19] (03PS1) 10Slyngshede: P:IDM Minor fixes and restructure. 
[puppet] - 10https://gerrit.wikimedia.org/r/894527 [09:16:38] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:17:03] (03CR) 10Elukey: [V: 03+2 C: 03+2] kserve: upgrade to 0.10 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/894015 (https://phabricator.wikimedia.org/T331114) (owner: 10Elukey) [09:17:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T329260)', diff saved to https://phabricator.wikimedia.org/P44967 and previous config saved to /var/cache/conftool/dbconfig/20230306-091706-marostegui.json [09:17:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [09:17:14] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [09:17:14] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.153 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:17:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:17:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [09:17:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:17:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T329260)', diff saved to https://phabricator.wikimedia.org/P44968 and previous config saved to /var/cache/conftool/dbconfig/20230306-091728-marostegui.json [09:17:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T328817)', diff saved to https://phabricator.wikimedia.org/P44969 and previous config saved to 
/var/cache/conftool/dbconfig/20230306-091733-marostegui.json [09:17:48] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [09:18:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T329260)', diff saved to https://phabricator.wikimedia.org/P44970 and previous config saved to /var/cache/conftool/dbconfig/20230306-091836-marostegui.json [09:23:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T328817)', diff saved to https://phabricator.wikimedia.org/P44971 and previous config saved to /var/cache/conftool/dbconfig/20230306-092320-marostegui.json [09:23:27] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [09:25:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Research intern nickifeajika - https://phabricator.wikimedia.org/T330993 (10nickifeajika) ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICd+b9kgn5LvOPSCTQjxKorKfyqxTRSVQJczx+Gd+eBq nifeajika-ctr@wikimedia.org [09:25:34] 10SRE, 10Traffic: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 (10Vgutierrez) I've disabled the systemd hardening after confirming issues in ulsfo: `counterexample vgutierrez@cp4041:~$ ps auxww |grep haproxy |wc -l 49 ` HAProxy is unable to terminate old proce... 
[09:25:36] RECOVERY - Check systemd state on dumpsdata1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:25:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T329203)', diff saved to https://phabricator.wikimedia.org/P44972 and previous config saved to /var/cache/conftool/dbconfig/20230306-092536-marostegui.json [09:25:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1111.eqiad.wmnet with reason: Maintenance [09:25:43] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [09:25:48] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Research intern nickifeajika - https://phabricator.wikimedia.org/T330993 (10nickifeajika) {F36893781} [09:25:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1111.eqiad.wmnet with reason: Maintenance [09:25:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T329203)', diff saved to https://phabricator.wikimedia.org/P44973 and previous config saved to /var/cache/conftool/dbconfig/20230306-092557-marostegui.json [09:28:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44974 and previous config saved to /var/cache/conftool/dbconfig/20230306-092836-root.json [09:28:52] PROBLEM - Check systemd state on dumpsdata1007 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_xmldumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:33] (03PS2) 10Slyngshede: P:IDM Minor fixes and restructure. 
[puppet] - 10https://gerrit.wikimedia.org/r/894527 [09:33:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P44975 and previous config saved to /var/cache/conftool/dbconfig/20230306-093343-marostegui.json [09:33:48] (03CR) 10CI reject: [V: 04-1] P:IDM Minor fixes and restructure. [puppet] - 10https://gerrit.wikimedia.org/r/894527 (owner: 10Slyngshede) [09:34:37] (03CR) 10Slyngshede: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/894527 (owner: 10Slyngshede) [09:36:42] !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-conf1001.eqiad.wmnet with OS bullseye [09:38:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P44976 and previous config saved to /var/cache/conftool/dbconfig/20230306-093827-marostegui.json [09:40:07] (03CR) 10Elukey: [C: 03+2] kserve: upgrade to kserve 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/894058 (https://phabricator.wikimedia.org/T331114) (owner: 10Elukey) [09:42:24] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:42:36] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:42:54] (03CR) 10Clément Goubert: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) (owner: 10Clément Goubert) [09:43:34] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39960/console" [puppet] - 10https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) (owner: 10Clément Goubert) [09:43:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44977 and previous config saved to /var/cache/conftool/dbconfig/20230306-094341-root.json [09:45:03] (03PS1) 10Nicolas Fraison: partman: correct path to custom zk-raid [puppet] - 10https://gerrit.wikimedia.org/r/894532 [09:45:21] (03PS1) 10Elukey: kserve: update docker image versions to 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/894533 (https://phabricator.wikimedia.org/T331114) [09:45:48] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39961/console" [puppet] - 10https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) (owner: 10Clément Goubert) [09:46:01] (03CR) 10Nicolas Fraison: [C: 03+2] partman: correct path to custom zk-raid [puppet] - 10https://gerrit.wikimedia.org/r/894532 (owner: 10Nicolas Fraison) [09:47:24] (03CR) 10Jaime Nuche: [C: 03+1] "Matches current ownership on the host. 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/893787 (owner: 10Clément Goubert) [09:47:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host deploy1002.eqiad.wmnet [09:47:53] (03CR) 10Ottomata: [C: 03+2] Switch the druid datasource for aqs to use the latest mediwiki_history [puppet] - 10https://gerrit.wikimedia.org/r/894049 (owner: 10Btullis) [09:48:33] (03CR) 10Clément Goubert: [C: 03+2] P:releases:mediawiki: Fix /srv/patches ownership [puppet] - 10https://gerrit.wikimedia.org/r/893787 (owner: 10Clément Goubert) [09:48:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P44978 and previous config saved to /var/cache/conftool/dbconfig/20230306-094849-marostegui.json [09:49:11] !log nfraison@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host an-conf1001.eqiad.wmnet with OS bullseye [09:49:38] !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-conf1001.eqiad.wmnet with OS bullseye [09:49:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T329203)', diff saved to https://phabricator.wikimedia.org/P44979 and previous config saved to /var/cache/conftool/dbconfig/20230306-094944-marostegui.json [09:49:50] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [09:51:26] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] P:phabricator::aphlict: Set deploy_root as git safe.dir [puppet] - 10https://gerrit.wikimedia.org/r/893762 (owner: 10Clément Goubert) [09:52:04] (03CR) 10David Caro: [C: 03+2] metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187 (https://phabricator.wikimedia.org/T325617) (owner: 10David Caro) [09:53:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to 
https://phabricator.wikimedia.org/P44980 and previous config saved to /var/cache/conftool/dbconfig/20230306-095333-marostegui.json [09:55:09] (03CR) 10David Caro: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39962/console" [puppet] - 10https://gerrit.wikimedia.org/r/869187 (https://phabricator.wikimedia.org/T325617) (owner: 10David Caro) [09:57:34] (03CR) 10Volans: "Reminder inline" [dns] - 10https://gerrit.wikimedia.org/r/894024 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [09:58:55] (03CR) 10Elukey: [C: 03+2] kserve: update docker image versions to 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/894533 (https://phabricator.wikimedia.org/T331114) (owner: 10Elukey) [09:59:24] !log otto@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [09:59:37] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host deploy1002.eqiad.wmnet [10:02:03] (03CR) 10Elukey: "Ben I think that we should hold off until we decide what to do with the unencrypted traffic codfw -> eqiad. 
Having codfw depooled will def" [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [10:03:25] (03PS1) 10Btullis: Disable all gobblin jobs to allow for HDFS maintenance [puppet] - 10https://gerrit.wikimedia.org/r/894537 (https://phabricator.wikimedia.org/T329073) [10:03:33] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:03:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T329260)', diff saved to https://phabricator.wikimedia.org/P44981 and previous config saved to /var/cache/conftool/dbconfig/20230306-100356-marostegui.json [10:03:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [10:04:03] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [10:04:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [10:04:13] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[10:04:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1173 (T329260)', diff saved to https://phabricator.wikimedia.org/P44982 and previous config saved to /var/cache/conftool/dbconfig/20230306-100417-marostegui.json [10:04:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P44983 and previous config saved to /var/cache/conftool/dbconfig/20230306-100450-marostegui.json [10:05:25] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39963/console" [puppet] - 10https://gerrit.wikimedia.org/r/894537 (https://phabricator.wikimedia.org/T329073) (owner: 10Btullis) [10:05:29] !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-conf1001.eqiad.wmnet with reason: host reimage [10:06:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T329260)', diff saved to https://phabricator.wikimedia.org/P44984 and previous config saved to /var/cache/conftool/dbconfig/20230306-100626-marostegui.json [10:06:29] (03PS2) 10Clément Goubert: trafficserver: move testwikidata to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) [10:07:06] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/894539 (owner: 10Clément Goubert) [10:07:14] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[10:08:02] (03CR) 10CI reject: [V: 04-1] P:phabricator::aphlict: Disable git safe.dir [puppet] - 10https://gerrit.wikimedia.org/r/894539 (owner: 10Clément Goubert) [10:08:21] (03CR) 10CI reject: [V: 04-1] P:phabricator::aphlict: Disable git safe.dir [puppet] - 10https://gerrit.wikimedia.org/r/894539 (owner: 10Clément Goubert) [10:08:36] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-conf1001.eqiad.wmnet with reason: host reimage [10:08:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T328817)', diff saved to https://phabricator.wikimedia.org/P44985 and previous config saved to /var/cache/conftool/dbconfig/20230306-100840-marostegui.json [10:08:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:08:43] (03PS1) 10Elukey: kserve: fix missing comma in yaml config [deployment-charts] - 10https://gerrit.wikimedia.org/r/894541 (https://phabricator.wikimedia.org/T331114) [10:08:46] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [10:08:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:09:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T328817)', diff saved to https://phabricator.wikimedia.org/P44986 and previous config saved to /var/cache/conftool/dbconfig/20230306-100901-marostegui.json [10:12:20] (03PS1) 10Jaime Nuche: thumbor: temporarily disable Scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/894542 [10:12:28] RECOVERY - Check systemd state on an-airflow1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:31] (03CR) 10Btullis: [V: 03+1] "The only issue with this is that it deletes historical logs, as shown by the pcc run." 
[puppet] - 10https://gerrit.wikimedia.org/r/894537 (https://phabricator.wikimedia.org/T329073) (owner: 10Btullis) [10:12:42] !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:12:44] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:12:51] !log otto@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [10:14:06] 10SRE, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Radar): git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10MatthewVernon) @hashar are there still things that need doing on this task? It looks like h... [10:14:39] (03PS3) 10Clément Goubert: P:phabricator::aphlict: git safe.dir [puppet] - 10https://gerrit.wikimedia.org/r/894539 [10:14:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T328817)', diff saved to https://phabricator.wikimedia.org/P44987 and previous config saved to /var/cache/conftool/dbconfig/20230306-101450-marostegui.json [10:14:56] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [10:15:01] (03CR) 10CI reject: [V: 04-1] P:phabricator::aphlict: git safe.dir [puppet] - 10https://gerrit.wikimedia.org/r/894539 (owner: 10Clément Goubert) [10:16:01] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39964/console" [puppet] - 10https://gerrit.wikimedia.org/r/894539 (owner: 10Clément Goubert) [10:16:33] (03PS4) 10Clément Goubert: P:phabricator::aphlict: git safe.dir [puppet] - 10https://gerrit.wikimedia.org/r/894539 [10:16:37] (03PS1) 10Vgutierrez: cache::haproxy: Grant CAP_KILL on hardened mode [puppet] - 10https://gerrit.wikimedia.org/r/894544 (https://phabricator.wikimedia.org/T323944) [10:17:38] !log jayme@deploy2002 
helmfile [staging-codfw] START helmfile.d/admin 'sync'. [10:17:51] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39966/console" [puppet] - https://gerrit.wikimedia.org/r/894539 (owner: Clément Goubert) [10:18:33] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [10:19:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P44988 and previous config saved to /var/cache/conftool/dbconfig/20230306-101957-marostegui.json [10:21:07] (CR) Clément Goubert: [V: +1 C: +2] P:phabricator::aphlict: git safe.dir [puppet] - https://gerrit.wikimedia.org/r/894539 (owner: Clément Goubert) [10:21:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P44989 and previous config saved to /var/cache/conftool/dbconfig/20230306-102132-marostegui.json [10:21:44] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:21:46] (CR) Vgutierrez: [C: +2] cache::haproxy: Grant CAP_KILL on hardened mode [puppet] - https://gerrit.wikimedia.org/r/894544 (https://phabricator.wikimedia.org/T323944) (owner: Vgutierrez) [10:22:21] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (MatthewVernon) @thcipriani can I ping you about this approval, please?
[10:24:12] PROBLEM - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:30] (PS1) Vgutierrez: hiera: Enable HAProxy systemd hardening in cp4045 [puppet] - https://gerrit.wikimedia.org/r/894545 (https://phabricator.wikimedia.org/T323944) [10:24:46] (CR) Muehlenhoff: P:phabricator::aphlict: git safe.dir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/894539 (owner: Clément Goubert) [10:26:33] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39968/console" [puppet] - https://gerrit.wikimedia.org/r/894545 (https://phabricator.wikimedia.org/T323944) (owner: Vgutierrez) [10:27:19] (CR) Vgutierrez: [V: +1 C: +2] hiera: Enable HAProxy systemd hardening in cp4045 [puppet] - https://gerrit.wikimedia.org/r/894545 (https://phabricator.wikimedia.org/T323944) (owner: Vgutierrez) [10:27:24] (CR) Kosta Harlan: [C: +1] Stop refining SpecialMuteSubmit events [puppet] - https://gerrit.wikimedia.org/r/894000 (https://phabricator.wikimedia.org/T329718) (owner: Phuedx) [10:28:39] (CR) Clément Goubert: [V: +1 C: +2] P:phabricator::aphlict: git safe.dir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/894539 (owner: Clément Goubert) [10:28:42] (PS1) Alexandros Kosiaris: wikikube eqiad: Update cluster settings for k8s 1.23 [puppet] - https://gerrit.wikimedia.org/r/894586 (https://phabricator.wikimedia.org/T326617) [10:29:13] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-conf1001.eqiad.wmnet with OS bullseye [10:29:32] !log enable haproxy systemd service unit hardening in cp4045 - T323944 [10:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:37] T323944: haproxy: work on systemd unit hardening (cp hosts) -
https://phabricator.wikimedia.org/T323944 [10:29:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P44990 and previous config saved to /var/cache/conftool/dbconfig/20230306-102956-marostegui.json [10:30:33] (CR) Jbond: "lgtm couple of minor nits/questions" [puppet] - https://gerrit.wikimedia.org/r/894527 (owner: Slyngshede) [10:31:26] (PS2) Alexandros Kosiaris: wikikube eqiad: Update cluster settings for k8s 1.23 [puppet] - https://gerrit.wikimedia.org/r/894586 (https://phabricator.wikimedia.org/T326617) [10:31:43] SRE, Traffic: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 (Vgutierrez) @ssingh this could be as easy to fix as granting `CAP_KILL`, I'm currently testing that on cp4045 [10:32:28] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:32:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host poolcounter1005.eqiad.wmnet [10:33:36] (PS2) Btullis: Disable all gobblin jobs to allow for HDFS maintenance [puppet] - https://gerrit.wikimedia.org/r/894537 (https://phabricator.wikimedia.org/T329073) [10:33:42] (CR) Muehlenhoff: P:IDM Minor fixes and restructure.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/894527 (owner: Slyngshede) [10:34:01] (CR) Clément Goubert: [V: +1 C: +2] P:phabricator::aphlict: git safe.dir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/894539 (owner: Clément Goubert) [10:34:09] SRE-tools, Infrastructure-Foundations: firmware-upgrade cookbook fails after successful upgrade - https://phabricator.wikimedia.org/T331135 (jbond) p:Triage→Medium [10:35:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T329203)', diff saved to https://phabricator.wikimedia.org/P44991 and previous config saved to /var/cache/conftool/dbconfig/20230306-103503-marostegui.json [10:35:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1114.eqiad.wmnet with reason: Maintenance [10:35:11] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [10:35:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1114.eqiad.wmnet with reason: Maintenance [10:35:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T329203)', diff saved to https://phabricator.wikimedia.org/P44992 and previous config saved to /var/cache/conftool/dbconfig/20230306-103525-marostegui.json [10:36:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1005.eqiad.wmnet [10:36:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P44993 and previous config saved to /var/cache/conftool/dbconfig/20230306-103639-marostegui.json [10:36:52] (PS1) Nicolas Fraison: partman: migrate to reuse-parts.cfg as zk partman is validated [puppet] - https://gerrit.wikimedia.org/r/894587 [10:37:06] (PS2) Nicolas Fraison: partman: migrate to reuse-parts.cfg as zk partman is
validated [puppet] - https://gerrit.wikimedia.org/r/894587 [10:38:02] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39969/console" [puppet] - https://gerrit.wikimedia.org/r/894537 (https://phabricator.wikimedia.org/T329073) (owner: Btullis) [10:39:18] (CR) Nicolas Fraison: "The Metaspace memory usage looks quite better being stable at 183MB" [puppet] - https://gerrit.wikimedia.org/r/894483 (https://phabricator.wikimedia.org/T303168) (owner: Nicolas Fraison) [10:45:00] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277 (nickifeajika) [10:45:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P44994 and previous config saved to /var/cache/conftool/dbconfig/20230306-104503-marostegui.json [10:46:05] (CR) Btullis: [C: +1] "Looks good to me, but I'd like to get a +1 from joal in case there is more back-story to the long retention time."
[puppet] - https://gerrit.wikimedia.org/r/894481 (owner: Nicolas Fraison) [10:46:24] (PS1) Clément Goubert: Revert "P:phabricator::aphlict: git safe.dir" [puppet] - https://gerrit.wikimedia.org/r/894547 [10:48:44] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:58] !log disable puppet fleet wide to reboot puppetdb [10:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:34] (PS1) Clément Goubert: Revert "P:phabricator::aphlict: git safe.dir" [puppet] - https://gerrit.wikimedia.org/r/894548 [10:51:07] (PS1) Alexandros Kosiaris: admin_ng: Update wikikube-eqiad settings to k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/894591 (https://phabricator.wikimedia.org/T326617) [10:51:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T329260)', diff saved to https://phabricator.wikimedia.org/P44995 and previous config saved to /var/cache/conftool/dbconfig/20230306-105145-marostegui.json [10:51:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:51:52] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [10:52:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:52:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T329260)', diff saved to https://phabricator.wikimedia.org/P44996 and previous config saved to /var/cache/conftool/dbconfig/20230306-105206-marostegui.json [10:53:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T329260)', diff saved to https://phabricator.wikimedia.org/P44997 and previous config saved to
/var/cache/conftool/dbconfig/20230306-105315-marostegui.json [10:53:50] (CR) David Caro: [C: +2] grafana: remove home test [puppet] - https://gerrit.wikimedia.org/r/875957 (owner: David Caro) [10:56:08] PROBLEM - Host puppetdb1003 is DOWN: PING CRITICAL - Packet loss = 100% [10:56:38] (CR) Joal: [C: +1] "I can't remember us looking at month old logs - Maybe when doing development and willing to look back at old tests - For this use case we " [puppet] - https://gerrit.wikimedia.org/r/894481 (owner: Nicolas Fraison) [10:57:14] RECOVERY - Host puppetdb1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [10:57:14] PROBLEM - Host puppetdb2003 is DOWN: PING CRITICAL - Packet loss = 100% [10:58:06] RECOVERY - Host puppetdb2003 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [10:58:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T329203)', diff saved to https://phabricator.wikimedia.org/P44998 and previous config saved to /var/cache/conftool/dbconfig/20230306-105834-marostegui.json [10:58:36] (expected) [10:58:41] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [10:59:41] (PS1) Kosta Harlan: GrowthExperiments: Remove unused GENewImpactD3Enabled flag [mediawiki-config] - https://gerrit.wikimedia.org/r/894593 [11:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230306T1100) [11:00:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T328817)', diff saved to https://phabricator.wikimedia.org/P44999 and previous config saved to /var/cache/conftool/dbconfig/20230306-110009-marostegui.json [11:00:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1179.eqiad.wmnet with reason: Maintenance [11:00:16] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis -
https://phabricator.wikimedia.org/T328817 [11:00:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1179.eqiad.wmnet with reason: Maintenance [11:00:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T328817)', diff saved to https://phabricator.wikimedia.org/P45000 and previous config saved to /var/cache/conftool/dbconfig/20230306-110031-marostegui.json [11:01:26] PROBLEM - Check systemd state on puppetdb1003 is CRITICAL: CRITICAL - degraded: The following units failed: puppetdb.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:40] PROBLEM - Check systemd state on puppetdb2003 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:45] (JobUnavailable) firing: Reduced availability for job jmx_puppetdb in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:02:07] (PS1) Kosta Harlan: GrowthExperiments: Make new impact module default on betalabs [mediawiki-config] - https://gerrit.wikimedia.org/r/894594 (https://phabricator.wikimedia.org/T328757) [11:03:30] RECOVERY - Check systemd state on puppetdb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:12] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 275 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [11:06:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T328817)', diff saved to https://phabricator.wikimedia.org/P45001 and previous config saved to
/var/cache/conftool/dbconfig/20230306-110620-marostegui.json [11:06:27] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [11:08:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P45002 and previous config saved to /var/cache/conftool/dbconfig/20230306-110822-marostegui.json [11:09:12] !log enable puppet fleet wide to post reboot puppetdb [11:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:54] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1002 is OK: HTTP OK: HTTP/1.1 200 OK - 889558 bytes in 5.326 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [11:10:18] (CR) Clément Goubert: [C: +2] Revert "P:phabricator::aphlict: git safe.dir" [puppet] - https://gerrit.wikimedia.org/r/894547 (owner: Clément Goubert) [11:10:29] (CR) Clément Goubert: [C: +2] Revert "P:phabricator::aphlict: git safe.dir" [puppet] - https://gerrit.wikimedia.org/r/894548 (owner: Clément Goubert) [11:12:32] (CR) Nicolas Fraison: [C: +1] Disable all gobblin jobs to allow for HDFS maintenance [puppet] - https://gerrit.wikimedia.org/r/894537 (https://phabricator.wikimedia.org/T329073) (owner: Btullis) [11:13:13] (CR) Clément Goubert: [C: +2] "This change is ready for review."
[puppet] - https://gerrit.wikimedia.org/r/894597 (owner: Clément Goubert) [11:13:25] (CR) Clément Goubert: [V: +2 C: +2] Revert "P:phabricator::aphlict: Set deploy_root as git safe.dir" [puppet] - https://gerrit.wikimedia.org/r/894597 (owner: Clément Goubert) [11:13:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P45003 and previous config saved to /var/cache/conftool/dbconfig/20230306-111340-marostegui.json [11:14:55] (CR) Btullis: "Whay is it that we need to have both the hostname and fqdn in the hiera file when excluding hosts? I couldn't immediately see the reason f" [puppet] - https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: Nicolas Fraison) [11:15:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host poolcounter1004.eqiad.wmnet [11:15:46] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:05] SRE, serviceops: kubernetes102[34] implemetation tracking - https://phabricator.wikimedia.org/T313874 (jijiki) a:akosiaris→jijiki [11:18:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1004.eqiad.wmnet [11:21:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P45005 and previous config saved to /var/cache/conftool/dbconfig/20230306-112126-marostegui.json [11:21:54] (PS1) Lucas Werkmeister (WMDE): termbox(test): update to 2023-03-06-101138-production [deployment-charts] - https://gerrit.wikimedia.org/r/894598 (https://phabricator.wikimedia.org/T309176) [11:23:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P45006 and previous config
saved to /var/cache/conftool/dbconfig/20230306-112328-marostegui.json [11:23:59] (PS1) Lucas Werkmeister (WMDE): termbox(prod): update to 2023-03-06-101138-production [deployment-charts] - https://gerrit.wikimedia.org/r/894599 (https://phabricator.wikimedia.org/T309176) [11:26:01] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (MatthewVernon) [11:28:28] RECOVERY - Check systemd state on puppetdb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P45007 and previous config saved to /var/cache/conftool/dbconfig/20230306-112847-marostegui.json [11:30:27] (PS1) Vgutierrez: hiera: Enable ESI testing in cp4044 [puppet] - https://gerrit.wikimedia.org/r/894600 (https://phabricator.wikimedia.org/T308799) [11:31:45] (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:31:58] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:46] (CR) Jbond: [C: +1] admin: update platform engineering approvers [puppet] - https://gerrit.wikimedia.org/r/889967 (https://phabricator.wikimedia.org/T300244) (owner: Hnowlan) [11:35:58] (CR) Jbond: [C: +1] data.yaml add sgimeno to deployment group. [puppet] - https://gerrit.wikimedia.org/r/890797 (https://phabricator.wikimedia.org/T330070) (owner: Slyngshede) [11:36:31] (PS2) Slyngshede: data.yaml add sgimeno to deployment group.
[puppet] - https://gerrit.wikimedia.org/r/890797 (https://phabricator.wikimedia.org/T330070) [11:36:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P45008 and previous config saved to /var/cache/conftool/dbconfig/20230306-113633-marostegui.json [11:37:05] (PS1) Volans: k8s: fix existing docstrings [software/spicerack] - https://gerrit.wikimedia.org/r/894601 [11:37:24] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:43] (CR) Vgutierrez: [C: +2] hiera: Enable ESI testing in cp4044 [puppet] - https://gerrit.wikimedia.org/r/894600 (https://phabricator.wikimedia.org/T308799) (owner: Vgutierrez) [11:38:21] (PS1) Vgutierrez: Revert "hiera: Enable ESI testing in cp4044" [puppet] - https://gerrit.wikimedia.org/r/894557 [11:38:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T329260)', diff saved to https://phabricator.wikimedia.org/P45009 and previous config saved to /var/cache/conftool/dbconfig/20230306-113835-marostegui.json [11:38:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [11:38:41] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [11:38:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [11:38:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T329260)', diff saved to https://phabricator.wikimedia.org/P45010 and previous config saved to /var/cache/conftool/dbconfig/20230306-113856-marostegui.json [11:39:27] (CR) Jbond: Management routers: move ssh port to 2222 (1 comment) [homer/public] -
https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438) (owner: Ayounsi) [11:40:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T329260)', diff saved to https://phabricator.wikimedia.org/P45011 and previous config saved to /var/cache/conftool/dbconfig/20230306-114004-marostegui.json [11:40:14] (CR) Jbond: [C: +1] P:wmcs: remove osmdb classes [puppet] - https://gerrit.wikimedia.org/r/892904 (https://phabricator.wikimedia.org/T323159) (owner: Majavah) [11:40:26] (CR) Jbond: [C: +1] osm: remove unuseud shapefile_import class [puppet] - https://gerrit.wikimedia.org/r/892905 (owner: Majavah) [11:41:53] (CR) Jbond: [C: +1] harbor: Add robot accounts info [puppet] - https://gerrit.wikimedia.org/r/893481 (owner: David Caro) [11:42:43] !log enable ESI testing in cp4044 - T308799 [11:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:49] T308799: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 [11:43:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T329203)', diff saved to https://phabricator.wikimedia.org/P45012 and previous config saved to /var/cache/conftool/dbconfig/20230306-114354-marostegui.json [11:43:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1116.eqiad.wmnet with reason: Maintenance [11:44:00] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [11:44:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1116.eqiad.wmnet with reason: Maintenance [11:46:21] (CR) Jbond: "Sorry i missed this, is it still valid?
i notice it at the very least needs a rebase" [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm) [11:46:45] (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/893706 (owner: Volans) [11:46:56] (CR) Jbond: [C: +1] "lgtm" [homer/public] - https://gerrit.wikimedia.org/r/894102 (https://phabricator.wikimedia.org/T330670) (owner: Ssingh) [11:47:26] (CR) Volans: "This will likely need a related change to the sre laptop package too and notify all users with network device access to update their ssh c" [homer/public] - https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438) (owner: Ayounsi) [11:48:29] (CR) Volans: [C: +2] sre.hosts.provision: pick the NIC with LinkUp [cookbooks] - https://gerrit.wikimedia.org/r/893706 (owner: Volans) [11:50:22] (Merged) jenkins-bot: sre.hosts.provision: pick the NIC with LinkUp [cookbooks] - https://gerrit.wikimedia.org/r/893706 (owner: Volans) [11:50:27] (PS6) Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) [11:51:36] (PS1) MVernon: Add user aranyap (analytics-privatedata-users, krb) [puppet] - https://gerrit.wikimedia.org/r/894603 (https://phabricator.wikimedia.org/T331067) [11:51:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T328817)', diff saved to https://phabricator.wikimedia.org/P45013 and previous config saved to /var/cache/conftool/dbconfig/20230306-115140-marostegui.json [11:51:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [11:51:48] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [11:51:55] !log marostegui@cumin1001 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [11:52:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T328817)', diff saved to https://phabricator.wikimedia.org/P45014 and previous config saved to /var/cache/conftool/dbconfig/20230306-115201-marostegui.json [11:52:50] (PS1) Ottomata: flink-kubernetes-operator - upstream release 1.4.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/894604 (https://phabricator.wikimedia.org/T331282) [11:54:07] (CR) Ottomata: [C: +2] flink-kubernetes-operator - upstream release 1.4.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/894604 (https://phabricator.wikimedia.org/T331282) (owner: Ottomata) [11:54:10] (CR) Ottomata: [V: +2 C: +2] flink-kubernetes-operator - upstream release 1.4.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/894604 (https://phabricator.wikimedia.org/T331282) (owner: Ottomata) [11:54:58] (PS4) Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - https://gerrit.wikimedia.org/r/889195 (https://phabricator.wikimedia.org/T329669) [11:55:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P45015 and previous config saved to /var/cache/conftool/dbconfig/20230306-115511-marostegui.json [11:56:18] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/894603 (https://phabricator.wikimedia.org/T331067) (owner: MVernon) [11:56:59] SRE, serviceops, Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert) [11:57:24] (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/894601 (owner: Volans) [11:57:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T328817)', diff
saved to https://phabricator.wikimedia.org/P45016 and previous config saved to /var/cache/conftool/dbconfig/20230306-115748-marostegui.json [11:57:55] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [11:57:57] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for Schwirz - https://phabricator.wikimedia.org/T330095 (MatthewVernon) This request needs approval by an engineering manager from WMDE - @WMDE-leszek since you're already subscribed to this task, are you happy to approve this request, pl... [11:58:11] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert) [11:58:54] (CR) MVernon: [C: +2] Add user aranyap (analytics-privatedata-users, krb) [puppet] - https://gerrit.wikimedia.org/r/894603 (https://phabricator.wikimedia.org/T331067) (owner: MVernon) [11:59:10] (CR) Btullis: [V: +1] Add an entry in the service catalog for the aqs service running in codfw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: Btullis) [11:59:19] (CR) Volans: [C: +2] k8s: fix existing docstrings [software/spicerack] - https://gerrit.wikimedia.org/r/894601 (owner: Volans) [12:01:56] SRE, Traffic, serviceops, Datacenter-Switchover: March 2023 Traffic Repool checklist - https://phabricator.wikimedia.org/T331285 (Clement_Goubert) p:Triage→High [12:02:10] (PS1) Clément Goubert: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - https://gerrit.wikimedia.org/r/894559 [12:02:30] (CR) Clément Goubert: [C: -2] "Preparation for eqiad repooling" [dns] - https://gerrit.wikimedia.org/r/894559 (owner: Clément Goubert) [12:02:31] (PS2) Clément Goubert: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] -
https://gerrit.wikimedia.org/r/894559 [12:03:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1126.eqiad.wmnet with reason: Maintenance [12:03:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1126.eqiad.wmnet with reason: Maintenance [12:03:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T329203)', diff saved to https://phabricator.wikimedia.org/P45017 and previous config saved to /var/cache/conftool/dbconfig/20230306-120328-marostegui.json [12:03:35] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [12:03:50] (Merged) jenkins-bot: k8s: fix existing docstrings [software/spicerack] - https://gerrit.wikimedia.org/r/894601 (owner: Volans) [12:04:05] SRE, Traffic, serviceops, Datacenter-Switchover: March 2023 Traffic Repool checklist - https://phabricator.wikimedia.org/T331285 (Clement_Goubert) [12:04:33] (PS3) Clément Goubert: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - https://gerrit.wikimedia.org/r/894559 (https://phabricator.wikimedia.org/T331285) [12:05:31] SRE, Traffic, serviceops, Datacenter-Switchover: March 2023 Traffic Repool checklist - https://phabricator.wikimedia.org/T331285 (Clement_Goubert) [12:05:39] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert) [12:06:16] (CR) Jbond: "im going to abandon this but if there is interest from wmcs happy to revive" [puppet] - https://gerrit.wikimedia.org/r/737774 (owner: Jbond) [12:06:21] (Abandoned) Jbond: P:openstack::base::cloudgw: drop unneeded profiles [puppet] - https://gerrit.wikimedia.org/r/737774 (owner: Jbond) [12:06:56] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to
analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (MatthewVernon) Open→Resolved a:MatthewVernon Hi @aranyap this is all done for you now. [12:10:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P45018 and previous config saved to /var/cache/conftool/dbconfig/20230306-121018-marostegui.json [12:12:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P45019 and previous config saved to /var/cache/conftool/dbconfig/20230306-121255-marostegui.json [12:16:43] SRE, Traffic, serviceops, Datacenter-Switchover: March 2023 Traffic Repool checklist - https://phabricator.wikimedia.org/T331285 (Clement_Goubert) [12:23:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T329203)', diff saved to https://phabricator.wikimedia.org/P45020 and previous config saved to /var/cache/conftool/dbconfig/20230306-122334-marostegui.json [12:23:59] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [12:25:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T329260)', diff saved to https://phabricator.wikimedia.org/P45021 and previous config saved to /var/cache/conftool/dbconfig/20230306-122524-marostegui.json [12:25:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [12:25:31] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [12:25:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [12:25:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1201
(T329260)', diff saved to https://phabricator.wikimedia.org/P45022 and previous config saved to /var/cache/conftool/dbconfig/20230306-122546-marostegui.json [12:26:05] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for Schwirz - https://phabricator.wikimedia.org/T330095 (10WMDE-leszek) I hereby approve the request. Thank you. [12:26:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T329260)', diff saved to https://phabricator.wikimedia.org/P45023 and previous config saved to /var/cache/conftool/dbconfig/20230306-122654-marostegui.json [12:28:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P45024 and previous config saved to /var/cache/conftool/dbconfig/20230306-122801-marostegui.json [12:28:10] (03CR) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN and RM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 10Nicolas Fraison) [12:28:58] (03PS1) 10Muehlenhoff: Ship custom /etc/logrotate.d/rsyslog on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) [12:29:45] (03CR) 10Hnowlan: [C: 03+1] "lgtm, thanks for doing this!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/893486 (owner: 10Muehlenhoff) [12:31:16] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:31:29] (03CR) 10CI reject: [V: 04-1] Ship custom /etc/logrotate.d/rsyslog on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [12:32:28] (03CR) 10Jaime Nuche: "Catalog changes look good in PCC: https://puppet-compiler.wmflabs.org/output/894542/39967/ but class C:role::thumbor::mediawiki strangely " [puppet] - 10https://gerrit.wikimedia.org/r/894542 (owner: 10Jaime Nuche) [12:32:58] !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-conf1002.eqiad.wmnet with OS bullseye [12:33:08] (03CR) 10Cathal Mooney: Management routers: move ssh port to 2222 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [12:36:10] (03PS2) 10Muehlenhoff: Ship custom /etc/logrotate.d/rsyslog on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) [12:36:46] (03CR) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN and RM (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 10Nicolas Fraison) [12:37:11] (03PS10) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN and RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) [12:38:41] (03CR) 10CI reject: [V: 04-1] Ship custom /etc/logrotate.d/rsyslog on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/894630 
(https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [12:38:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P45025 and previous config saved to /var/cache/conftool/dbconfig/20230306-123841-marostegui.json [12:39:20] (03PS3) 10Muehlenhoff: Ship custom /etc/logrotate.d/rsyslog on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) [12:39:36] (03CR) 10CI reject: [V: 04-1] Ship custom /etc/logrotate.d/rsyslog on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [12:42:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P45026 and previous config saved to /var/cache/conftool/dbconfig/20230306-124200-marostegui.json [12:42:04] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:43:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T328817)', diff saved to https://phabricator.wikimedia.org/P45027 and previous config saved to /var/cache/conftool/dbconfig/20230306-124308-marostegui.json [12:43:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [12:43:20] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [12:43:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [12:43:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T328817)', diff saved to https://phabricator.wikimedia.org/P45028 and previous config saved to 
/var/cache/conftool/dbconfig/20230306-124341-marostegui.json [12:46:19] !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-conf1002.eqiad.wmnet with reason: host reimage [12:46:20] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [12:46:27] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jnuche) Hi @hnowlan and @MoritzMuehlenhoff I've tentatively created a patch to try to address the problem with Scap updates and added you as reviewers: https://gerr... [12:48:35] (03PS2) 10Raymond Ndibe: wmcs:nfs:replica_cnf_api_service: update PAWS_REPLICA_CNF_PATH [puppet] - 10https://gerrit.wikimedia.org/r/894227 (https://phabricator.wikimedia.org/T303663) [12:48:48] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-conf1002.eqiad.wmnet with reason: host reimage [12:49:12] (03CR) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN and RM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 10Nicolas Fraison) [12:49:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T328817)', diff saved to https://phabricator.wikimedia.org/P45029 and previous config saved to /var/cache/conftool/dbconfig/20230306-124932-marostegui.json [12:49:38] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [12:50:26] (03PS11) 10Nicolas Fraison: hadoop: automate refresh of exclude nodes in NN and RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) [12:53:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P45030 and previous config 
saved to /var/cache/conftool/dbconfig/20230306-125348-marostegui.json [12:53:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [12:54:30] (03CR) 10Muehlenhoff: [C: 03+2] Add a cookbook to roll-restart Restbase [cookbooks] - 10https://gerrit.wikimedia.org/r/893486 (owner: 10Muehlenhoff) [12:57:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P45031 and previous config saved to /var/cache/conftool/dbconfig/20230306-125707-marostegui.json [12:58:13] (03PS1) 10Filippo Giunchedi: alertmanager: add default cluster filter [puppet] - 10https://gerrit.wikimedia.org/r/894637 (https://phabricator.wikimedia.org/T323714) [13:01:32] (03CR) 10Filippo Giunchedi: "Change itself is fine, though I'm wondering if this is meant/supposed to be temporary or not?" [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [13:04:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P45032 and previous config saved to /var/cache/conftool/dbconfig/20230306-130438-marostegui.json [13:07:52] (03CR) 10Muehlenhoff: Ship custom /etc/logrotate.d/rsyslog on KDC hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [13:08:13] (03PS1) 10Filippo Giunchedi: o11y: deploy alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/894638 (https://phabricator.wikimedia.org/T309182) [13:08:16] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-conf1002.eqiad.wmnet with OS bullseye [13:08:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T329203)', diff saved to 
https://phabricator.wikimedia.org/P45033 and previous config saved to /var/cache/conftool/dbconfig/20230306-130854-marostegui.json [13:08:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [13:08:58] !log rearmed keyholder on deploy1002 following reboot [13:09:01] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [13:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [13:09:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:09:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:09:27] (03PS2) 10Filippo Giunchedi: o11y: deploy alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/894638 (https://phabricator.wikimedia.org/T309182) [13:09:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T329203)', diff saved to https://phabricator.wikimedia.org/P45034 and previous config saved to /var/cache/conftool/dbconfig/20230306-130933-marostegui.json [13:10:26] (03CR) 10Nicolas Fraison: Ship custom /etc/logrotate.d/rsyslog on KDC hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [13:10:46] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 10Nicolas Fraison) [13:11:12] (03CR) 10Btullis: [C: 03+1] partman: migrate to reuse-parts.cfg as zk partman is validated [puppet] - 10https://gerrit.wikimedia.org/r/894587 (owner: 10Nicolas Fraison) [13:12:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T329260)', diff saved to https://phabricator.wikimedia.org/P45035 and previous config saved to /var/cache/conftool/dbconfig/20230306-131214-marostegui.json [13:12:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [13:12:21] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [13:12:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [13:12:46] (03CR) 10Nicolas Fraison: Ship custom /etc/logrotate.d/rsyslog on KDC hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [13:12:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: nfs: primary: introduce missing hiera keys for maintain_dbusers [puppet] - 10https://gerrit.wikimedia.org/r/894225 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [13:13:32] (03CR) 10Muehlenhoff: Ship custom /etc/logrotate.d/rsyslog on KDC hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [13:13:33] (KeyholderUnarmed) resolved: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:14:34] (03CR) 10Nicolas Fraison: [C: 03+2] partman: migrate to reuse-parts.cfg as zk partman is validated [puppet] - 
10https://gerrit.wikimedia.org/r/894587 (owner: 10Nicolas Fraison) [13:15:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [13:15:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [13:15:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T329260)', diff saved to https://phabricator.wikimedia.org/P45036 and previous config saved to /var/cache/conftool/dbconfig/20230306-131545-marostegui.json [13:15:57] (03CR) 10JMeybohm: [C: 04-1] wikikube eqiad: Update cluster settings for k8s 1.23 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/894586 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [13:17:18] (03CR) 10JMeybohm: [C: 03+1] admin_ng: Update wikikube-eqiad settings to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/894591 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [13:17:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs:nfs:replica_cnf_api_service: update PAWS_REPLICA_CNF_PATH [puppet] - 10https://gerrit.wikimedia.org/r/894227 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [13:17:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T329260)', diff saved to https://phabricator.wikimedia.org/P45037 and previous config saved to /var/cache/conftool/dbconfig/20230306-131758-marostegui.json [13:18:07] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [13:19:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P45038 and previous config saved to /var/cache/conftool/dbconfig/20230306-131945-marostegui.json [13:23:29] (03CR) 10JMeybohm: Exclude traindev from tests (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/888227 (owner: 10Clément Goubert) [13:24:17] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis) [13:25:45] (03PS1) 10Arturo Borrero Gonzalez: toolforge: kubeadm: drop calico deployment [puppet] - 10https://gerrit.wikimedia.org/r/894641 (https://phabricator.wikimedia.org/T328539) [13:29:06] (03CR) 10Stevemunene: [C: 03+1] hadoop: automate refresh of exclude nodes in NN and RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 10Nicolas Fraison) [13:29:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: kubeadm: drop calico deployment [puppet] - 10https://gerrit.wikimedia.org/r/894641 (https://phabricator.wikimedia.org/T328539) (owner: 10Arturo Borrero Gonzalez) [13:31:06] (03PS4) 10Muehlenhoff: Ship custom /etc/logrotate.d/rsyslog on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) [13:31:59] (03Abandoned) 10JMeybohm: docker_registry_ha: add nginx rewrite for URLs with tags [puppet] - 10https://gerrit.wikimedia.org/r/695598 (https://phabricator.wikimedia.org/T283764) (owner: 10Dzahn) [13:32:05] (03CR) 10Muehlenhoff: Ship custom /etc/logrotate.d/rsyslog on KDC hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [13:33:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P45039 and previous config saved to /var/cache/conftool/dbconfig/20230306-133304-marostegui.json [13:33:15] (03CR) 10JMeybohm: [C: 03+1] profile::service_proxy::envoy: add support for inference [puppet] - 10https://gerrit.wikimedia.org/r/894014 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [13:34:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you for the 
context" [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [13:34:07] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-restbase rolling restart_daemons on A:restbase-canary [13:34:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-restbase (exit_code=0) rolling restart_daemons on A:restbase-canary [13:34:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T328817)', diff saved to https://phabricator.wikimedia.org/P45040 and previous config saved to /var/cache/conftool/dbconfig/20230306-133451-marostegui.json [13:34:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:34:58] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:35:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:40:13] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=thanos-fe1002.eqiad.wmnet,service=thanos-web [13:40:22] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe1001.eqiad.wmnet,service=thanos-web [13:43:53] (03CR) 10Muehlenhoff: "Looks good in general, two additional comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [13:44:33] (03PS1) 10Ottomata: flink-kubernetes-operator - upgrade to upstream 1.4.0 release [deployment-charts] - 10https://gerrit.wikimedia.org/r/894643 (https://phabricator.wikimedia.org/T331282) [13:45:24] (03CR) 10David Caro: [C: 03+2] harbor: Add robot accounts info [puppet] - 10https://gerrit.wikimedia.org/r/893481 (owner: 10David Caro) [13:45:31] (03PS3) 10David Caro: harbor: Add robot accounts info [puppet] - 
10https://gerrit.wikimedia.org/r/893481 [13:45:37] (03CR) 10CI reject: [V: 04-1] flink-kubernetes-operator - upgrade to upstream 1.4.0 release [deployment-charts] - 10https://gerrit.wikimedia.org/r/894643 (https://phabricator.wikimedia.org/T331282) (owner: 10Ottomata) [13:47:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:48:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:48:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P45041 and previous config saved to /var/cache/conftool/dbconfig/20230306-134811-marostegui.json [13:48:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T328817)', diff saved to https://phabricator.wikimedia.org/P45042 and previous config saved to /var/cache/conftool/dbconfig/20230306-134820-marostegui.json [13:48:27] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:52:39] (03CR) 10David Caro: [C: 03+2] alertmanager: add default cluster filter [puppet] - 10https://gerrit.wikimedia.org/r/894637 (https://phabricator.wikimedia.org/T323714) (owner: 10Filippo Giunchedi) [13:52:47] 10SRE, 10ops-codfw, 10decommission-hardware: decommission wdqs200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T331074 (10Papaul) [13:53:34] 10SRE, 10ops-codfw, 10decommission-hardware: decommission wdqs200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T331074 (10Papaul) 05Open→03Resolved complete [13:54:55] (03CR) 10David Caro: [C: 03+2] "One issue we will have in the future, is avoiding someone from creating a silence on the cloud alertmanager (that is enabled by default, a" [puppet] - 10https://gerrit.wikimedia.org/r/894637 (https://phabricator.wikimedia.org/T323714) (owner: 10Filippo Giunchedi) 
[13:58:34] (03CR) 10Filippo Giunchedi: alertmanager: add default cluster filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894637 (https://phabricator.wikimedia.org/T323714) (owner: 10Filippo Giunchedi) [13:59:52] (03PS1) 10MVernon: add nosc to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/894644 (https://phabricator.wikimedia.org/T330095) [14:00:02] (03CR) 10DCausse: [C: 03+1] "Thanks! this might actually fix T316882 :)" [alerts] - 10https://gerrit.wikimedia.org/r/894485 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230306T1400). [14:00:05] Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:25] o/ [14:02:06] * Lucas_WMDE looks up how to deployment-charts again [14:02:41] (03PS1) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 [14:02:43] (03PS1) 10Jbond: Ship custom /etc/logrotate.d/rsyslog on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/894647 (https://phabricator.wikimedia.org/T331123) [14:03:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T329260)', diff saved to https://phabricator.wikimedia.org/P45043 and previous config saved to /var/cache/conftool/dbconfig/20230306-140317-marostegui.json [14:03:19] (03CR) 10CI reject: [V: 04-1] P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [14:03:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2117.codfw.wmnet with reason: Maintenance [14:03:25] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [14:03:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2117.codfw.wmnet with reason: Maintenance [14:03:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/894644 (https://phabricator.wikimedia.org/T330095) (owner: 10MVernon) [14:03:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T329260)', diff saved to https://phabricator.wikimedia.org/P45044 and previous config saved to /var/cache/conftool/dbconfig/20230306-140339-marostegui.json [14:04:05] (03CR) 10Jbond: "lgtm: ci relates to spdx" [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [14:04:50] (03PS2) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 [14:05:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39971/console" [puppet] - 10https://gerrit.wikimedia.org/r/894647 (https://phabricator.wikimedia.org/T331123) (owner: 10Jbond) [14:05:05] (03PS2) 10Jbond: Ship custom /etc/logrotate.d/rsyslog on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/894647 (https://phabricator.wikimedia.org/T331123) [14:05:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T329203)', diff saved to https://phabricator.wikimedia.org/P45045 and previous config saved to /var/cache/conftool/dbconfig/20230306-140533-marostegui.json [14:05:36] (03CR) 10CI reject: [V: 04-1] P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [14:05:39] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [14:07:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10bking) [14:09:17] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Deploying." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/894598 (https://phabricator.wikimedia.org/T309176) (owner: 10Lucas Werkmeister (WMDE)) [14:09:56] (03PS3) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 [14:11:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39972/console" [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [14:12:07] (03PS1) 10Ssingh: ntp/eqiad: point to dns1002 temporarily [dns] - 10https://gerrit.wikimedia.org/r/894652 (https://phabricator.wikimedia.org/T329073) [14:12:13] (03PS4) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 [14:12:25] (03CR) 10MVernon: [C: 03+2] add nosc to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/894644 (https://phabricator.wikimedia.org/T330095) (owner: 10MVernon) [14:12:52] (03CR) 10Jbond: "see full diff output" [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [14:13:01] (03PS3) 10Jbond: Ship custom /etc/logrotate.d/rsyslog on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/894647 (https://phabricator.wikimedia.org/T331123) [14:14:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T329260)', diff saved to https://phabricator.wikimedia.org/P45046 and previous config saved to /var/cache/conftool/dbconfig/20230306-141404-marostegui.json [14:14:11] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [14:14:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39973/console" [puppet] - 10https://gerrit.wikimedia.org/r/894647 (https://phabricator.wikimedia.org/T331123) (owner: 10Jbond) [14:14:32] (03Merged) 10jenkins-bot: termbox(test): update to 2023-03-06-101138-production [deployment-charts] - 
https://gerrit.wikimedia.org/r/894598 (https://phabricator.wikimedia.org/T309176) (owner: Lucas Werkmeister (WMDE)) [14:15:19] (CR) Jbond: "see full diff https://puppet-compiler.wmflabs.org/output/894647/39973/krb1001.eqiad.wmnet/fulldiff.html" [puppet] - https://gerrit.wikimedia.org/r/894647 (https://phabricator.wikimedia.org/T331123) (owner: Jbond) [14:15:22] (CR) Ssingh: [C: +2] ntp/eqiad: point to dns1002 temporarily [dns] - https://gerrit.wikimedia.org/r/894652 (https://phabricator.wikimedia.org/T329073) (owner: Ssingh) [14:15:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328817)', diff saved to https://phabricator.wikimedia.org/P45047 and previous config saved to /var/cache/conftool/dbconfig/20230306-141534-marostegui.json [14:15:37] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [14:15:41] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [14:16:03] !log running authdns-update for CR 894652 [14:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:06] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmde and ldap/nda for Schwirz - https://phabricator.wikimedia.org/T330095 (MatthewVernon) Open→Resolved a:MatthewVernon Done. [14:20:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P45048 and previous config saved to /var/cache/conftool/dbconfig/20230306-142039-marostegui.json [14:20:44] (PS1) Ssingh: hiera: temporarily removed dns1001 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/894654 (https://phabricator.wikimedia.org/T329073) [14:20:54] hm, is it normal that `helmfile -e staging -i apply` takes a while, with no visible progress?
[14:21:09] (it’s “Upgrading release=test, chart=wmf-stable/termbox” apparently) [14:21:38] I guess it could take a while to distribute the image (for one thing, we switched the base from node 12 to 16) [14:22:02] (PS1) Giuseppe Lavagetto: Add check_dns_state to service.Service [software/spicerack] - https://gerrit.wikimedia.org/r/894655 [14:22:08] (CR) Filippo Giunchedi: [C: +2] search-platform: split RDF streaming updater alerts for 'ops' [alerts] - https://gerrit.wikimedia.org/r/894485 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi) [14:22:12] (PS2) Filippo Giunchedi: search-platform: split RDF streaming updater alerts for 'ops' [alerts] - https://gerrit.wikimedia.org/r/894485 (https://phabricator.wikimedia.org/T309182) [14:25:48] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (Cmjohnson) [14:25:54] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [14:26:17] yikes, “timed out waiting for the condition” [14:26:29] (CR) CI reject: [V: -1] Add check_dns_state to service.Service [software/spicerack] - https://gerrit.wikimedia.org/r/894655 (owner: Giuseppe Lavagetto) [14:27:20] any deployment-charts experts around? [14:27:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (phaultfinder) [14:28:18] SRE, Thumbor, Thumbor Migration, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (akosiaris) @jnuche, once we figure out T328033 and thus are able to complete T233196, thumbor will have nothing to do with scap. It's probably not worth solving thi...
[14:28:31] Lucas_WMDE: Bet that's the node not in path issue again [14:28:46] * Lucas_WMDE has not heard of that issue [14:28:52] Gimme a sec [14:28:56] is it something with the node16 images [14:28:57] ok sure [14:29:03] Yeah [14:29:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P45049 and previous config saved to /var/cache/conftool/dbconfig/20230306-142910-marostegui.json [14:30:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P45050 and previous config saved to /var/cache/conftool/dbconfig/20230306-143041-marostegui.json [14:34:01] (PS3) Filippo Giunchedi: o11y: deploy alerts to 'ops' instance [alerts] - https://gerrit.wikimedia.org/r/894638 (https://phabricator.wikimedia.org/T309182) [14:34:03] (PS1) Filippo Giunchedi: search-platform: restrict RDF streaming updater ops alerts [alerts] - https://gerrit.wikimedia.org/r/894656 (https://phabricator.wikimedia.org/T309182) [14:34:18] FWIW, the image seemed to work when I tried it locally with `docker run` [14:34:37] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (BTullis) [14:35:04] Lucas_WMDE: I can't find the phab issue, I know akosiaris was commenting on it.
I'll go check your kube logs in the meantime
[14:35:10] (PS1) David Caro: karma: make cloud alertmanager read-only [puppet] - https://gerrit.wikimedia.org/r/894658
[14:35:27] ok thanks
[14:35:41] I can also revert the gerrit change if that’s better
[14:35:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P45051 and previous config saved to /var/cache/conftool/dbconfig/20230306-143546-marostegui.json
[14:35:57] we don’t *need* the deployment for anything, just didn’t want to let the master branch drift too far away from production
[14:36:07] 18m Warning Failed pod/termbox-test-5f85f849b7-lltww Error: failed to start container "termbox-test": Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: exec: "nodejs": executable file not found in $PATH: unknown
[14:36:14] hrmph
[14:36:16] Yeah that's exactly the issue I remember
[14:36:24] I saw a phab comment about node14 removing the nodejs→node symlink
[14:36:25] Now to remember the fix x)
[14:36:27] (after you mentioned it)
[14:36:47] (CR) Filippo Giunchedi: "I think with this the alerts will start working as expected (and not raise 'pint' problems due to metrics not found)" [alerts] - https://gerrit.wikimedia.org/r/894656 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[14:37:05] aha, `charts/termbox/templates/deployment.yaml` still has a “nodejs” in it?
[14:37:14] (CR) Filippo Giunchedi: [C: +1] karma: make cloud alertmanager read-only [puppet] - https://gerrit.wikimedia.org/r/894658 (owner: David Caro)
[14:37:41] yeah, switch the helm chart to use node
[14:37:45] o_O so the `entrypoint` in our blubber.yaml actually has no effect?
[14:37:51] and gets overridden by that deployment.yaml?
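On the `entrypoint` question: in Kubernetes, a container's `command` replaces the image's ENTRYPOINT and `args` replaces its CMD, so a `command` set in a chart's deployment.yaml takes precedence over whatever entrypoint the image was built with. A minimal Python sketch of that resolution rule follows. The function name `effective_argv` and the example values (`server.js`, the image entrypoint) are hypothetical illustrations, not taken from the termbox chart or blubber.yaml.

```python
from typing import List, Optional


def effective_argv(
    image_entrypoint: List[str],
    image_cmd: List[str],
    command: Optional[List[str]] = None,
    args: Optional[List[str]] = None,
) -> List[str]:
    """Return the argv a container would start with.

    Mirrors the documented Kubernetes rule:
      - `command` set: it replaces the image ENTRYPOINT and the image CMD
        is ignored; `args` (if any) are appended.
      - only `args` set: image ENTRYPOINT is kept, `args` replace image CMD.
      - neither set: image ENTRYPOINT + image CMD.
    An empty list is treated like an unset field.
    """
    if command:
        return list(command) + list(args or [])
    if args:
        return list(image_entrypoint) + list(args)
    return list(image_entrypoint) + list(image_cmd)


# Hypothetical values: the image was built to start with `node`,
# but the pod spec still says `nodejs`, so the pod spec wins.
image_entrypoint = ["node", "server.js"]
pod_command = ["nodejs"]
pod_args = ["server.js", "-c", "/etc/termbox/config.yaml"]

print(effective_argv(image_entrypoint, [], pod_command, pod_args))
print(effective_argv(image_entrypoint, []))
```

Under these assumptions the first call starts with `nodejs` even though the image itself would have used `node`, which matches the "executable file not found in $PATH" error above; dropping `command` and `args` from the pod spec falls back to the image's own entrypoint.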
[14:37:55] ok I’ll upload a change
[14:38:01] the alternative (if the entrypoint of the image is fine) is to totally remove them
[14:38:14] both will probably work Lucas_WMDE
[14:38:22] hm, the args are different at least
[14:38:30] blubber.yaml doesn’t have /etc/termbox/config.yaml
[14:38:30] unfortunately not all images can claim that
[14:38:39] probably better to just change nodejs to node then
[14:39:27] ah yes, I see that templates/_config.yaml has quite a few things in it, we best stick to that for now
[14:39:59] (PS2) Lucas Werkmeister (WMDE): termbox(prod): update to 2023-03-06-101138-production [deployment-charts] - https://gerrit.wikimedia.org/r/894599 (https://phabricator.wikimedia.org/T309176)
[14:40:01] (PS1) Lucas Werkmeister (WMDE): termbox: fix entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/894659 (https://phabricator.wikimedia.org/T328295)
[14:40:12] let me just quickly check that the old image *actually* works with node too
[14:40:15] (as far as I can tell at least)
[14:41:45] (CR) Lucas Werkmeister (WMDE): termbox: fix entrypoint (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/894659 (https://phabricator.wikimedia.org/T328295) (owner: Lucas Werkmeister (WMDE))
[14:42:02] (PS26) Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123)
[14:42:14] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (MatthewVernon)
[14:42:22] do either of you want to give that a +1 or +2?
[14:42:26] (CR) Clément Goubert: Exclude traindev from tests (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/888227 (owner: Clément Goubert)
[14:42:30] (CR) CI reject: [V: -1] Configure the new ceph servers with mon and mgr daemons [puppet] - https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: Btullis)
[14:42:38] (CR) Clément Goubert: [C: +1] termbox: fix entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/894659 (https://phabricator.wikimedia.org/T328295) (owner: Lucas Werkmeister (WMDE))
[14:42:42] (I also realized I didn’t get anyone’s review on my first change to bump the version, which was bad, sorry)
[14:42:59] I probably wouldn't have caught it tbh
[14:43:08] (CR) Lucas Werkmeister (WMDE): [C: +2] termbox: fix entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/894659 (https://phabricator.wikimedia.org/T328295) (owner: Lucas Werkmeister (WMDE))
[14:43:38] btw Lucas_WMDE, I added a bit of troubleshooting doc for deployers at https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting
[14:43:43] (CR) Herron: [C: +2] grafana: serve grafana/grafana-rw from codfw [puppet] - https://gerrit.wikimedia.org/r/894056 (https://phabricator.wikimedia.org/T329073) (owner: Herron)
[14:43:58] cool, thanks!
[14:43:59] It's not perfect or exhaustive by any means, but I don't know if you'd seen it or not
[14:44:02] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277 (MatthewVernon) @Miriam are you OK to approve this as the relevant manager, please?
[14:44:09] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277 (MatthewVernon)
[14:44:13] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Research intern nickifeajika - https://phabricator.wikimedia.org/T330993 (MatthewVernon)
[14:44:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P45053 and previous config saved to /var/cache/conftool/dbconfig/20230306-144417-marostegui.json
[14:45:00] ok, kube-env and then I can use kubectl
[14:45:18] yup, now I can see the error message
[14:45:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P45054 and previous config saved to /var/cache/conftool/dbconfig/20230306-144547-marostegui.json
[14:47:00] (CR) DCausse: [C: +1] "thanks! :)" [alerts] - https://gerrit.wikimedia.org/r/894656 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[14:47:04] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277 (MatthewVernon) For the analytics-privatedata-users + krb request, this needs approval by @Ottomata or @odimitrijevic - can you approve, please?
[14:47:30] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277 (MatthewVernon)
[14:48:03] claime: so once the change merges and is pulled to deploy2002, I assume I should helmfile apply it to staging+eqiad+codfw
[14:48:18] yes
[14:48:19] and then later, if everything works, I can resum with my second change, bumping the version for prod https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/894599/
[14:48:23] ok
[14:48:27] thanks a lot so far!
[14:48:35] (*resume)
[14:48:47] (Merged) jenkins-bot: termbox: fix entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/894659 (https://phabricator.wikimedia.org/T328295) (owner: Lucas Werkmeister (WMDE))
[14:49:27] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply
[14:49:43] hmmmm
[14:49:51] `git show` already showed the nodejs→node change
[14:50:01] but the diff helmfile shows me still only changes the image version
[14:50:05] Hmm I think you didn't bump the chart version
[14:50:08] and the command, in the context right below and unmodified, is still nodejs
[14:50:10] ah
[14:50:21] * Lucas_WMDE says “no” to helmfile
[14:50:24] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply
[14:50:31] I didn't see the change was in the chart and not in the deployment, my bad
[14:50:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T329203)', diff saved to https://phabricator.wikimedia.org/P45055 and previous config saved to /var/cache/conftool/dbconfig/20230306-145052-marostegui.json
[14:50:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[14:50:57] (CR) Alexandros Kosiaris: [C: +2] WikiKube eqiad: Add the new larger IP space [homer/public] - https://gerrit.wikimedia.org/r/890804 (https://phabricator.wikimedia.org/T326617) (owner: Alexandros Kosiaris)
[14:50:59] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[14:51:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[14:51:37] (PS2) Alexandros Kosiaris: WikiKube codfw: Remove the old IP space [homer/public] - https://gerrit.wikimedia.org/r/890803 (https://phabricator.wikimedia.org/T326617)
[14:51:41]
(PS3) Lucas Werkmeister (WMDE): termbox(prod): update to 2023-03-06-101138-production [deployment-charts] - https://gerrit.wikimedia.org/r/894599 (https://phabricator.wikimedia.org/T309176)
[14:51:43] (PS1) Lucas Werkmeister (WMDE): termbox: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/894662 (https://phabricator.wikimedia.org/T328295)
[14:51:48] (CR) Alexandros Kosiaris: [C: +2] WikiKube codfw: Remove the old IP space (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/890803 (https://phabricator.wikimedia.org/T326617) (owner: Alexandros Kosiaris)
[14:51:58] I tried to find the version, idk if it’s the right file though ^^
[14:52:26] (Merged) jenkins-bot: WikiKube codfw: Remove the old IP space [homer/public] - https://gerrit.wikimedia.org/r/890803 (https://phabricator.wikimedia.org/T326617) (owner: Alexandros Kosiaris)
[14:52:28] (CR) Clément Goubert: [C: +1] termbox: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/894662 (https://phabricator.wikimedia.org/T328295) (owner: Lucas Werkmeister (WMDE))
[14:52:32] Yep, right file :D
[14:52:35] yay
[14:52:49] (CR) Lucas Werkmeister (WMDE): [C: +2] termbox: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/894662 (https://phabricator.wikimedia.org/T328295) (owner: Lucas Werkmeister (WMDE))
[14:53:10] (PS2) Alexandros Kosiaris: WikiKube eqiad: Add the new larger IP space [homer/public] - https://gerrit.wikimedia.org/r/890804 (https://phabricator.wikimedia.org/T326617)
[14:53:30] (PS27) Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123)
[14:53:32] (CR) MVernon: [C: +1] "looks good to me, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/894122 (https://phabricator.wikimedia.org/T331178) (owner: Eevans)
[14:53:52] (CR) CI reject: [V: -1] Configure the new ceph servers with mon and mgr daemons [puppet] - https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: Btullis)
[14:55:35] (PS28) Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123)
[14:55:57] (CR) CI reject: [V: -1] Configure the new ceph servers with mon and mgr daemons [puppet] - https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: Btullis)
[14:56:41] (PS2) Giuseppe Lavagetto: Add check_dns_state to service.Service [software/spicerack] - https://gerrit.wikimedia.org/r/894655
[14:57:40] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (eoghan)
[14:57:50] !log failing grafana over to codfw T329073
[14:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:56] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073
[14:58:30] (Merged) jenkins-bot: termbox: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/894662 (https://phabricator.wikimedia.org/T328295) (owner: Lucas Werkmeister (WMDE))
[14:58:34] (PS29) Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123)
[14:58:51] (CR) CI reject: [V: -1] Configure the new ceph servers with mon and mgr daemons [puppet] - https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: Btullis)
[14:59:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T329260)', diff saved to
https://phabricator.wikimedia.org/P45056 and previous config saved to /var/cache/conftool/dbconfig/20230306-145924-marostegui.json
[14:59:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance
[14:59:30] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[14:59:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance
[14:59:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T329260)', diff saved to https://phabricator.wikimedia.org/P45057 and previous config saved to /var/cache/conftool/dbconfig/20230306-145945-marostegui.json
[14:59:49] ok, it pulled, let’s try again
[14:59:50] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply
[15:00:05] now there’s a lot more diff
[15:00:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328817)', diff saved to https://phabricator.wikimedia.org/P45058 and previous config saved to /var/cache/conftool/dbconfig/20230306-150054-marostegui.json
[15:00:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance
[15:01:01] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[15:01:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance
[15:01:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T328817)', diff saved to https://phabricator.wikimedia.org/P45059 and previous config saved to /var/cache/conftool/dbconfig/20230306-150115-marostegui.json
[15:02:59] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply
[15:03:12] I think I’ll postpone the prod
version upgrade, the window is already overrunning
[15:03:16] but I should at least finish the stuff that’s merged
[15:03:30] oh, the successful apply went much faster than before :D
[15:03:37] jouncebot: now
[15:03:37] No deployments scheduled for the next 1 hour(s) and 26 minute(s)
[15:04:26] !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply
[15:05:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T329260)', diff saved to https://phabricator.wikimedia.org/P45060 and previous config saved to /var/cache/conftool/dbconfig/20230306-150510-marostegui.json
[15:05:17] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[15:05:35] !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply
[15:06:01] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/termbox: apply
[15:06:59] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/termbox: apply
[15:07:54] okay, helmfile -e staging/eqiad/codfw diff --context 5 reports no more differences
[15:07:56] I think I’m done for now
[15:08:08] thanks for your help claime and akosiaris!
[15:08:15] I’ll see when I can get around to deploying https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/894599/
[15:08:17] but for now
[15:08:23] !log UTC afternoon backport+config window done
[15:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:26] Lucas_WMDE: I'm in the process of adding a symlink from /bin/node to /bin/nodejs
[15:08:33] (CR) Lucas Werkmeister (WMDE): [C: +2] "Note for anyone else copying this change: I missed something here, see I80983b2a63."
[deployment-charts] - https://gerrit.wikimedia.org/r/894659 (https://phabricator.wikimedia.org/T328295) (owner: Lucas Werkmeister (WMDE))
[15:08:35] (CR) Lucas Werkmeister (WMDE): [C: +2] "Note for anyone else copying this change: the upgrade from Node 12 to 16 also required I61238c4a10 (and I80983b2a63)." [deployment-charts] - https://gerrit.wikimedia.org/r/894598 (https://phabricator.wikimedia.org/T309176) (owner: Lucas Werkmeister (WMDE))
[15:08:40] So we're backwards compatible and deployers stop running into that issue
[15:09:00] ok, nice
[15:09:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[15:09:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[15:09:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T329203)', diff saved to https://phabricator.wikimedia.org/P45061 and previous config saved to /var/cache/conftool/dbconfig/20230306-150956-marostegui.json
[15:10:04] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[15:11:54] PROBLEM - Hadoop HDFS Namenode FSImage Age on an-master1002 is CRITICAL: FILE_AGE CRITICAL: /srv/hadoop/name/current/VERSION is 7220 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[15:12:47] jouncebot: nowandnext
[15:12:47] No deployments scheduled for the next 1 hour(s) and 17 minute(s)
[15:12:47] In 1 hour(s) and 17 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230306T1630)
[15:12:56] (PS2) Zabe: Add logo for azwikimedia and vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/894143 (https://phabricator.wikimedia.org/T331177)
[15:12:59] (CR) Zabe: [C: +2] Add logo for azwikimedia and vewikimedia
[mediawiki-config] - https://gerrit.wikimedia.org/r/894143 (https://phabricator.wikimedia.org/T331177) (owner: Zabe)
[15:13:23] (CR) Eevans: [C: +2] swift: add ms-fe201[3-4] as new Swift proxy nodes [puppet] - https://gerrit.wikimedia.org/r/894122 (https://phabricator.wikimedia.org/T331178) (owner: Eevans)
[15:13:34] (PS1) Andrew Bogott: Don't page for labweb-ssl service [puppet] - https://gerrit.wikimedia.org/r/894664 (https://phabricator.wikimedia.org/T331197)
[15:13:49] (Merged) jenkins-bot: Add logo for azwikimedia and vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/894143 (https://phabricator.wikimedia.org/T331177) (owner: Zabe)
[15:14:06] SRE, Traffic, serviceops, Datacenter-Switchover: March 2023 Traffic Repool checklist - https://phabricator.wikimedia.org/T331285 (Clement_Goubert)
[15:14:43] !log zabe@deploy2002 Started scap: Backport for [[gerrit:894143|Add logo for azwikimedia and vewikimedia (T331177)]]
[15:14:48] T331177: Set custom logo for azwikimedia and vewikimedia - https://phabricator.wikimedia.org/T331177
[15:15:21] (PS2) Hnowlan: service, k8s: Add service definitions for rest-gateway [puppet] - https://gerrit.wikimedia.org/r/891510 (https://phabricator.wikimedia.org/T329049)
[15:15:34] (CR) Hnowlan: service, k8s: Add service definitions for rest-gateway (1 comment) [puppet] - https://gerrit.wikimedia.org/r/891510 (https://phabricator.wikimedia.org/T329049) (owner: Hnowlan)
[15:15:48] godog: This will prevent a repeat of Saturday's page?
https://gerrit.wikimedia.org/r/c/operations/puppet/+/894664
[15:16:31] !log zabe@deploy2002 zabe: Backport for [[gerrit:894143|Add logo for azwikimedia and vewikimedia (T331177)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[15:17:24] (CR) David Caro: [C: +2] karma: make cloud alertmanager read-only [puppet] - https://gerrit.wikimedia.org/r/894658 (owner: David Caro)
[15:17:57] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe2013.codfw.wmnet
[15:20:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P45062 and previous config saved to /var/cache/conftool/dbconfig/20230306-152017-marostegui.json
[15:22:52] (PS2) Ottomata: flink-kubernetes-operator - upgrade to upstream 1.4.0 release [deployment-charts] - https://gerrit.wikimedia.org/r/894643 (https://phabricator.wikimedia.org/T331282)
[15:23:14] (PS4) Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321)
[15:23:16] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:894143|Add logo for azwikimedia and vewikimedia (T331177)]] (duration: 08m 33s)
[15:23:22] T331177: Set custom logo for azwikimedia and vewikimedia - https://phabricator.wikimedia.org/T331177
[15:23:42] (PS1) Gerrit maintenance bot: mariadb: Promote db1130 to s5 master [puppet] - https://gerrit.wikimedia.org/r/894571 (https://phabricator.wikimedia.org/T331302)
[15:23:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:25:01] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host
ml-serve1007.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[15:25:46] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@2fa7484]: (no justification provided)
[15:25:49] (CR) Muehlenhoff: [C: +2] clouddumps: Enable profile::auto_restarts::service for rsyncd [puppet] - https://gerrit.wikimedia.org/r/881391 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[15:26:03] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@2fa7484]: (no justification provided) (duration: 00m 17s)
[15:26:07] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (bking) Update: `elastic1053-59` have been re-racked. The remaining hosts (`elastic1060-66`, all in row D) should be finished by Wednesday. See [[ https://etherpad.w...
[15:26:16] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.741 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:26:28] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2013.codfw.wmnet
[15:26:54] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.937 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:28:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T328817)', diff saved to https://phabricator.wikimedia.org/P45063 and previous config saved to /var/cache/conftool/dbconfig/20230306-152801-marostegui.json
[15:28:08] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[15:28:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:29:00] PROBLEM
- Host ml-serve1007 is DOWN: PING CRITICAL - Packet loss = 100%
[15:29:23] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe2013.codfw.wmnet
[15:29:37] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe2014.codfw.wmnet
[15:30:12] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on ml-serve1007.eqiad.wmnet with reason: testing provision cookbook
[15:30:25] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ml-serve1007.eqiad.wmnet with reason: testing provision cookbook
[15:30:31] ml-serve1007 is me, sorry for the noise
[15:30:47] (CR) Clément Goubert: "This change is ready for review." [docker-images/production-images] - https://gerrit.wikimedia.org/r/894687 (owner: Clément Goubert)
[15:31:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T329203)', diff saved to https://phabricator.wikimedia.org/P45064 and previous config saved to /var/cache/conftool/dbconfig/20230306-153111-marostegui.json
[15:31:18] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[15:31:40] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:32:04] this is ml-serve1007 --^
[15:32:58] (KubernetesCalicoDown) firing: ml-serve1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:33:06] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277
(Ottomata) Approved.
[15:34:48] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:35:00] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2013.codfw.wmnet
[15:35:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[15:35:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P45065 and previous config saved to /var/cache/conftool/dbconfig/20230306-153524-marostegui.json
[15:36:08] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 477 days) https://wikitech.wikimedia.org/wiki/Logs
[15:36:31] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2014.codfw.wmnet
[15:37:03] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Jelto)
[15:38:18] RECOVERY - Host ml-serve1007 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[15:38:54] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:39:28] !log otto@deploy2002 Started deploy [analytics/refinery@d4d723a] (hadoop-test): (no justification provided)
[15:39:48] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) -
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:40:53] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277 (Miriam) Approved, thanks!
[15:40:55] !log otto@deploy2002 Finished deploy [analytics/refinery@d4d723a] (hadoop-test): (no justification provided) (duration: 01m 27s)
[15:41:04] (PS1) Ssingh: P:cumin: update alias for dns-auth to reflect changes to dns roles [puppet] - https://gerrit.wikimedia.org/r/894688 (https://phabricator.wikimedia.org/T330670)
[15:42:58] (KubernetesCalicoDown) resolved: ml-serve1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:43:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P45066 and previous config saved to /var/cache/conftool/dbconfig/20230306-154308-marostegui.json
[15:43:35] (CR) Ssingh: "Another way of doing this would be to get rid of the distinct dns-auth alias -- and thus updating the cookbook -- and having just one alia" [puppet] - https://gerrit.wikimedia.org/r/894688 (https://phabricator.wikimedia.org/T330670) (owner: Ssingh)
[15:44:16] (CR) Ssingh: P:cumin: update alias for dns-auth to reflect changes to dns roles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/894688 (https://phabricator.wikimedia.org/T330670) (owner: Ssingh)
[15:44:25] !log otto@deploy2002 Started deploy [analytics/refinery@ee8981b] (hadoop-test): (no justification provided)
[15:45:36] (PS1) Herron: grafana:
remove -next suffix from codfw grafana domains names [puppet] - https://gerrit.wikimedia.org/r/894689
[15:45:50] !log otto@deploy2002 Finished deploy [analytics/refinery@ee8981b] (hadoop-test): (no justification provided) (duration: 01m 25s)
[15:46:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P45067 and previous config saved to /var/cache/conftool/dbconfig/20230306-154618-marostegui.json
[15:47:17] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:48:14] ^ anyone know what's up here?
[15:48:30] Taking a look.
[15:48:37] looking as well
[15:48:43] looks like excess POST of some kind at first glance
[15:49:11] if it's parsoid it's most likely internal traffic
[15:49:46] I don't see a clear immediate spike in incoming traffic (both in general and specifically just POSTs)
[15:49:46] (CR) Herron: "https://puppet-compiler.wmflabs.org/output/894689/39978/" [puppet] - https://gerrit.wikimedia.org/r/894689 (owner: Herron)
[15:49:53] ok
[15:50:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T329260)', diff saved to https://phabricator.wikimedia.org/P45068 and previous config saved to /var/cache/conftool/dbconfig/20230306-155030-marostegui.json
[15:50:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[15:50:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[15:50:53] T329260: Drop cuc_comment from cu_changes in wmf wikis -
https://phabricator.wikimedia.org/T329260 [15:52:17] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:54:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [15:54:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [15:54:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T329260)', diff saved to https://phabricator.wikimedia.org/P45069 and previous config saved to /var/cache/conftool/dbconfig/20230306-155428-marostegui.json [15:56:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T329260)', diff saved to https://phabricator.wikimedia.org/P45070 and previous config saved to /var/cache/conftool/dbconfig/20230306-155638-marostegui.json [15:56:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277 (10MatthewVernon) [15:56:45] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [15:58:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P45071 and previous config saved to /var/cache/conftool/dbconfig/20230306-155815-marostegui.json [16:00:07] (03CR) 10Filippo Giunchedi: [C: 03+1] Don't page for labweb-ssl service [puppet] - 10https://gerrit.wikimedia.org/r/894664 (https://phabricator.wikimedia.org/T331197) (owner: 10Andrew Bogott) [16:00:12] andrewbogott: yes that's right [16:00:32] great, 
thank you. I'll merge and then try another go at deploying horizon. [16:00:34] PROBLEM - Check systemd state on ml-serve1007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:44] (03CR) 10Andrew Bogott: [C: 03+2] Don't page for labweb-ssl service [puppet] - 10https://gerrit.wikimedia.org/r/894664 (https://phabricator.wikimedia.org/T331197) (owner: 10Andrew Bogott) [16:01:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P45072 and previous config saved to /var/cache/conftool/dbconfig/20230306-160124-marostegui.json [16:01:26] !log eevans@puppetmaster1001 conftool action : set/pooled=yes; selector: service=swift,name=ms-fe2013.codfw.wmnet [16:01:37] (03CR) 10Filippo Giunchedi: [C: 03+2] search-platform: restrict RDF streaming updater ops alerts [alerts] - 10https://gerrit.wikimedia.org/r/894656 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [16:01:48] !log eevans@puppetmaster1001 conftool action : set/pooled=yes; selector: service=swift,name=ms-fe2013.codfw.wmnet [16:01:49] (03CR) 10Filippo Giunchedi: [C: 03+1] grafana: remove -next suffix from codfw grafana domains names [puppet] - 10https://gerrit.wikimedia.org/r/894689 (owner: 10Herron) [16:01:56] !log eevans@puppetmaster1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe2013.codfw.wmnet [16:02:46] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277 (10MatthewVernon) @Miriam sorry, I forgot to ask: can I confirm that this is a time-limited account, and you are the contact regarding expiry, please? And can you tel... 
[16:02:54] !log eevans@puppetmaster1001 conftool action : set/weight=40; selector: name=ms-fe2013.codfw.wmnet [16:03:04] !log eevans@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2013.codfw.wmnet [16:04:55] !log eevans@puppetmaster1001 conftool action : set/pooled=yes; selector: service=swift,name=ms-fe2014.codfw.wmnet [16:05:07] !log eevans@puppetmaster1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe2014.codfw.wmnet [16:05:16] !log eevans@puppetmaster1001 conftool action : set/weight=40; selector: name=ms-fe2014.codfw.wmnet [16:05:24] !log eevans@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2014.codfw.wmnet [16:05:27] (03PS3) 10Sbailey: Enable new Linter UI for namespace, tag and template for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893811 (https://phabricator.wikimedia.org/T299612) [16:06:10] (03CR) 10Volans: "One question and some comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/894688 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [16:10:42] (03CR) 10Filippo Giunchedi: "I like the general idea, though I'm not 100% sold on having to manage/deviate from what the distro ships for a special/corner case like th" [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [16:11:14] (03PS8) 10JHathaway: Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) [16:11:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P45073 and previous config saved to /var/cache/conftool/dbconfig/20230306-161144-marostegui.json [16:11:56] (03CR) 10Muehlenhoff: Ship custom /etc/logrotate.d/rsyslog on KDC hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [16:12:31] (03CR) 10JHathaway: Purge unused kernels on boot 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [16:13:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T328817)', diff saved to https://phabricator.wikimedia.org/P45074 and previous config saved to /var/cache/conftool/dbconfig/20230306-161321-marostegui.json [16:13:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [16:13:29] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [16:13:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [16:14:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [16:15:08] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T329203)', diff saved to https://phabricator.wikimedia.org/P45075 and previous config saved to /var/cache/conftool/dbconfig/20230306-161631-marostegui.json [16:16:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [16:16:37] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [16:16:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [16:16:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T329203)', diff saved to https://phabricator.wikimedia.org/P45076 and previous config 
saved to /var/cache/conftool/dbconfig/20230306-161652-marostegui.json [16:17:56] (03CR) 10Ssingh: P:cumin: update alias for dns-auth to reflect changes to dns roles (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/894688 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [16:18:09] (03CR) 10Muehlenhoff: P:rsyslog: manage /etc/logrotate.d/rsyslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [16:18:11] (03PS2) 10Ssingh: P:cumin: update alias for dns-auth to reflect changes to dns roles [puppet] - 10https://gerrit.wikimedia.org/r/894688 (https://phabricator.wikimedia.org/T330670) [16:18:16] (03CR) 10JHathaway: [C: 03+2] Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [16:18:53] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/894688 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [16:19:29] (03PS9) 10JHathaway: Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) [16:19:40] (03CR) 10JHathaway: [V: 03+2] Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [16:20:12] (03CR) 10Muehlenhoff: [C: 03+2] Ship custom /etc/logrotate.d/rsyslog on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/894630 (https://phabricator.wikimedia.org/T331123) (owner: 10Muehlenhoff) [16:21:44] 10SRE-swift-storage: Bring ms-fe201[3-4] into service - https://phabricator.wikimedia.org/T331178 (10Eevans) This is complete. 
[16:21:54] PROBLEM - puppet last run on krb2001 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:22:40] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:24:10] RECOVERY - Check systemd state on ml-serve1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:49] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10Gehel) [16:26:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P45077 and previous config saved to /var/cache/conftool/dbconfig/20230306-162651-marostegui.json [16:27:44] RECOVERY - puppet last run on krb2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:28:37] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/894688 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [16:29:29] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1007.mgmt.eqiad.wmnet with reboot policy GRACEFUL [16:30:05] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230306T1630). 
[16:32:32] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-restbase rolling restart_daemons on A:restbase-codfw [16:32:42] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10MatthewVernon) @jbond I dunno if you have any thoughts about this? I've had a look at the iDRAC, and it has one of the SSDs as the boot device to try (... [16:37:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:38:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:38:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T328817)', diff saved to https://phabricator.wikimedia.org/P45078 and previous config saved to /var/cache/conftool/dbconfig/20230306-163806-marostegui.json [16:38:17] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [16:38:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:39:10] (03PS30) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [16:41:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T329260)', diff saved to https://phabricator.wikimedia.org/P45079 and previous config saved to /var/cache/conftool/dbconfig/20230306-164158-marostegui.json [16:42:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2158.codfw.wmnet with reason: 
Maintenance [16:42:05] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [16:42:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-restbase (exit_code=0) rolling restart_daemons on A:restbase-codfw [16:42:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [16:42:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [16:42:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [16:42:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T329260)', diff saved to https://phabricator.wikimedia.org/P45080 and previous config saved to /var/cache/conftool/dbconfig/20230306-164245-marostegui.json [16:43:07] ACKNOWLEDGEMENT - Hadoop HDFS Namenode FSImage Age on an-master1002 is CRITICAL: FILE_AGE CRITICAL: /srv/hadoop/name/current/VERSION is 12688 seconds old and 217 bytes Btullis This is not a real problem. 
It is caused by the fact that we have failed over to the standby namenodes in preparation for T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [16:43:24] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:46:33] (PuppetFailure) firing: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:48:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T329260)', diff saved to https://phabricator.wikimedia.org/P45081 and previous config saved to /var/cache/conftool/dbconfig/20230306-164808-marostegui.json [16:48:16] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [16:49:06] !log andrew@deploy2002 Started deploy [horizon/deploy@9d02cd6]: Updating member dashboard to reflect new role names (take two) -- T330759 [16:49:12] T330759: Modernize openstack rbac - https://phabricator.wikimedia.org/T330759 [16:51:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:51:58] (03PS1) 10Filippo Giunchedi: Revert "karma: make cloud alertmanager read-only" [puppet] - 10https://gerrit.wikimedia.org/r/894674 [16:52:20] (03CR) 10CI 
reject: [V: 04-1] Revert "karma: make cloud alertmanager read-only" [puppet] - 10https://gerrit.wikimedia.org/r/894674 (owner: 10Filippo Giunchedi) [16:52:37] 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10RLazarus) @akosiaris and @Clement_Goubert will come up with a cluster layout this week, and @Clement_Goubert wanted to try putting at least one or two into service themselves. Feel free to assign t... [16:52:43] (03PS2) 10Filippo Giunchedi: Revert "karma: make cloud alertmanager read-only" [puppet] - 10https://gerrit.wikimedia.org/r/894674 [16:52:52] dcaro: FYI ^ [16:53:07] oh, looking [16:53:11] godog: fyi https://github.com/prymitive/karma/pull/5086 [16:53:30] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:53:37] oooooh, I see [16:53:48] very cool re: PR [16:53:53] (03CR) 10David Caro: [C: 03+2] Revert "karma: make cloud alertmanager read-only" [puppet] - 10https://gerrit.wikimedia.org/r/894674 (owner: 10Filippo Giunchedi) [16:54:06] (03CR) 10David Caro: [C: 03+2] Revert "karma: make cloud alertmanager read-only" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894674 (owner: 10Filippo Giunchedi) [16:54:26] !log andrew@deploy2002 Finished deploy [horizon/deploy@9d02cd6]: Updating member dashboard to reflect new role names (take two) -- T330759 (duration: 05m 19s) [16:54:32] T330759: Modernize openstack rbac - https://phabricator.wikimedia.org/T330759 [16:54:35] dcaro: I think it is fine to leave proxy on but remove readonly for now ? [16:54:44] or you'd rather remove both ? 
[16:54:52] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:55:04] godog: the issue is that when you try to create a silence, it will try to create it also on the cloud alertmanager and show an error message [16:55:11] not a big issue though, it will create it anyhow [16:55:32] (shows an ugly "unauthorized" html text in a box xd) [16:55:56] ah mhhh ok yeah I can see how that's going to be confusing [16:56:30] but it'll work without proxying dcaro ? [16:56:42] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:56:57] godog: I think yes, can we try it out? [16:58:19] dcaro: for sure, do you mind doing the honors? 
I'm going to jump into the sre meeting in two minutes [16:58:36] or going ahead with just the read only revert is fine, whichever you prefer [16:58:53] I'll test 👍 [16:58:57] (03PS1) 10Effie Mouzeli: Add kubernetes102[3,4] to the wikikube-codfw cluster 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/894697 (https://phabricator.wikimedia.org/T313874) [16:59:04] SGTM, cheers [16:59:36] (03PS2) 10Effie Mouzeli: Add kubernetes102[3,4] to the wikikube-eqiad cluster 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/894697 (https://phabricator.wikimedia.org/T313874) [17:01:14] (03CR) 10Vgutierrez: [C: 03+2] Revert "hiera: Enable ESI testing in cp4044" [puppet] - 10https://gerrit.wikimedia.org/r/894557 (owner: 10Vgutierrez) [17:01:46] (03PS1) 10David Caro: karma: fix usage of proxy when readonly [puppet] - 10https://gerrit.wikimedia.org/r/894698 [17:01:50] tested, it works, sending patch [17:02:38] (03CR) 10David Caro: [C: 03+2] karma: fix usage of proxy when readonly [puppet] - 10https://gerrit.wikimedia.org/r/894698 (owner: 10David Caro) [17:03:01] (03Abandoned) 10David Caro: Revert "karma: make cloud alertmanager read-only" [puppet] - 10https://gerrit.wikimedia.org/r/894674 (owner: 10Filippo Giunchedi) [17:03:11] (03CR) 10Herron: [C: 03+2] grafana: remove -next suffix from codfw grafana domains names [puppet] - 10https://gerrit.wikimedia.org/r/894689 (owner: 10Herron) [17:03:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P45082 and previous config saved to /var/cache/conftool/dbconfig/20230306-170315-marostegui.json [17:06:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328817)', diff saved to https://phabricator.wikimedia.org/P45083 and previous config saved to /var/cache/conftool/dbconfig/20230306-170657-marostegui.json [17:07:05] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - 
https://phabricator.wikimedia.org/T328817 [17:11:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:13:10] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frpm1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T329752 (10Jgreen) [17:13:10] dcaro: thank you <3 [17:17:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T329203)', diff saved to https://phabricator.wikimedia.org/P45084 and previous config saved to /var/cache/conftool/dbconfig/20230306-171708-marostegui.json [17:17:15] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [17:18:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P45085 and previous config saved to /var/cache/conftool/dbconfig/20230306-171821-marostegui.json [17:21:33] (PuppetFailure) resolved: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:21:56] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10BBlack) 05Resolved→03Open >>! In T330906#8661013, @Ennomeijers wrote: > As I already mentioned earlier, the SPARQL endpoint and the RDF serialized data all use the HTTP ver... 
[17:22:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P45086 and previous config saved to /var/cache/conftool/dbconfig/20230306-172205-marostegui.json [17:22:11] (03PS1) 10Effie Mouzeli: Add kubernetes102[3,4] to the wikikube-eqiad cluster 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/894700 (https://phabricator.wikimedia.org/T313874) [17:24:24] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10herron) [17:25:56] (03PS1) 10Effie Mouzeli: Add kubernetes102[3,4] to the wikikube-eqiad cluster 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/894701 (https://phabricator.wikimedia.org/T313874) [17:29:30] (03PS4) 10Jbond: mod_auth_cas: add logout script for mod_auth_cas [puppet] - 10https://gerrit.wikimedia.org/r/695255 [17:32:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P45087 and previous config saved to /var/cache/conftool/dbconfig/20230306-173215-marostegui.json [17:33:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T329260)', diff saved to https://phabricator.wikimedia.org/P45088 and previous config saved to /var/cache/conftool/dbconfig/20230306-173328-marostegui.json [17:33:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [17:33:35] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [17:33:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [17:33:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T329260)', diff saved to https://phabricator.wikimedia.org/P45089 and previous config saved to 
/var/cache/conftool/dbconfig/20230306-173350-marostegui.json [17:34:58] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [17:35:25] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ssingh) [17:37:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P45090 and previous config saved to /var/cache/conftool/dbconfig/20230306-173711-marostegui.json [17:38:31] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Clement_Goubert) 05Open→03Resolved [17:38:39] 10SRE, 10MW-on-K8s, 10serviceops: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10Clement_Goubert) [17:39:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T329260)', diff saved to https://phabricator.wikimedia.org/P45091 and previous config saved to /var/cache/conftool/dbconfig/20230306-173927-marostegui.json [17:39:32] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Clement_Goubert) p:05Triage→03Medium [17:39:35] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [17:40:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39982/console" [puppet] - 10https://gerrit.wikimedia.org/r/695255 (owner: 10Jbond) [17:40:33] (03PS4) 10Sbailey: Enable new Linter UI for namespace, tag and template for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893811 
(https://phabricator.wikimedia.org/T299612) [17:41:09] (03PS5) 10Sbailey: Enable new Linter UI for namespace, tag and template for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893811 (https://phabricator.wikimedia.org/T299612) [17:41:36] (03CR) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [17:41:43] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:42:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:46:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Cmjohnson) [17:47:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P45092 and previous config saved to /var/cache/conftool/dbconfig/20230306-174721-marostegui.json [17:52:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328817)', diff saved to https://phabricator.wikimedia.org/P45093 and previous config saved to /var/cache/conftool/dbconfig/20230306-175218-marostegui.json [17:52:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:52:26] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [17:52:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:52:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:52:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:52:55] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Depooling db2156 (T328817)', diff saved to https://phabricator.wikimedia.org/P45094 and previous config saved to /var/cache/conftool/dbconfig/20230306-175254-marostegui.json [17:54:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P45095 and previous config saved to /var/cache/conftool/dbconfig/20230306-175433-marostegui.json [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230306T1800) [18:00:05] ryankemper: That opportune time is upon us again. Time for a Wikidata Query Service weekly deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230306T1800). [18:02:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T329203)', diff saved to https://phabricator.wikimedia.org/P45096 and previous config saved to /var/cache/conftool/dbconfig/20230306-180228-marostegui.json [18:02:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [18:02:38] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [18:02:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [18:02:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T329203)', diff saved to https://phabricator.wikimedia.org/P45097 and previous config saved to /var/cache/conftool/dbconfig/20230306-180249-marostegui.json [18:05:14] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10jbond) >>! In T326352#8669104, @MatthewVernon wrote: > @jbond I dunno if you have any thoughts about this? 
I've had a look at the iDRAC, and it has one... [18:09:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P45098 and previous config saved to /var/cache/conftool/dbconfig/20230306-180940-marostegui.json [18:10:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328817)', diff saved to https://phabricator.wikimedia.org/P45099 and previous config saved to /var/cache/conftool/dbconfig/20230306-181017-marostegui.json [18:10:26] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [18:12:28] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1035'] [18:21:22] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1035'] [18:21:39] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1035'] [18:21:57] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1035'] [18:23:28] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1035'] [18:24:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T329203)', diff saved to https://phabricator.wikimedia.org/P45100 and previous config saved to /var/cache/conftool/dbconfig/20230306-182402-marostegui.json [18:24:08] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [18:24:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T329260)', diff saved to https://phabricator.wikimedia.org/P45101 and previous config saved to 
/var/cache/conftool/dbconfig/20230306-182447-marostegui.json [18:24:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [18:24:53] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [18:25:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [18:25:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T329260)', diff saved to https://phabricator.wikimedia.org/P45102 and previous config saved to /var/cache/conftool/dbconfig/20230306-182508-marostegui.json [18:25:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P45103 and previous config saved to /var/cache/conftool/dbconfig/20230306-182524-marostegui.json [18:25:43] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1036'] [18:30:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T329260)', diff saved to https://phabricator.wikimedia.org/P45104 and previous config saved to /var/cache/conftool/dbconfig/20230306-183040-marostegui.json [18:30:48] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [18:33:55] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1035'] [18:34:19] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1035'] [18:34:34] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1035'] [18:34:45] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts 
['cloudcephosd1035'] [18:38:15] !log phabricator - locked and archived project acl*discovery-repository-admins (T324171) [18:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:22] T324171: Audit Diffusion-Repository-Administrators group membership and rights - https://phabricator.wikimedia.org/T324171 [18:39:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P45105 and previous config saved to /var/cache/conftool/dbconfig/20230306-183908-marostegui.json [18:40:18] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1035'] [18:40:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P45106 and previous config saved to /var/cache/conftool/dbconfig/20230306-184030-marostegui.json [18:45:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P45107 and previous config saved to /var/cache/conftool/dbconfig/20230306-184547-marostegui.json [18:54:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P45108 and previous config saved to /var/cache/conftool/dbconfig/20230306-185415-marostegui.json [18:55:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328817)', diff saved to https://phabricator.wikimedia.org/P45109 and previous config saved to /var/cache/conftool/dbconfig/20230306-185537-marostegui.json [18:55:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [18:55:44] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [18:55:53] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [18:55:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T328817)', diff saved to https://phabricator.wikimedia.org/P45110 and previous config saved to /var/cache/conftool/dbconfig/20230306-185559-marostegui.json [18:56:25] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1036'] [18:58:00] SRE, Traffic, Wikidata, wdwb-tech: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (Ennomeijers) Ok, I see your point. As long as the concept/canonical URIs for all entities are being published as http:// URIs there is no other way than following the 301 redir... [19:00:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P45111 and previous config saved to /var/cache/conftool/dbconfig/20230306-190054-marostegui.json [19:07:27] (PS1) Ilias Sarantopoulos: httpbb: add tests for nsfw model on liftwing [puppet] - https://gerrit.wikimedia.org/r/894714 [19:09:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T329203)', diff saved to https://phabricator.wikimedia.org/P45112 and previous config saved to /var/cache/conftool/dbconfig/20230306-190921-marostegui.json [19:09:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance [19:09:29] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [19:09:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance [19:09:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1192 
(T329203)', diff saved to https://phabricator.wikimedia.org/P45113 and previous config saved to /var/cache/conftool/dbconfig/20230306-190943-marostegui.json [19:10:31] (CR) Dzahn: "please see modules/profile/manifests/httpbb.pp as well. it will also need a "httpbb::test_suite" in there to actually install these on dep" [puppet] - https://gerrit.wikimedia.org/r/894714 (owner: Ilias Sarantopoulos) [19:16:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T329260)', diff saved to https://phabricator.wikimedia.org/P45114 and previous config saved to /var/cache/conftool/dbconfig/20230306-191600-marostegui.json [19:16:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [19:16:07] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [19:16:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [19:16:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T329260)', diff saved to https://phabricator.wikimedia.org/P45115 and previous config saved to /var/cache/conftool/dbconfig/20230306-191622-marostegui.json [19:18:07] (CR) Dzahn: "seems to me like cgoubert was interested in this as well and mentioned it during DC switchover somewhere. 
the part about allowing multiple" [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm) [19:18:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T329260)', diff saved to https://phabricator.wikimedia.org/P45116 and previous config saved to /var/cache/conftool/dbconfig/20230306-191835-marostegui.json [19:19:00] (PS1) Jbond: openstack: basic bookworm classes [puppet] - https://gerrit.wikimedia.org/r/894717 [19:19:59] SRE, Traffic, Traffic-Icebox, HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (BCornwall) I suspect we generate little revenue for them and I don't see any sort of "Businesses that rely on Shopify" section on their site (they seem to prefer showing... [19:23:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328817)', diff saved to https://phabricator.wikimedia.org/P45117 and previous config saved to /var/cache/conftool/dbconfig/20230306-192322-marostegui.json [19:23:30] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [19:28:25] (CR) Ilias Sarantopoulos: httpbb: add tests for nsfw model on liftwing (1 comment) [puppet] - https://gerrit.wikimedia.org/r/894714 (owner: Ilias Sarantopoulos) [19:31:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T329203)', diff saved to https://phabricator.wikimedia.org/P45118 and previous config saved to /var/cache/conftool/dbconfig/20230306-193123-marostegui.json [19:31:31] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [19:33:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P45119 and previous config saved to 
/var/cache/conftool/dbconfig/20230306-193341-marostegui.json [19:34:39] (CR) D3r1ck01: [C: +2] proton: Use latest image build [deployment-charts] - https://gerrit.wikimedia.org/r/894718 (owner: D3r1ck01) [19:38:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P45120 and previous config saved to /var/cache/conftool/dbconfig/20230306-193829-marostegui.json [19:39:22] (Merged) jenkins-bot: proton: Use latest image build [deployment-charts] - https://gerrit.wikimedia.org/r/894718 (owner: D3r1ck01) [19:39:58] (CR) Dzahn: httpbb: add tests for nsfw model on liftwing (1 comment) [puppet] - https://gerrit.wikimedia.org/r/894714 (owner: Ilias Sarantopoulos) [19:44:50] !log derick@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [19:45:50] !log derick@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [19:46:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P45121 and previous config saved to /var/cache/conftool/dbconfig/20230306-194630-marostegui.json [19:47:06] !log derick@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [19:48:35] !log derick@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [19:48:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P45122 and previous config saved to /var/cache/conftool/dbconfig/20230306-194848-marostegui.json [19:49:42] !log derick@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [19:51:30] !log derick@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [19:53:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P45123 and previous config saved to 
/var/cache/conftool/dbconfig/20230306-195336-marostegui.json [20:01:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P45124 and previous config saved to /var/cache/conftool/dbconfig/20230306-200136-marostegui.json [20:02:07] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (herron) [20:03:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T329260)', diff saved to https://phabricator.wikimedia.org/P45125 and previous config saved to /var/cache/conftool/dbconfig/20230306-200354-marostegui.json [20:04:03] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [20:04:25] !log herron@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons. [20:08:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328817)', diff saved to https://phabricator.wikimedia.org/P45126 and previous config saved to /var/cache/conftool/dbconfig/20230306-200843-marostegui.json [20:08:52] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [20:14:28] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:16:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T329203)', diff saved to https://phabricator.wikimedia.org/P45127 and previous config saved to /var/cache/conftool/dbconfig/20230306-201643-marostegui.json 
[20:16:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance [20:16:53] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [20:16:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance [20:17:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T329203)', diff saved to https://phabricator.wikimedia.org/P45128 and previous config saved to /var/cache/conftool/dbconfig/20230306-201704-marostegui.json [20:36:04] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:38:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T329203)', diff saved to https://phabricator.wikimedia.org/P45129 and previous config saved to /var/cache/conftool/dbconfig/20230306-203816-marostegui.json [20:38:26] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [20:45:41] (PS1) DCausse: Properly pass the page id on page moves [extensions/CirrusSearch] (wmf/1.40.0-wmf.25) - https://gerrit.wikimedia.org/r/894677 (https://phabricator.wikimedia.org/T331127) [20:53:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P45130 and previous config saved to /var/cache/conftool/dbconfig/20230306-205322-marostegui.json [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230306T2100) [21:00:05] sbailey: A patch you scheduled for UTC late backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:21] I am here :-) [21:00:28] I can deploy [21:00:52] Great, config change enabled dark launched linter UI code [21:01:00] thanks [21:01:22] (CR) Zabe: [C: +2] Enable new Linter UI for namespace, tag and template for group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/893811 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey) [21:01:27] SRE, Traffic, Wikidata, wdwb-tech: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (TheDJ) The problem is identifiers vs urls. An identifier is stable. A url might not be. If you start using locators as identifiers.... things become gray. Then again. The spec... [21:02:08] (Merged) jenkins-bot: Enable new Linter UI for namespace, tag and template for group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/893811 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey) [21:02:50] !log zabe@deploy2002 Started scap: Backport for [[gerrit:893811|Enable new Linter UI for namespace, tag and template for group0 wikis (T299612)]] [21:03:03] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [21:04:02] SRE, Traffic, Sustainability (Incident Followup): cp3050 seemd more affected then otheres in recent incident - https://phabricator.wikimedia.org/T330682 (BCornwall) p:Triage→High [21:04:33] !log zabe@deploy2002 zabe and sbailey: Backport for [[gerrit:893811|Enable new Linter UI for namespace, tag and template for group0 wikis (T299612)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:04:35] sbailey: can you test? [21:05:02] looking at test.wikipedia.org. Should see the new UI in Special Pages: Linter [21:05:32] looking now. 
Does this need to be synced for the config to be enabled on test.wikipedia.org? [21:07:22] the config change has been synced to the debug hosts so that they can be tested, have you tested changes through those before? [21:08:01] I have not, only through Beta for the last week. What is the debug host URL? [21:08:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P45131 and previous config saved to /var/cache/conftool/dbconfig/20230306-210829-marostegui.json [21:08:34] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:09:06] sbailey: you can connect to the debug hosts with a browser extension, see https://wikitech.wikimedia.org/wiki/WikimediaDebug [21:10:12] I will study how to use the debug hosts, but in the meantime, let's sync group 0, confident it is safe [21:13:43] (tested myself, syncing) [21:15:31] Should see new tag and template search fields in UI [21:18:58] It is live now, can see the new UI form elements :-). Thanks Zabe, playing with the debug extension [21:19:09] yw [21:19:42] Any recommendations on what URL to pass in while the debug extension is the interface? [21:19:50] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:893811|Enable new Linter UI for namespace, tag and template for group0 wikis (T299612)]] (duration: 16m 59s) [21:19:56] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [21:20:03] I need to deploy to group 1 and 2 over the week [21:21:52] when you enable the browser extension, you can normally browse the pages and test the new config. 
The extension automatically sets the header so that requests get directed to the debug hosts. [21:22:09] So you can then just go to Special:LintErrors and test the new config [21:23:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T329203)', diff saved to https://phabricator.wikimedia.org/P45132 and previous config saved to /var/cache/conftool/dbconfig/20230306-212336-marostegui.json [21:23:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1203.eqiad.wmnet with reason: Maintenance [21:23:47] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [21:23:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1203.eqiad.wmnet with reason: Maintenance [21:23:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1203 (T329203)', diff saved to https://phabricator.wikimedia.org/P45133 and previous config saved to /var/cache/conftool/dbconfig/20230306-212358-marostegui.json [21:24:01] Ah ok, great, so test.wikipedia.org URL will go to test host when the extension is enabled for the page. Got it, that was simpler than I had expected [21:24:12] Again thanks @zabe [21:26:40] SRE, Traffic, Wikidata, wdwb-tech: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (BBlack) Open→Resolved The redirects are neither //good// nor //bad//, they're instead both necessary (although that necessity is waning) and insecure. We thought we ha... 
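For anyone who wants to hit a debug host from a script instead of the browser extension, here is a rough sketch of the header approach described at 21:21:52. The header name `X-Wikimedia-Debug` and the `backend=<host>` value format are assumptions taken from the WikimediaDebug documentation linked above, not from this log; the helper name `debug_headers` is hypothetical, and the host is one of the test servers listed in the 21:04:33 sync message.

```python
from urllib.request import Request

# Assumption: header name and "backend=<host>" value format follow the
# WikimediaDebug wikitech page; neither is spelled out in the chat itself.
DEBUG_HEADER = "X-Wikimedia-Debug"


def debug_headers(backend: str) -> dict[str, str]:
    """Build the header that asks for routing to one debug backend (hypothetical helper)."""
    return {DEBUG_HEADER: f"backend={backend}"}


# Prepare (but do not send) a request for the page mentioned in the chat.
headers = debug_headers("mwdebug1001.eqiad.wmnet")
request = Request(
    "https://test.wikipedia.org/wiki/Special:LintErrors",
    headers=headers,
)
print(headers)
```

Passing `request` to `urllib.request.urlopen` would perform the actual fetch; that step is left out here so the sketch runs without network access.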
[21:32:18] (PS1) JHathaway: kernel-purge: enable [puppet] - https://gerrit.wikimedia.org/r/894729 (https://phabricator.wikimedia.org/T277011) [21:37:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (phaultfinder) [21:41:10] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:41:26] (CR) JHathaway: "I collected the unique output from running kernel-purge -l, along with a sample host for a given output. From my read the script looks saf" [puppet] - https://gerrit.wikimedia.org/r/894729 (https://phabricator.wikimedia.org/T277011) (owner: JHathaway) [21:45:18] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons. [21:45:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T329203)', diff saved to https://phabricator.wikimedia.org/P45135 and previous config saved to /var/cache/conftool/dbconfig/20230306-214524-marostegui.json [21:45:31] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [21:46:11] ops-esams: Audit future knams power usage - https://phabricator.wikimedia.org/T331358 (RobH) p:Triage→Medium [21:46:27] SRE, ops-esams, DC-Ops: Audit future knams power usage - https://phabricator.wikimedia.org/T331358 (RobH) [21:58:56] (PS1) Sbailey: Enable new Linter UI for namespace, tag and template for group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/894733 (https://phabricator.wikimedia.org/T299612) [21:59:48] (PS9) Brennen Bearnes: Split out a new class for phabricator::config [puppet] - https://gerrit.wikimedia.org/r/891841 (https://phabricator.wikimedia.org/T329908) (owner: EoghanGaffney) [22:00:05] Reedy, sbassett, Maryum, and manfredi: 
I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230306T2200). [22:00:13] (CR) Sbailey: "Rolling out new Linter UI to group1" [mediawiki-config] - https://gerrit.wikimedia.org/r/894733 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey) [22:00:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P45136 and previous config saved to /var/cache/conftool/dbconfig/20230306-220031-marostegui.json [22:13:32] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:15:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P45137 and previous config saved to /var/cache/conftool/dbconfig/20230306-221537-marostegui.json [22:30:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T329203)', diff saved to https://phabricator.wikimedia.org/P45138 and previous config saved to /var/cache/conftool/dbconfig/20230306-223044-marostegui.json [22:30:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [22:30:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [22:30:52] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [22:35:14] RECOVERY - MegaRAID on an-worker1078 is OK: OK: 
optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:37:35] (PS1) Bking: search-airflow: add analytics sql replica creds [puppet] - https://gerrit.wikimedia.org/r/894740 [22:38:00] (CR) CI reject: [V: -1] search-airflow: add analytics sql replica creds [puppet] - https://gerrit.wikimedia.org/r/894740 (owner: Bking) [22:49:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [22:49:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [22:50:01] (PS2) Bking: search-airflow: add analytics sql replica creds [puppet] - https://gerrit.wikimedia.org/r/894740 [22:51:09] (PS3) Ryan Kemper: search-airflow: add analytics sql replica creds [puppet] - https://gerrit.wikimedia.org/r/894740 (owner: Bking) [22:51:26] (PS4) Ryan Kemper: search-airflow: add analytics sql replica creds [puppet] - https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) (owner: Bking) [22:52:07] (PS1) Cathal Mooney: Add automation for EVPN BGP peerings [homer/public] - https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) [22:53:04] SRE, Traffic: ATS: origins server response data accounting issues - https://phabricator.wikimedia.org/T284290 (BCornwall) Open→Invalid Considering that over two years this doesn't seem to have cropped up, I don't think it's worth keeping open unless this becomes a problem again. The grafana metri... 
[22:53:07] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (cmooney) [22:53:26] (PS5) Ryan Kemper: search-airflow: add analytics sql replica creds [puppet] - https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) (owner: Bking) [22:54:34] (PS2) Cathal Mooney: Add automation for EVPN BGP peerings [homer/public] - https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) [22:57:26] (PS3) Cathal Mooney: Add automation for EVPN BGP peerings [homer/public] - https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) [22:59:16] (PS4) Cathal Mooney: Add automation for EVPN BGP peerings [homer/public] - https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) [23:04:57] !log bking@cumin2002 'depool wcqs and wdqs row A hosts T329073' [23:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:04] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 [23:05:23] !log T329073 Pre-emptively depooled internal wdqs hosts `wdqs10[03,11]` [23:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:44] (CR) Dzahn: [C: +1] "looks good to me. compiler output looks good too. 
mega nitpick: trailing whitespace in lines 6 and 21 of config.pp" [puppet] - https://gerrit.wikimedia.org/r/891841 (https://phabricator.wikimedia.org/T329908) (owner: EoghanGaffney) [23:09:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [23:09:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [23:11:31] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@53a0280]: (no justification provided) [23:11:47] SRE, Traffic: Drop the VarnishTrafficDrop and HAProxyEdgeTrafficDrop alerts - https://phabricator.wikimedia.org/T322220 (BCornwall) Open→Resolved a:BCornwall [23:11:48] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@53a0280]: (no justification provided) (duration: 00m 17s) [23:14:04] SRE, Scap, serviceops-collab, Release-Engineering-Team (GitLab V: Event Horizon 🌄): Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668 (Dzahn) [23:16:00] !log bking@cumin2002 ban row A cloudelastic hosts T329073 [23:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:06] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 [23:19:00] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 12 hosts with reason: switch maintenance [23:19:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 12 hosts with reason: switch maintenance [23:19:55] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=786ee8c7-4753-4e2d-96f9-8b55b691ff09) set by bking@cumin2002 for 1 day, 0:00:00 on 12 ho... 
[23:20:07] (PS5) Cathal Mooney: Add automation for EVPN BGP peerings [homer/public] - https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) [23:20:15] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wcqs1001.eqiad.wmnet,wdqs[1003-1004,1006,1011].eqiad.wmnet with reason: switch maintenance [23:20:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wcqs1001.eqiad.wmnet,wdqs[1003-1004,1006,1011].eqiad.wmnet with reason: switch maintenance [23:21:00] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f9f1bd07-4af1-41e3-82b7-3ab0f2ff8672) set by bking@cumin2002 for 1 day, 0:00:00 on 5 hos... [23:22:25] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (bking) [23:25:16] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (RKemper) [23:28:20] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Dzahn) [23:29:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [23:29:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [23:29:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T329203)', diff saved to https://phabricator.wikimedia.org/P45139 and previous config saved to /var/cache/conftool/dbconfig/20230306-232933-marostegui.json [23:29:40] T329203: Add new column cuc_only_for_read_old to cu_changes for 
migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [23:37:04] (PS1) Dzahn: peopleweb: ensure each user automatically gets a public_html dir [puppet] - https://gerrit.wikimedia.org/r/894744 (https://phabricator.wikimedia.org/T330091) [23:37:25] (CR) CI reject: [V: -1] peopleweb: ensure each user automatically gets a public_html dir [puppet] - https://gerrit.wikimedia.org/r/894744 (https://phabricator.wikimedia.org/T330091) (owner: Dzahn) [23:39:09] (PS2) Dzahn: peopleweb: ensure each user automatically gets a public_html dir [puppet] - https://gerrit.wikimedia.org/r/894744 (https://phabricator.wikimedia.org/T330091) [23:43:50] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:44:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:50:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T329203)', diff saved to https://phabricator.wikimedia.org/P45140 and previous config saved to /var/cache/conftool/dbconfig/20230306-235006-marostegui.json [23:50:13] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203