[00:03:09] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [00:03:46] SRE, ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (Papaul) Open→Resolved It turn out that the issue that was making the R450 to fail during provisioning was 1 - The BIOS was set to UEFI 2 - The Serial communication settings were differen... [00:05:19] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:05:20] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudmetrics[1001-1002].eqiad.wmnet [00:05:20] SRE, ops-codfw: Troubleshoot why latest idrac version is not working on Dell servers - https://phabricator.wikimedia.org/T322419 (Papaul) I had a chat with @jbond in IRC he is looking into this. [00:06:02] ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Hardware): decommission cloudmetrics100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T297444 (Andrew) a:Andrew→Jclark-ctr [00:06:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2149.codfw.wmnet with reason: Maintenance [00:06:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2149.codfw.wmnet with reason: Maintenance [00:06:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T321130)', diff saved to https://phabricator.wikimedia.org/P39861 and previous config saved to /var/cache/conftool/dbconfig/20221116-000645-marostegui.json [00:06:50] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [00:07:13] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:11:54] 
(PS4) BCornwall: prometheus: Refactor ATS config monitoring [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) [00:14:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2004.codfw.wmnet with OS bullseye [00:14:12] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2004.codfw.wmnet with OS bullseye [00:18:42] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:19:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:20:44] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 7 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:21:00] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:24:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:26:52] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:29:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2004.codfw.wmnet with reason: host reimage [00:31:43] (CR) Cwhite: [C: +1] netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [00:32:37] (CR) Cwhite: [C: +1] netmon: Add netmon2002 to the alertmanager rw api [puppet] - https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [00:33:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2004.codfw.wmnet with reason: host reimage [00:40:38] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 131 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:40:41] (PS5) BCornwall: prometheus: Refactor ATS config monitoring [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) [00:41:17] (CR) CI reject: [V: -1] prometheus: Refactor ATS config monitoring [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall) [00:41:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T321130)', diff saved to https://phabricator.wikimedia.org/P39862 and previous config saved to /var/cache/conftool/dbconfig/20221116-004130-marostegui.json [00:41:36] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 
[00:42:56] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:43:07] (PS6) BCornwall: prometheus: Refactor ATS config monitoring [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) [00:43:43] (CR) CI reject: [V: -1] prometheus: Refactor ATS config monitoring [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall) [00:45:26] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 151 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:46:17] (PS7) BCornwall: prometheus: Refactor ATS config monitoring [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) [00:46:50] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:53:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2004.codfw.wmnet with OS bullseye [00:53:48] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov2004.codfw.wmnet with OS bullseye completed: - dbprov2004 (**WARN**)... 
[00:54:08] RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:56:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P39863 and previous config saved to /var/cache/conftool/dbconfig/20221116-005636-marostegui.json [00:59:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T318605)', diff saved to https://phabricator.wikimedia.org/P39864 and previous config saved to /var/cache/conftool/dbconfig/20221116-005921-ladsgroup.json [00:59:26] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [01:03:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [01:03:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [01:03:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T318605)', diff saved to https://phabricator.wikimedia.org/P39865 and previous config saved to /var/cache/conftool/dbconfig/20221116-010330-ladsgroup.json [01:06:38] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:06:42] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (Papaul) Open→Resolved The R650 is working fine no issue to report on my end. The only problem and I think we know already about it is that the server has 1 power... 
[01:10:30] SRE, ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (Papaul) [01:11:24] SRE, ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (Papaul) [01:11:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P39866 and previous config saved to /var/cache/conftool/dbconfig/20221116-011143-marostegui.json [01:14:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P39867 and previous config saved to /var/cache/conftool/dbconfig/20221116-011427-ladsgroup.json [01:15:48] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS bullseye [01:19:10] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:23:42] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:23:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:26:26] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:26:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T321130)', diff saved to https://phabricator.wikimedia.org/P39869 and previous config saved to /var/cache/conftool/dbconfig/20221116-012649-marostegui.json [01:26:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2156.codfw.wmnet with reason: Maintenance [01:26:55] T321130: Add column cuc_private to 
cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [01:27:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2156.codfw.wmnet with reason: Maintenance [01:27:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2094.codfw.wmnet with reason: Maintenance [01:27:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2094.codfw.wmnet with reason: Maintenance [01:27:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T321130)', diff saved to https://phabricator.wikimedia.org/P39870 and previous config saved to /var/cache/conftool/dbconfig/20221116-012726-marostegui.json [01:28:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.216 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:29:08] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48975 bytes in 0.225 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:29:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P39871 and previous config saved to /var/cache/conftool/dbconfig/20221116-012934-ladsgroup.json [01:30:22] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:36:52] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:45] (JobUnavailable) firing: (5) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:44] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:27] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2042.codfw.wmnet with OS bullseye [01:43:37] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS bullseye [01:44:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T318605)', diff saved to https://phabricator.wikimedia.org/P39872 and previous config saved to /var/cache/conftool/dbconfig/20221116-014441-ladsgroup.json [01:44:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance [01:44:46] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [01:44:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance [01:45:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T318605)', diff saved to https://phabricator.wikimedia.org/P39873 and previous config saved to /var/cache/conftool/dbconfig/20221116-014502-ladsgroup.json [01:47:50] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:48:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: 
CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:52:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T321130)', diff saved to https://phabricator.wikimedia.org/P39874 and previous config saved to /var/cache/conftool/dbconfig/20221116-015223-marostegui.json [01:52:29] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:02:14] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P39875 and previous config saved to /var/cache/conftool/dbconfig/20221116-020730-marostegui.json [02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:06] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:13:26] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2042.codfw.wmnet with OS bullseye [02:17:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:19:57] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS bullseye [02:19:58] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:22:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P39876 and previous config saved to /var/cache/conftool/dbconfig/20221116-022236-marostegui.json [02:31:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T318605)', diff saved to https://phabricator.wikimedia.org/P39877 and previous config saved to /var/cache/conftool/dbconfig/20221116-023101-ladsgroup.json [02:31:07] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [02:37:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T321130)', diff saved to https://phabricator.wikimedia.org/P39878 and previous config saved to /var/cache/conftool/dbconfig/20221116-023743-marostegui.json [02:37:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2177.codfw.wmnet with reason: Maintenance [02:37:48] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [02:38:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2177.codfw.wmnet with reason: Maintenance [02:38:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T321130)', diff saved to https://phabricator.wikimedia.org/P39879 and previous config saved to /var/cache/conftool/dbconfig/20221116-023815-marostegui.json [02:46:09] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P39880 and previous config saved to /var/cache/conftool/dbconfig/20221116-024608-ladsgroup.json [02:57:38] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2042.codfw.wmnet with OS bullseye [03:01:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P39881 and previous config saved to /var/cache/conftool/dbconfig/20221116-030115-ladsgroup.json [03:12:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T321130)', diff saved to https://phabricator.wikimedia.org/P39882 and previous config saved to /var/cache/conftool/dbconfig/20221116-031230-marostegui.json [03:12:36] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [03:16:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T318605)', diff saved to https://phabricator.wikimedia.org/P39883 and previous config saved to /var/cache/conftool/dbconfig/20221116-031621-ladsgroup.json [03:16:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [03:16:27] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [03:16:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [03:16:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T318605)', diff saved to https://phabricator.wikimedia.org/P39884 and previous config saved to /var/cache/conftool/dbconfig/20221116-031642-ladsgroup.json [03:21:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T318605)', diff saved to 
https://phabricator.wikimedia.org/P39885 and previous config saved to /var/cache/conftool/dbconfig/20221116-032111-ladsgroup.json [03:27:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P39886 and previous config saved to /var/cache/conftool/dbconfig/20221116-032737-marostegui.json [03:36:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P39887 and previous config saved to /var/cache/conftool/dbconfig/20221116-033617-ladsgroup.json [03:42:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P39888 and previous config saved to /var/cache/conftool/dbconfig/20221116-034243-marostegui.json [03:51:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P39889 and previous config saved to /var/cache/conftool/dbconfig/20221116-035124-ladsgroup.json [03:57:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T321130)', diff saved to https://phabricator.wikimedia.org/P39890 and previous config saved to /var/cache/conftool/dbconfig/20221116-035750-marostegui.json [03:57:55] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [04:06:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T318605)', diff saved to https://phabricator.wikimedia.org/P39891 and previous config saved to /var/cache/conftool/dbconfig/20221116-040630-ladsgroup.json [04:06:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [04:06:36] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [04:06:46] !log ladsgroup@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [04:06:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T318605)', diff saved to https://phabricator.wikimedia.org/P39892 and previous config saved to /var/cache/conftool/dbconfig/20221116-040652-ladsgroup.json [04:07:13] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:18:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:19:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:49:39] !log on mwmaint1002: running storageTypeStats.php on dewiki [04:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T318605)', diff saved to https://phabricator.wikimedia.org/P39893 and previous config saved to /var/cache/conftool/dbconfig/20221116-050354-ladsgroup.json [05:04:00] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [05:19:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P39894 and previous config saved to /var/cache/conftool/dbconfig/20221116-051901-ladsgroup.json [05:34:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to 
https://phabricator.wikimedia.org/P39895 and previous config saved to /var/cache/conftool/dbconfig/20221116-053407-ladsgroup.json [05:37:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T318605)', diff saved to https://phabricator.wikimedia.org/P39896 and previous config saved to /var/cache/conftool/dbconfig/20221116-053734-ladsgroup.json [05:37:40] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [05:49:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T318605)', diff saved to https://phabricator.wikimedia.org/P39897 and previous config saved to /var/cache/conftool/dbconfig/20221116-054914-ladsgroup.json [05:49:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [05:49:19] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [05:49:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [05:49:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T318605)', diff saved to https://phabricator.wikimedia.org/P39898 and previous config saved to /var/cache/conftool/dbconfig/20221116-054935-ladsgroup.json [05:52:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P39899 and previous config saved to /var/cache/conftool/dbconfig/20221116-055241-ladsgroup.json [06:07:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P39900 and previous config saved to /var/cache/conftool/dbconfig/20221116-060747-ladsgroup.json [06:19:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:22:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T318605)', diff saved to https://phabricator.wikimedia.org/P39901 and previous config saved to /var/cache/conftool/dbconfig/20221116-062253-ladsgroup.json [06:22:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [06:22:59] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [06:23:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [06:23:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T318605)', diff saved to https://phabricator.wikimedia.org/P39902 and previous config saved to /var/cache/conftool/dbconfig/20221116-062315-ladsgroup.json [06:31:42] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2107.codfw.wmnet with reason: Maintenance [06:34:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2107.codfw.wmnet with reason: Maintenance [06:35:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1122.eqiad.wmnet with reason: Maintenance [06:35:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1122.eqiad.wmnet with reason: Maintenance [06:36:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2127.codfw.wmnet with reason: Maintenance [06:36:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2127.codfw.wmnet with 
reason: Maintenance [06:46:46] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [06:48:42] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [06:55:56] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:01:42] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:14:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T318605)', diff saved to https://phabricator.wikimedia.org/P39903 and previous config saved to /var/cache/conftool/dbconfig/20221116-071420-ladsgroup.json [07:14:26] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [07:29:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P39904 and previous config saved to /var/cache/conftool/dbconfig/20221116-072926-ladsgroup.json [07:44:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P39905 and previous config saved to /var/cache/conftool/dbconfig/20221116-074433-ladsgroup.json [07:52:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T318605)', diff saved to https://phabricator.wikimedia.org/P39906 and 
previous config saved to /var/cache/conftool/dbconfig/20221116-075204-ladsgroup.json [07:52:09] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [07:57:58] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: connect to address 208.80.154.31 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:58:12] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,wmf_auto_restart_apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:58:34] PROBLEM - mailman archives on lists1001 is CRITICAL: connect to address 208.80.154.31 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:58:42] PROBLEM - mailman list info on lists1001 is CRITICAL: connect to address 208.80.154.31 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:58:48] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:59:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T318605)', diff saved to https://phabricator.wikimedia.org/P39907 and previous config saved to /var/cache/conftool/dbconfig/20221116-075940-ladsgroup.json [07:59:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [07:59:45] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [07:59:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [08:00:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39908 and previous config 
saved to /var/cache/conftool/dbconfig/20221116-080001-ladsgroup.json [08:00:05] Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T0800). Please do the needful. [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:00:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [debs/prometheus-logstash-exporter] - 10https://gerrit.wikimedia.org/r/857049 (https://phabricator.wikimedia.org/T321410) (owner: 10Cwhite) [08:01:33] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/856612 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [08:02:38] (03CR) 10Filippo Giunchedi: "I'll let Cole vote but LGTM from a quick look" [puppet] - 10https://gerrit.wikimedia.org/r/855719 (https://phabricator.wikimedia.org/T319020) (owner: 10Ryan Kemper) [08:07:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P39909 and previous config saved to /var/cache/conftool/dbconfig/20221116-080710-ladsgroup.json [08:07:13] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:19:05] (03PS1) 10Matthias Mullie: Ensure array is passed to getProperties [extensions/PageImages] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857426 (https://phabricator.wikimedia.org/T323152) [08:21:56] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P39910 and 
previous config saved to /var/cache/conftool/dbconfig/20221116-082217-ladsgroup.json [08:27:48] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:32:28] (03CR) 10CI reject: [V: 04-1] Ensure array is passed to getProperties [extensions/PageImages] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857426 (https://phabricator.wikimedia.org/T323152) (owner: 10Matthias Mullie) [08:35:59] (03PS2) 10Matthias Mullie: Ensure array is passed to getProperties [extensions/PageImages] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857426 (https://phabricator.wikimedia.org/T323152) [08:36:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1022.eqiad.wmnet [08:37:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T318605)', diff saved to https://phabricator.wikimedia.org/P39911 and previous config saved to /var/cache/conftool/dbconfig/20221116-083723-ladsgroup.json [08:37:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1133.eqiad.wmnet with reason: Maintenance [08:37:28] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [08:37:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1133.eqiad.wmnet with reason: Maintenance [08:45:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1022.eqiad.wmnet [08:45:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1022.eqiad.wmnet to cluster eqiad and group D [08:47:10] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1022.eqiad.wmnet to cluster eqiad and group D [08:51:04] RECOVERY - cassandra-b CQL 10.64.48.122:9042 on aqs1019 is OK: TCP OK - 0.001 second 
response time on 10.64.48.122 port 9042 https://phabricator.wikimedia.org/T93886 [09:13:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:13:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45899 [09:13:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45899 [09:16:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 30844 [09:16:31] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 30844 [09:16:53] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [09:17:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 293 [09:18:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:18:24] (03PS3) 10Filippo Giunchedi: pki: move root common settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) [09:18:26] (03PS3) 10Filippo Giunchedi: pontoon: copy out the root pki ca [puppet] - 10https://gerrit.wikimedia.org/r/857006 (https://phabricator.wikimedia.org/T319163) [09:18:28] (03PS3) 10Filippo Giunchedi: pontoon: install Puppet and PKI CAs as certificates [puppet] - 10https://gerrit.wikimedia.org/r/857007 
(https://phabricator.wikimedia.org/T319163) [09:18:30] (03PS1) 10Filippo Giunchedi: pontoon: serve public pki certs via fileserver [puppet] - 10https://gerrit.wikimedia.org/r/857475 (https://phabricator.wikimedia.org/T319163) [09:18:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 293 [09:18:49] (03PS2) 10Muehlenhoff: Add Cumin alias for dispatch [puppet] - 10https://gerrit.wikimedia.org/r/857015 [09:22:15] (03PS1) 10Elukey: turnilo: add cache_status to webrequest_live_sampled [puppet] - 10https://gerrit.wikimedia.org/r/857476 (https://phabricator.wikimedia.org/T314981) [09:22:36] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:07] (03PS1) 10Cathal Mooney: Change get_underlay_ints() to use Netbox VRF field for filtering [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857477 (https://phabricator.wikimedia.org/T312635) [09:26:38] (03PS1) 10Cathal Mooney: Add OSPF automation template for EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) [09:28:26] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39912 and previous config saved to /var/cache/conftool/dbconfig/20221116-093112-ladsgroup.json [09:31:17] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [09:34:25] (03CR) 10FNegri: [C: 03+1] "LGTM, which command was failing before this patch?" 
[puppet] - 10https://gerrit.wikimedia.org/r/857073 (https://phabricator.wikimedia.org/T301949) (owner: 10Andrew Bogott) [09:34:48] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) Just to update in terms of the LVS connections. After discussing with Brandon I thought it best if the links from all 4 LVS terminate on diff... [09:37:36] (03PS2) 10Cathal Mooney: Add OSPF automation template for EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) [09:38:16] (03PS3) 10Cathal Mooney: Add OSPF automation template for EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) [09:46:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P39913 and previous config saved to /var/cache/conftool/dbconfig/20221116-094618-ladsgroup.json [09:46:42] (03CR) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [09:47:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10jcrespo) > I think we know already about it is that the server has 1 power supply on the left and the other one on the right Please be sure to comment it with @RobH so h... 
[09:48:45] (03PS1) 10Hashar: gerrit: remove Gerrit 3.5 obsolete @apply css statement [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/857499 (https://phabricator.wikimedia.org/T315445) [09:48:56] (03CR) 10JMeybohm: [C: 03+2] k8s: Add a central ipv6dualstack flag to enable dual stack [puppet] - 10https://gerrit.wikimedia.org/r/856589 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:48:59] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Fix duplicate definition of --service-account-key-file [puppet] - 10https://gerrit.wikimedia.org/r/857004 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:49:18] (03Abandoned) 10Hashar: gerrit: remove Gerrit 3.5 obsolete @apply css statement [puppet] - 10https://gerrit.wikimedia.org/r/824222 (https://phabricator.wikimedia.org/T315445) (owner: 10Hashar) [09:49:26] (03CR) 10Hashar: [C: 03+2] gerrit: remove Gerrit 3.5 obsolete @apply css statement [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/857499 (https://phabricator.wikimedia.org/T315445) (owner: 10Hashar) [09:49:54] (03Merged) 10jenkins-bot: gerrit: remove Gerrit 3.5 obsolete @apply css statement [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/857499 (https://phabricator.wikimedia.org/T315445) (owner: 10Hashar) [09:50:32] (03CR) 10Filippo Giunchedi: [C: 03+1] turnilo: add cache_status to webrequest_live_sampled [puppet] - 10https://gerrit.wikimedia.org/r/857476 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [09:56:11] (03PS1) 10Effie Mouzeli: maps: enable postres replication slots in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/857505 (https://phabricator.wikimedia.org/T290149) [09:56:24] (03CR) 10Muehlenhoff: "Looks good! 
A few remaining nits/typos and one suggestion for an additional test case (but we can also simply add that to a subsequent rel" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [09:59:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [09:59:59] MatmaRex: fyi, the script is still running, currently on commonswiki (Processed 4012200 (updated 230542) of 118789703 rows) [10:00:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [10:00:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T318605)', diff saved to https://phabricator.wikimedia.org/P39914 and previous config saved to /var/cache/conftool/dbconfig/20221116-100027-ladsgroup.json [10:00:32] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [10:00:33] taavi: yep, thank you [10:01:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P39915 and previous config saved to /var/cache/conftool/dbconfig/20221116-100125-ladsgroup.json [10:02:19] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10puppet-compiler, and 3 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (10jbond) [10:03:09] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10jbond) [10:03:39] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/857014 (owner: 10Muehlenhoff) [10:04:12] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10puppet-compiler, and 3 others: Improve PCC support for 
cloud VPS environments - https://phabricator.wikimedia.org/T289666 (10jbond) 05In progress→03Resolved With the basic selector announced yesterday i think we have all actions complete so will re... [10:04:29] (03PS2) 10Effie Mouzeli: maps: enable postres replication slots in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/857505 (https://phabricator.wikimedia.org/T290149) [10:05:49] !log kevinbazira@deploy1002 Started deploy [ores/deploy@0114799]: T319373 [10:05:54] T319373: Deploy new fawiki articlequality model to ORES and LiftWing - https://phabricator.wikimedia.org/T319373 [10:06:22] (03PS2) 10Muehlenhoff: Extend cloudbackup Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/857014 [10:11:56] (03CR) 10Muehlenhoff: [C: 03+2] Extend cloudbackup Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/857014 (owner: 10Muehlenhoff) [10:14:54] (03CR) 10Effie Mouzeli: [V: 04-1] "PCC Fails https://puppet-compiler.wmflabs.org/output/857505/38220/" [puppet] - 10https://gerrit.wikimedia.org/r/857505 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli) [10:16:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39916 and previous config saved to /var/cache/conftool/dbconfig/20221116-101631-ladsgroup.json [10:16:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [10:16:37] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [10:16:41] !log kevinbazira@deploy1002 Finished deploy [ores/deploy@0114799]: T319373 (duration: 10m 51s) [10:16:45] T319373: Deploy new fawiki articlequality model to ORES and LiftWing - https://phabricator.wikimedia.org/T319373 [10:16:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [10:16:53] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39917 and previous config saved to /var/cache/conftool/dbconfig/20221116-101653-ladsgroup.json [10:17:57] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:30] (03PS1) 10Filippo Giunchedi: prometheus: add benthos jobs [puppet] - 10https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) [10:22:06] (03CR) 10CI reject: [V: 04-1] prometheus: add benthos jobs [puppet] - 10https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi) [10:23:49] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:59] (03PS2) 10Filippo Giunchedi: prometheus: add benthos jobs [puppet] - 10https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) [10:24:35] (03CR) 10CI reject: [V: 04-1] prometheus: add benthos jobs [puppet] - 10https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi) [10:25:25] (03PS7) 10Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [10:28:49] (03CR) 10Hnowlan: "lgtm, one nit" [puppet] - 10https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli) [10:29:34] !log restarting apache on lists.wm.o [10:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:06] !log Run `mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php` for all wikis in growthexperiments.dblist at mwmaint1002 (T318457) [10:30:09] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:10] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [10:30:14] (03CR) 10Hnowlan: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli) [10:30:29] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38221/console" [puppet] - 10https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi) [10:31:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.335 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:31:43] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:32:14] (03PS1) 10Filippo Giunchedi: prometheus: default to valid external url [puppet] - 10https://gerrit.wikimedia.org/r/857522 (https://phabricator.wikimedia.org/T301944) [10:32:23] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:32:31] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2022-12-22 06:15:55 +0000 (expires in 35 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:33:15] (03CR) 10Filippo Giunchedi: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi) [10:36:25] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:32] (03CR) 
10Ladsgroup: Add Cumin alias for orchestrator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857017 (owner: 10Muehlenhoff) [10:43:33] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10Abhas) [10:47:52] (03CR) 10JMeybohm: [C: 03+1] pontoon: copy out the root pki ca [puppet] - 10https://gerrit.wikimedia.org/r/857006 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [10:49:08] (03CR) 10JMeybohm: [C: 03+1] pontoon: install Puppet and PKI CAs as certificates [puppet] - 10https://gerrit.wikimedia.org/r/857007 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [10:49:40] (03CR) 10JMeybohm: [C: 03+1] pontoon: serve public pki certs via fileserver [puppet] - 10https://gerrit.wikimedia.org/r/857475 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [10:51:07] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: copy out the root pki ca [puppet] - 10https://gerrit.wikimedia.org/r/857006 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [10:51:16] (03PS4) 10Filippo Giunchedi: pontoon: copy out the root pki ca [puppet] - 10https://gerrit.wikimedia.org/r/857006 (https://phabricator.wikimedia.org/T319163) [10:51:21] (03CR) 10Filippo Giunchedi: [V: 03+2] pontoon: copy out the root pki ca [puppet] - 10https://gerrit.wikimedia.org/r/857006 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [10:51:31] (03CR) 10JMeybohm: [C: 03+1] pki: move root common settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [10:51:44] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: install Puppet and PKI CAs as certificates [puppet] - 10https://gerrit.wikimedia.org/r/857007 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [10:51:54] (03PS4) 10Filippo Giunchedi: pontoon: install Puppet and PKI CAs as certificates 
[puppet] - 10https://gerrit.wikimedia.org/r/857007 (https://phabricator.wikimedia.org/T319163) [10:51:56] (03CR) 10Filippo Giunchedi: [V: 03+2] pontoon: install Puppet and PKI CAs as certificates [puppet] - 10https://gerrit.wikimedia.org/r/857007 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [10:52:11] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: serve public pki certs via fileserver [puppet] - 10https://gerrit.wikimedia.org/r/857475 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [10:53:15] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:19] 10SRE, 10Wikimedia-Mailing-lists: lists apache config change should trigger an apache restart - https://phabricator.wikimedia.org/T323208 (10Ladsgroup) [10:57:01] (03PS1) 10Phedenskog: Update phedenskogs keys. [puppet] - 10https://gerrit.wikimedia.org/r/857529 [10:59:11] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:57] 10SRE, 10Wikimedia-Mailing-lists: lists apache config change should trigger an apache restart - https://phabricator.wikimedia.org/T323208 (10Ladsgroup) p:05Triage→03High [11:03:21] 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache restart - https://phabricator.wikimedia.org/T323208 (10jcrespo) [11:06:05] 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache restart - https://phabricator.wikimedia.org/T323208 (10jcrespo) I am marking this as an incident, as lists were down for around 2.5h. Although it could also be considered an #wikimedia-incident-actiona... 
[11:10:03] (03CR) 10Elukey: [C: 03+1] prometheus: add benthos jobs [puppet] - 10https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi) [11:11:38] 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache restart - https://phabricator.wikimedia.org/T323208 (10Vgutierrez) hmmm that would trigger a few seconds of downtime every time that Apache is restarted automatically by puppet [11:12:05] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:13:34] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/856969/38222/" [puppet] - 10https://gerrit.wikimedia.org/r/856969 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:13:54] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/856969 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:14:05] 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10jcrespo) [11:14:15] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1001.eqiad.wmnet with OS bullseye [11:14:25] (03CR) 10JMeybohm: [C: 03+1] istio: change configs to adapt for 1.15.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/855967 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [11:14:36] 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10jcrespo) > 
hmmm that would trigger a few seconds of downtime every time that Apache is restarted automatically by puppet I believe the updated tit... [11:14:51] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] cloudgw1001: prepare for reimage into the new vlan NIC name with a single NIC [puppet] - 10https://gerrit.wikimedia.org/r/856969 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:17:50] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: add benthos jobs [puppet] - 10https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi) [11:26:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:26:57] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage [11:27:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [11:27:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [11:29:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [11:29:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [11:31:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T318605)', diff saved to https://phabricator.wikimedia.org/P39918 and previous config saved to /var/cache/conftool/dbconfig/20221116-113108-ladsgroup.json [11:31:14] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [11:31:23] !log aborrero@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage [11:31:45] (JobUnavailable) resolved: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:33:02] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:34:59] (03PS1) 10Filippo Giunchedi: benthos: fix service name [puppet] - 10https://gerrit.wikimedia.org/r/857544 (https://phabricator.wikimedia.org/T319214) [11:35:01] (03PS1) 10Filippo Giunchedi: benthos: reload on config changes [puppet] - 10https://gerrit.wikimedia.org/r/857545 (https://phabricator.wikimedia.org/T319214) [11:36:43] (03CR) 10FNegri: [C: 03+1] ceph.roll_restart_*daemons: allow ignoring current health issues (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [11:39:27] (03PS5) 10Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) [11:40:06] (03PS1) 10Muehlenhoff: buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/857547 [11:40:54] (03PS2) 10Muehlenhoff: buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/857547 [11:40:58] (03PS6) 10Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) [11:45:08] (03CR) 10Muehlenhoff: [C: 03+2] buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/857547 (owner: 10Muehlenhoff) [11:45:51] (03CR) 10Effie Mouzeli: maps: add support for replication slots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857067 
(https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli) [11:46:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P39919 and previous config saved to /var/cache/conftool/dbconfig/20221116-114615-ladsgroup.json [11:46:38] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1001.eqiad.wmnet with OS bullseye [11:49:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39920 and previous config saved to /var/cache/conftool/dbconfig/20221116-114921-ladsgroup.json [11:49:27] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [11:53:19] (03PS1) 10Arturo Borrero Gonzalez: cloudgw1002: move to the single-NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/857557 (https://phabricator.wikimedia.org/T319184) [11:57:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [12:00:43] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected https://puppet-compiler.wmflabs.org/output/857557/38223/" [puppet] - 10https://gerrit.wikimedia.org/r/857557 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:01:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P39921 and previous config saved to /var/cache/conftool/dbconfig/20221116-120122-ladsgroup.json [12:04:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P39922 and previous config saved to /var/cache/conftool/dbconfig/20221116-120428-ladsgroup.json [12:06:44] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: 
cleanup unused code to support multiple NICs [puppet] - 10https://gerrit.wikimedia.org/r/857560 (https://phabricator.wikimedia.org/T319184) [12:06:57] (03PS5) 10Slyngshede: Initial checkin [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) [12:07:08] (03CR) 10Slyngshede: Initial checkin (033 comments) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [12:07:13] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:07:34] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS bullseye [12:10:51] (03PS1) 10Effie Mouzeli: C:postgres::master: add support for multiple replicas [puppet] - 10https://gerrit.wikimedia.org/r/857561 [12:11:05] (03PS1) 10Muehlenhoff: Pull in the fdisk-udeb in d-i [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309) [12:11:15] (03PS2) 10Muehlenhoff: Pull in the fdisk-udeb in d-i [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309) [12:13:51] (03CR) 10Ssingh: Pull in the fdisk-udeb in d-i (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309) (owner: 10Muehlenhoff) [12:14:34] (03PS1) 10Muehlenhoff: Failover idp.w.p to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/857563 (https://phabricator.wikimedia.org/T311235) [12:16:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T318605)', diff saved to https://phabricator.wikimedia.org/P39923 and previous config saved to /var/cache/conftool/dbconfig/20221116-121628-ladsgroup.json [12:16:30] !log ladsgroup@cumin1001 
START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:16:34] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [12:16:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:17:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T318605)', diff saved to https://phabricator.wikimedia.org/P39924 and previous config saved to /var/cache/conftool/dbconfig/20221116-121701-ladsgroup.json [12:18:16] (03PS15) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [12:18:37] (03CR) 10Hokwelum: "Thank you, It looks good. But we haven’t tested!" [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn) [12:19:12] 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10ssingh) Adding to this task in case it helps someone else; thanks to @fgiunchedi and @jcrespo for documenting the original findings. We ran into the same issue (PXE boot works f... [12:19:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P39925 and previous config saved to /var/cache/conftool/dbconfig/20221116-121934-ladsgroup.json [12:20:28] (03CR) 10Ssingh: "Adding that I did anna-install fdisk-udeb on the cp host cp2042 and earlier I was getting:" [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309) (owner: 10Muehlenhoff) [12:21:30] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/857557 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:21:40] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] cloudgw1002: move to the single-NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/857557 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:22:53] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2042.codfw.wmnet with reason: host reimage [12:23:16] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1002.eqiad.wmnet with OS bullseye [12:23:39] (03CR) 10Muehlenhoff: Pull in the fdisk-udeb in d-i (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309) (owner: 10Muehlenhoff) [12:23:51] (03PS3) 10Muehlenhoff: Pull in the fdisk-udeb in d-i [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309) [12:25:10] (03CR) 10Ssingh: [C: 03+1] "Looks good and thanks for submitting the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309) (owner: 10Muehlenhoff) [12:25:59] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [12:26:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2042.codfw.wmnet with reason: host reimage [12:26:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:27:36] sukhe: that last message didn’t get logged because stashbot quit, fyi [12:27:37] 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10Volans) @ssingh there is the [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/hardware/upgrade-firmware.py | sre.hardware.... 
[12:27:43] (see also #wikimedia-cloud) [12:27:45] (03CR) 10Jbond: "LGTM but see nit/suggestion" [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff) [12:29:11] (03PS1) 10Vgutierrez: varnish::tests: Update PCC URL regex [puppet] - 10https://gerrit.wikimedia.org/r/857572 [12:29:42] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected https://puppet-compiler.wmflabs.org/output/857560/38229/" [puppet] - 10https://gerrit.wikimedia.org/r/857560 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:31:02] (03CR) 10CI reject: [V: 04-1] varnish::tests: Update PCC URL regex [puppet] - 10https://gerrit.wikimedia.org/r/857572 (owner: 10Vgutierrez) [12:31:06] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-jbond: puppet-catalog-compiler: compilation result randomly places servers in the wrong section - https://phabricator.wikimedia.org/T224977 (10jbond) 05Open→03Resolved a:03jbond Im hoping this is resolved with the 2.5.0 release please re... [12:31:08] (03CR) 10Volans: "Question and nit inline, LGTM otherwise" [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff) [12:34:16] 10SRE, 10Continuous-Integration-Infrastructure, 10Infrastructure-Foundations, 10puppet-compiler, 10Jenkins: compiler1002.puppet-diffs.eqiad.wmflabs disk is full - https://phabricator.wikimedia.org/T222072 (10jbond) [12:34:23] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-herron, 10User-jbond: Prevent puppet catalog compiler workers from running out of disk space - https://phabricator.wikimedia.org/T222075 (10jbond) 05Open→03Resolved a:03jbond I have no preformed the following actions * move all reports... 
[12:34:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39926 and previous config saved to /var/cache/conftool/dbconfig/20221116-123441-ladsgroup.json [12:34:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [12:34:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [12:35:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T318605)', diff saved to https://phabricator.wikimedia.org/P39927 and previous config saved to /var/cache/conftool/dbconfig/20221116-123502-ladsgroup.json [12:35:42] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: host reimage [12:38:25] (03PS2) 10Effie Mouzeli: C:postgresql::master: add support for multiple replicas [puppet] - 10https://gerrit.wikimedia.org/r/857561 [12:38:38] (03CR) 10Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff) [12:39:02] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: host reimage [12:39:50] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-jbond: jenkins-bot puppet-compiler-test may report SUCCESS though compiling failed - https://phabricator.wikimedia.org/T214629 (10jbond) 05Open→03Resolved a:03jbond I believe this is not fixed but please re-open if you are still seeing th... 
[12:39:59] (03CR) 10Effie Mouzeli: "PCC is NOOP except puppetdb1002 https://puppet-compiler.wmflabs.org/output/857561/38228/" [puppet] - 10https://gerrit.wikimedia.org/r/857561 (owner: 10Effie Mouzeli) [12:40:18] (03Abandoned) 10Jbond: DO NOt MEREGE: change to demon new reporting in pcc [puppet] - 10https://gerrit.wikimedia.org/r/857031 (owner: 10Jbond) [12:42:17] (03PS2) 10Majavah: P:pontoon: include firewall rules to allow metricsinfra scraping [puppet] - 10https://gerrit.wikimedia.org/r/857023 [12:42:55] (03CR) 10Majavah: P:pontoon: include firewall rules to allow metricsinfra scraping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857023 (owner: 10Majavah) [12:43:57] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38234/console" [puppet] - 10https://gerrit.wikimedia.org/r/857023 (owner: 10Majavah) [12:45:30] (03CR) 10Filippo Giunchedi: [C: 03+2] P:pontoon: include firewall rules to allow metricsinfra scraping [puppet] - 10https://gerrit.wikimedia.org/r/857023 (owner: 10Majavah) [12:46:23] (03PS4) 10Jbond: pki: move root common settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [12:47:09] (03PS5) 10Jbond: pki: move root common settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [12:47:11] stashbot’s back, sukhe Amir1 and arturo might want to re-log a few messages if I’m reading the channel log correctly [12:47:12] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. 
[12:47:37] mine is fully automated, I have no clue what's happening [12:47:43] Lucas_WMDE: thanks, mine were not important but I appreciate the ping [12:47:51] ok [12:48:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38236/console" [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [12:48:11] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [12:49:16] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2042.codfw.wmnet with OS bullseye [12:49:33] (03CR) 10Filippo Giunchedi: [C: 03+2] pki: move root common settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [12:50:33] (03PS2) 10Filippo Giunchedi: pontoon: serve public pki certs via fileserver [puppet] - 10https://gerrit.wikimedia.org/r/857475 (https://phabricator.wikimedia.org/T319163) [12:50:49] (03CR) 10Filippo Giunchedi: [V: 03+2] pontoon: serve public pki certs via fileserver [puppet] - 10https://gerrit.wikimedia.org/r/857475 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [12:52:13] (03PS6) 10Slyngshede: Initial checkin [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) [12:52:49] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10jcrespo) Hi, Moritz, I am seeing a couple of non-fatal errors on ganeti. I wonder if they could be artifacts of the bullseye upgrade (in particular, of a ganeti upgrade), as I don't...
[12:54:14] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1002.eqiad.wmnet with OS bullseye [12:55:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/857563 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [12:56:21] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) >>! In T311687#8399383, @jcrespo wrote: > Hi, Moritz, > > I am seeing a couple of non-fatal errors on ganeti. I wonder if they could be artifacts of the bullseye... [12:57:47] (03CR) 10Jbond: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff) [12:58:32] (03CR) 10Muehlenhoff: "One more comment inline" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [12:58:53] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10jcrespo) Ah, so you mean they are temporary during the maintenance, and won't happen once all migrations are done? 
Then please keep the good work :-P [12:59:24] (03CR) 10Jbond: [C: 03+1] "LGTM this is also a noop on puppetdb as its only a title change" [puppet] - 10https://gerrit.wikimedia.org/r/857561 (owner: 10Effie Mouzeli) [12:59:26] (03CR) 10Majavah: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/857560 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:59:41] (03PS7) 10Slyngshede: Initial checkin [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) [12:59:43] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] cloudgw: cleanup unused code to support multiple NICs [puppet] - 10https://gerrit.wikimedia.org/r/857560 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:59:56] (03CR) 10Slyngshede: Initial checkin (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [13:00:06] (03CR) 10Stevemunene: [C: 03+1] turnilo: add cache_status to webrequest_live_sampled [puppet] - 10https://gerrit.wikimedia.org/r/857476 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [13:01:15] (03PS2) 10Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 [13:01:17] (03PS2) 10Vgutierrez: varnish::tests: Update PCC URL regex [puppet] - 10https://gerrit.wikimedia.org/r/857572 [13:01:25] (03CR) 10Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff) [13:05:27] (03CR) 10CI reject: [V: 04-1] Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff) [13:05:50] (03PS1) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) [13:06:05] (03CR) 10Jbond: [C: 03+1] "curious why you didn't also propose a similar change for hieradata/role/common/pki/multirootca.yaml?" [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [13:06:07] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:06:57] (03CR) 10Vgutierrez: [C: 03+2] varnish::tests: Update PCC URL regex [puppet] - 10https://gerrit.wikimedia.org/r/857572 (owner: 10Vgutierrez) [13:08:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, ship it :-)" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [13:10:11] (03PS3) 10Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 [13:12:34] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) >>! In T311687#8399396, @jcrespo wrote: > Ah, so you mean they are temporary during the maintenance, and won't happen once all migrations are done? Indeed, those... [13:12:40] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) Hello, @Atripathi. Privileged acces to LDAP is provided to people according to certain rules and needs. I hope this doesn't sound disrespectful, but I am not sure who is the requester (this... 
[13:12:59] 10SRE-Access-Requests, 10Data-Engineering: Add shell username ntsako to archiva-deployers - https://phabricator.wikimedia.org/T323213 (10BTullis) p:05Triage→03Medium a:03BTullis I'm adding the #sre-access-requests tag for visibility, but I'll carry out this work [13:13:38] (03CR) 10Slyngshede: [V: 03+2] Initial checkin [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [13:13:43] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Initial checkin [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [13:14:38] 10SRE, 10Infrastructure-Foundations: Identity Management System for Wikimedia developer accounts - https://phabricator.wikimedia.org/T315867 (10SLyngshede-WMF) [13:14:40] 10SRE, 10Infrastructure-Foundations, 10LDAP, 10Patch-For-Review: New Python base layer to manage users/groups in LDAP - https://phabricator.wikimedia.org/T313595 (10SLyngshede-WMF) 05Open→03Resolved [13:14:55] (03PS1) 10Cathal Mooney: Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) [13:15:30] 10SRE, 10Infrastructure-Foundations: Identity Management System for Wikimedia developer accounts - https://phabricator.wikimedia.org/T315867 (10SLyngshede-WMF) a:03SLyngshede-WMF [13:16:18] (03PS2) 10Cathal Mooney: Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) [13:16:41] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:17:44] (03CR) 10Muehlenhoff: [C: 03+2] Pull in the fdisk-udeb in d-i [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309) (owner: 
10Muehlenhoff) [13:19:28] 10SRE-Access-Requests, 10Data-Engineering: Add shell username ntsako to archiva-deployers - https://phabricator.wikimedia.org/T323213 (10BTullis) Hi @ntsako - I've added you to that group now. You should be able to deploy to archiva and verify your group membership here: https://ldap.toolforge.org/group/archiv... [13:19:44] 10SRE-Access-Requests, 10Data-Engineering: Add shell username ntsako to archiva-deployers - https://phabricator.wikimedia.org/T323213 (10BTullis) 05Open→03Resolved [13:20:37] (03PS1) 10Ladsgroup: Add 2022/fix_flaggedrevs_unsigned_T323214.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/857594 (https://phabricator.wikimedia.org/T323214) [13:21:01] 10SRE-Access-Requests, 10Data-Engineering: Add shell username ntsako to archiva-deployers - https://phabricator.wikimedia.org/T323213 (10ntsako) Thank you for the prompt assistance @BTullis [13:21:53] (03CR) 10Muehlenhoff: Add Cumin alias for orchestrator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857017 (owner: 10Muehlenhoff) [13:24:05] (03PS1) 10Cathal Mooney: Unify routing-intstance config across JunOS devices [homer/public] - 10https://gerrit.wikimedia.org/r/857598 (https://phabricator.wikimedia.org/T312635) [13:25:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:25:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:25:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T323214)', diff saved to https://phabricator.wikimedia.org/P39928 and previous config saved to /var/cache/conftool/dbconfig/20221116-132531-ladsgroup.json [13:25:36] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [13:28:41] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - 
https://phabricator.wikimedia.org/T323207 (10jcrespo) p:05Triage→03High [13:29:41] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:33:52] (03PS3) 10Cathal Mooney: Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) [13:33:57] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:06] (03PS2) 10Ladsgroup: Add 2022/fix_flaggedrevs_unsigned_T323214.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/857594 (https://phabricator.wikimedia.org/T323214) [13:36:27] 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10jbond) Any changes to apache config files [[ https://github.com/wikimedia/puppet/blob/production/modules/httpd/manifests/conf.pp#L83 | should cause... [13:39:08] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10Abhas) Hi Jaime, I'm the Disinformation Manager in the Trust & Safety team, and my team consumes data from a dashboard built on Superset. It is for access to the dashboard that I'm requesting LDAP a... [13:39:54] 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10MoritzMuehlenhoff) We could hook in a call "apachectl configtest" and alert if that fails (e.g. by sending a root mail or similar)? 
[13:45:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T318605)', diff saved to https://phabricator.wikimedia.org/P39929 and previous config saved to /var/cache/conftool/dbconfig/20221116-134543-ladsgroup.json [13:45:49] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:52:00] (03PS1) 10Filippo Giunchedi: benthos: apply batching to webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214) [13:55:02] !log set thanos ring replicas to 3.20 T311690 [13:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:07] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 [13:56:15] (03PS1) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 [13:57:35] (03CR) 10CI reject: [V: 04-1] Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (owner: 10Dbrant) [13:59:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T323214)', diff saved to https://phabricator.wikimedia.org/P39930 and previous config saved to /var/cache/conftool/dbconfig/20221116-135929-ladsgroup.json [13:59:34] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T1400). [14:00:04] matthiasmullie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:09] o/ [14:00:09] o/ [14:00:25] o/ [14:00:32] matthiasmullie: do you want to self-service? [14:00:38] yeah sure! [14:00:42] ok! [14:00:50] (I looked at the patch earlier and it looked good to me) [14:00:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P39931 and previous config saved to /var/cache/conftool/dbconfig/20221116-140050-ladsgroup.json [14:01:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [extensions/PageImages] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857426 (https://phabricator.wikimedia.org/T323152) (owner: 10Matthias Mullie) [14:02:20] Thanks for that! [14:02:23] Starting [14:03:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T318605)', diff saved to https://phabricator.wikimedia.org/P39932 and previous config saved to /var/cache/conftool/dbconfig/20221116-140345-ladsgroup.json [14:03:50] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:04:18] (03PS1) 10BBlack: Update check_fresh_files_in_dir for python3 [puppet] - 10https://gerrit.wikimedia.org/r/857623 (https://phabricator.wikimedia.org/T321309) [14:04:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [14:05:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [14:05:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1003.eqiad.wmnet to drbd [14:06:51] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff) [14:07:21] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) I contacted Abhas in private, proving the request was
legitimate. Thank you and apologies for any problem caused! [14:07:57] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) a:03jcrespo [14:09:11] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) [14:11:03] (03PS2) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 [14:12:02] (03CR) 10CI reject: [V: 04-1] Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (owner: 10Dbrant) [14:14:27] (03PS3) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 [14:14:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P39933 and previous config saved to /var/cache/conftool/dbconfig/20221116-141435-ladsgroup.json [14:14:53] (03Merged) 10jenkins-bot: Ensure array is passed to getProperties [extensions/PageImages] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857426 (https://phabricator.wikimedia.org/T323152) (owner: 10Matthias Mullie) [14:15:22] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:857426|Ensure array is passed to getProperties (T323152)]] [14:15:27] T323152: Thumbnails not appearing in search on the beta cluster - https://phabricator.wikimedia.org/T323152 [14:15:50] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:857426|Ensure array is passed to getProperties (T323152)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:15:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P39934 and previous config saved to 
/var/cache/conftool/dbconfig/20221116-141556-ladsgroup.json [14:16:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1003.eqiad.wmnet to drbd [14:16:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:17:30] 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10jbond) Tempted to mark this as a duplicate of T255124, As [[ https://phabricator.wikimedia.org/T255124#6215459 | mentioned there ]] i think the be... [14:18:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P39935 and previous config saved to /var/cache/conftool/dbconfig/20221116-141851-ladsgroup.json [14:22:32] 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10MoritzMuehlenhoff) Alternatively we could simply add an Icinga alert? Something which cats the entire Apache config to one file, feeds it to apache... [14:24:57] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:857426|Ensure array is passed to getProperties (T323152)]] (duration: 09m 34s) [14:25:03] T323152: Thumbnails not appearing in search on the beta cluster - https://phabricator.wikimedia.org/T323152 [14:25:13] 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10jcrespo) > What was the specific change that was deployed. What was the specific change change that caused the issue? f76e73e6a (gitpuppet for pri... 
[14:25:55] !log UTC afternoon backport done [14:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1003.eqiad.wmnet to plain [14:27:07] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.changedisk (exit_code=99) for changing disk type of ml-etcd1003.eqiad.wmnet to plain [14:27:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1003.eqiad.wmnet to plain [14:27:16] 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10jcrespo) > Tempted to mark this as a duplicate of T255124 That, up to you, but the IMHO most important part mentioned at T323208#8399531 are not a... [14:27:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1003.eqiad.wmnet to plain [14:28:16] (03CR) 10Elukey: [C: 03+1] benthos: apply batching to webrequest_live (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi) [14:29:15] (03CR) 10Elukey: [C: 03+2] turnilo: add cache_status to webrequest_live_sampled [puppet] - 10https://gerrit.wikimedia.org/r/857476 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [14:29:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P39936 and previous config saved to /var/cache/conftool/dbconfig/20221116-142942-ladsgroup.json [14:30:17] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:30:35] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:30:39] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:40] (03CR) 10Elukey: [C: 03+2] istio: change configs to adapt for 1.15.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/855967 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [14:30:49] (03PS16) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [14:31:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T318605)', diff saved to https://phabricator.wikimedia.org/P39937 and previous config saved to /var/cache/conftool/dbconfig/20221116-143103-ladsgroup.json [14:31:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:31:08] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:31:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:33:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1112.eqiad.wmnet with reason: Maintenance [14:33:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P39938 and previous config saved to /var/cache/conftool/dbconfig/20221116-143358-ladsgroup.json [14:34:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1112.eqiad.wmnet with reason: Maintenance [14:34:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:34:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:34:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T323214)', diff saved to https://phabricator.wikimedia.org/P39939 and previous config saved to /var/cache/conftool/dbconfig/20221116-143432-ladsgroup.json [14:34:37] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [14:34:49] (CR) Filippo Giunchedi: [C: +2] benthos: apply batching to webrequest_live [puppet] - https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi) [14:34:54] (PS2) Filippo Giunchedi: benthos: apply batching to webrequest_live [puppet] - https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214) [14:35:23] (CR) Filippo Giunchedi: [V: +2] benthos: apply batching to webrequest_live [puppet] - https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi) [14:36:28] (PS3) Filippo Giunchedi: benthos: apply batching to webrequest_live [puppet] - https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214) [14:36:30] (PS2) Filippo Giunchedi: benthos: fix service name [puppet] - https://gerrit.wikimedia.org/r/857544 (https://phabricator.wikimedia.org/T319214) [14:36:31] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:32] (PS2) Filippo Giunchedi: benthos: reload on config changes [puppet] - https://gerrit.wikimedia.org/r/857545 (https://phabricator.wikimedia.org/T319214) [14:37:26] (CR) Elukey: [C: +1] benthos: fix service name [puppet] - https://gerrit.wikimedia.org/r/857544 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi) [14:37:28] (CR) Filippo Giunchedi: [V: 
+2] benthos: apply batching to webrequest_live [puppet] - https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi) [14:38:15] Puppet, SRE, SRE-tools, Infrastructure-Foundations, and 4 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (MoritzMuehlenhoff) Open→Declined This task was opened 2.5 years ago as part of work to systematically port scripts acro... [14:38:43] (CR) Elukey: [C: +1] benthos: reload on config changes [puppet] - https://gerrit.wikimedia.org/r/857545 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi) [14:39:44] !log draining ganeti1019 for eventual reimage T311687 [14:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:49] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [14:40:17] (CR) Filippo Giunchedi: [C: +2] benthos: reload on config changes [puppet] - https://gerrit.wikimedia.org/r/857545 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi) [14:40:19] (CR) Filippo Giunchedi: [C: +2] benthos: fix service name [puppet] - https://gerrit.wikimedia.org/r/857544 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi) [14:40:40] !log upgrade idp1002 to CAS 6.6 T311235 [14:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:44] T311235: Update CAS to 6.6 - https://phabricator.wikimedia.org/T311235 [14:40:51] !log krinkle@deploy1002 Started deploy [performance/navtiming@25691da]: (no justification provided) [14:40:58] !log krinkle@deploy1002 Finished deploy [performance/navtiming@25691da]: (no justification provided) (duration: 00m 07s) [14:43:06] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Ottomata) Hi, checking in, any updates here? 
Thank you! Also CC @BTullis and @Stevemunene [14:43:28] (PS1) Filippo Giunchedi: benthos: fix required 'content' for absented systemd::service [puppet] - https://gerrit.wikimedia.org/r/857648 (https://phabricator.wikimedia.org/T319214) [14:43:59] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:44:03] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:44:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T323214)', diff saved to https://phabricator.wikimedia.org/P39940 and previous config saved to /var/cache/conftool/dbconfig/20221116-144448-ladsgroup.json [14:44:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [14:44:54] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [14:45:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [14:45:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T323214)', diff saved to https://phabricator.wikimedia.org/P39941 and previous config saved to /var/cache/conftool/dbconfig/20221116-144510-ladsgroup.json [14:45:21] (CR) Filippo Giunchedi: [C: +2] benthos: fix required 'content' for absented systemd::service [puppet] - https://gerrit.wikimedia.org/r/857648 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi) [14:48:33] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [14:49:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T318605)', diff saved to 
https://phabricator.wikimedia.org/P39942 and previous config saved to /var/cache/conftool/dbconfig/20221116-144904-ladsgroup.json [14:49:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [14:49:10] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:49:16] (PS17) Vgutierrez: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson) [14:49:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [14:49:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T318605)', diff saved to https://phabricator.wikimedia.org/P39943 and previous config saved to /var/cache/conftool/dbconfig/20221116-144926-ladsgroup.json [14:52:32] (PS4) Cathal Mooney: Add function to expose required device VRFs to Homer templates [software/homer/deploy] - https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) [14:57:20] (CR) Herron: [C: +2] dispatch: add apache redirect from default org to wikimedia org [puppet] - https://gerrit.wikimedia.org/r/856612 (https://phabricator.wikimedia.org/T313229) (owner: Herron) [14:58:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T323214)', diff saved to https://phabricator.wikimedia.org/P39944 and previous config saved to /var/cache/conftool/dbconfig/20221116-145826-ladsgroup.json [14:58:32] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [14:59:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet [15:04:53] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. 
[15:04:54] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [15:07:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet [15:08:07] (PS8) Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [15:12:41] (CR) Ladsgroup: [C: +2] "Manuel is enjoying the sun of Helsinki (or lack thereof), +2ing." [software/schema-changes] - https://gerrit.wikimedia.org/r/857594 (https://phabricator.wikimedia.org/T323214) (owner: Ladsgroup) [15:13:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P39945 and previous config saved to /var/cache/conftool/dbconfig/20221116-151333-ladsgroup.json [15:15:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:15:04] (Merged) jenkins-bot: Add 2022/fix_flaggedrevs_unsigned_T323214.py [software/schema-changes] - https://gerrit.wikimedia.org/r/857594 (https://phabricator.wikimedia.org/T323214) (owner: Ladsgroup) [15:15:43] (CR) Effie Mouzeli: [C: +2] C:postgresql::master: add support for multiple replicas [puppet] - https://gerrit.wikimedia.org/r/857561 (owner: Effie Mouzeli) [15:16:00] (PS3) Effie Mouzeli: C:postgresql::master: add support for multiple replicas [puppet] - https://gerrit.wikimedia.org/r/857561 [15:16:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet [15:18:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T323214)', diff saved to https://phabricator.wikimedia.org/P39946 and previous config saved to /var/cache/conftool/dbconfig/20221116-151849-ladsgroup.json [15:18:54] T323214: 
Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [15:19:41] (PS6) Effie Mouzeli: maps: add support for replication slots [puppet] - https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) [15:19:55] SRE, ops-codfw: Troubleshoot why latest idrac version is not working on Dell servers - https://phabricator.wikimedia.org/T322419 (jbond) notes to self we can set the DNSRacName with ` pp(r.request('patch', '/redfish/v1/Managers/iDRAC.Embedded.1/EthernetInterfaces/NIC.1', json={'HostName' : 'sretest1001... [15:23:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet [15:24:13] !log installing pixman security updates on bullseye [15:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:26] (CR) Hnowlan: [C: +1] maps: add support for replication slots [puppet] - https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) (owner: Effie Mouzeli) [15:26:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet [15:27:24] (CR) Filippo Giunchedi: [C: +2] pki: move root common settings to profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi) [15:27:32] (CR) Effie Mouzeli: [V: +2] "After merging I7d8fe42921149240e4a04b25a229a220055a97de, PCC is ok https://puppet-compiler.wmflabs.org/output/857505/38240/" [puppet] - https://gerrit.wikimedia.org/r/857505 (https://phabricator.wikimedia.org/T290149) (owner: Effie Mouzeli) [15:28:07] (CR) Hnowlan: [C: -1] maps: enable replication slots on maps1009 and maps1008 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) (owner: Effie Mouzeli) [15:28:23] (PS1) Filippo Giunchedi: hieradata: move multirootca standard 
settings to profile [puppet] - https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) [15:28:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P39947 and previous config saved to /var/cache/conftool/dbconfig/20221116-152839-ladsgroup.json [15:29:34] (CR) CI reject: [V: -1] hieradata: move multirootca standard settings to profile [puppet] - https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi) [15:29:40] (PS18) Vgutierrez: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson) [15:31:05] !log installing vim security updates on buster [15:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:19] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P39948 and previous config saved to /var/cache/conftool/dbconfig/20221116-153355-ladsgroup.json [15:35:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet [15:36:16] (PS2) Filippo Giunchedi: hieradata: move multirootca standard settings to profile [puppet] - https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) [15:36:51] (CR) CI reject: [V: -1] hieradata: move multirootca standard settings to profile [puppet] - https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi) [15:37:05] (PS1) JHathaway: aux-k8s: allow kubepods to talk to pki [puppet] - https://gerrit.wikimedia.org/r/857668 
(https://phabricator.wikimedia.org/T321120) [15:38:03] (CR) JHathaway: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38243/console" [puppet] - https://gerrit.wikimedia.org/r/857668 (https://phabricator.wikimedia.org/T321120) (owner: JHathaway) [15:38:51] (CR) Hnowlan: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/857505 (https://phabricator.wikimedia.org/T290149) (owner: Effie Mouzeli) [15:39:33] !log initiating Cassandra bootstrap, aqs1017-a -- T307802 [15:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:38] T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 [15:40:58] (PS7) Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) [15:41:09] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:11] (PS8) Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) [15:41:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet [15:41:33] RECOVERY - cassandra-a service on aqs1017 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:42:13] RECOVERY - cassandra-a SSL 10.64.16.74:7001 on aqs1017 is OK: SSL OK - Certificate aqs1017-a valid until 2024-11-08 15:06:20 +0000 (expires in 722 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:42:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1123.eqiad.wmnet with reason: Maintenance [15:42:36] (CR) Effie Mouzeli: [C: +2] maps: add support for replication slots [puppet] - https://gerrit.wikimedia.org/r/857067 
(https://phabricator.wikimedia.org/T290149) (owner: Effie Mouzeli) [15:42:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1123.eqiad.wmnet with reason: Maintenance [15:43:13] PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:43:15] (CR) Cwhite: [C: +2] Add bullseye support. [debs/prometheus-logstash-exporter] - https://gerrit.wikimedia.org/r/857049 (https://phabricator.wikimedia.org/T321410) (owner: Cwhite) [15:43:17] (CR) Effie Mouzeli: [C: +2] maps: add support for replication slots (1 comment) [puppet] - https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) (owner: Effie Mouzeli) [15:43:32] ^ expected due to ganeti2012 reboot [15:43:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T323214)', diff saved to https://phabricator.wikimedia.org/P39950 and previous config saved to /var/cache/conftool/dbconfig/20221116-154346-ladsgroup.json [15:43:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [15:43:51] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [15:44:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [15:44:15] (CR) Jbond: [C: +1] pki: move root common settings to profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi) [15:44:25] (CR) JHathaway: [V: +1 C: +2] aux-k8s: allow kubepods to talk to pki [puppet] - https://gerrit.wikimedia.org/r/857668 (https://phabricator.wikimedia.org/T321120) (owner: JHathaway) [15:44:33] PROBLEM - Host kubetcd2005 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:55] (PS3) Jbond: hieradata: move multirootca standard settings 
to profile [puppet] - https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi) [15:45:33] RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms [15:45:49] RECOVERY - Host kubetcd2005 is UP: PING OK - Packet loss = 0%, RTA = 33.37 ms [15:46:04] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38245/console" [puppet] - https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi) [15:46:19] (CR) Jbond: [C: +1] "LGTm thanks" [puppet] - https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi) [15:47:11] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [15:47:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet [15:47:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [15:48:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [15:49:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P39951 and previous config saved to /var/cache/conftool/dbconfig/20221116-154902-ladsgroup.json [15:50:35] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:51:17] (CR) Filippo Giunchedi: [C: +2] hieradata: move multirootca standard settings to profile [puppet] - https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi) [15:51:47] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host ganeti2013.codfw.wmnet [15:51:59] (CR) Effie Mouzeli: "PCC ok https://puppet-compiler.wmflabs.org/output/857077/38244/" [puppet] - https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) (owner: Effie Mouzeli) [15:53:07] PROBLEM - Host kubestagetcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:20] ^ expected due to ganeti2013 reboot [15:53:37] (PS4) Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - https://gerrit.wikimedia.org/r/856996 [15:53:43] (CR) Hnowlan: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38246/console" [puppet] - https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) (owner: Effie Mouzeli) [15:53:54] Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (bking) >>! In T321874#8373186, @MoritzMuehlenhoff wrote: > The problems of deployment-prep are a matter of resourcing, (lack of) team ownership, processes and prioritizati... 
[15:55:27] (PS19) Vgutierrez: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson) [15:55:35] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:58:45] (JobUnavailable) firing: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:59:04] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) [15:59:33] (PS20) Vgutierrez: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson) [15:59:44] (PS9) Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) [16:01:17] PROBLEM - Check systemd state on ms-be1042 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:29] (CR) Effie Mouzeli: [C: +2] maps: enable replication slots on maps1009 and maps1008 [puppet] - https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) (owner: Effie Mouzeli) [16:03:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:04:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T323214)', diff saved to https://phabricator.wikimedia.org/P39952 and previous config saved to 
/var/cache/conftool/dbconfig/20221116-160408-ladsgroup.json [16:04:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [16:04:15] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [16:04:19] !log powercycling ganeti2013, stuck on reboot [16:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [16:05:57] (PS21) Vgutierrez: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson) [16:07:28] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ganeti2013.codfw.wmnet [16:11:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [16:11:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [16:11:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T323214)', diff saved to https://phabricator.wikimedia.org/P39953 and previous config saved to /var/cache/conftool/dbconfig/20221116-161132-ladsgroup.json [16:11:37] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [16:11:52] ops-codfw: Broken disk on ganeti2013 - https://phabricator.wikimedia.org/T323220 (MoritzMuehlenhoff) [16:12:03] PROBLEM - Host ganeti2013 is DOWN: PING CRITICAL - Packet loss = 100% [16:12:37] (PS22) Vgutierrez: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson) [16:15:02] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1166.eqiad.wmnet with reason: Maintenance [16:15:13] SRE, LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (jcrespo) [16:15:16] (CR) Vgutierrez: Varnish analytics: support differential privacy (3 comments) [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson) [16:15:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1166.eqiad.wmnet with reason: Maintenance [16:15:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T321130)', diff saved to https://phabricator.wikimedia.org/P39954 and previous config saved to /var/cache/conftool/dbconfig/20221116-161522-marostegui.json [16:15:27] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [16:15:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T318605)', diff saved to https://phabricator.wikimedia.org/P39955 and previous config saved to /var/cache/conftool/dbconfig/20221116-161529-ladsgroup.json [16:15:34] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:16:24] SRE, LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (jcrespo) For the record, the UID/CN on LDAP associated with the corporate LDAP/email is: Abhas, I updated it on the request. 
[16:16:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:17:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:19:05] ops-codfw: Broken disk on ganeti2013 - https://phabricator.wikimedia.org/T323220 (MoritzMuehlenhoff) The server first needs to be fully drained, before it can be shut down for maintenance, will update the task when ready. [16:21:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:23:43] (CR) Vgutierrez: Varnish analytics: support differential privacy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson) [16:24:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321130)', diff saved to https://phabricator.wikimedia.org/P39956 and previous config saved to /var/cache/conftool/dbconfig/20221116-162444-marostegui.json [16:24:50] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [16:25:22] (PS1) Vgutierrez: hieradata: Disable THP for jemalloc/varnish@cp2042 [puppet] - https://gerrit.wikimedia.org/r/857686 (https://phabricator.wikimedia.org/T322903) [16:26:57] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1042 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:27:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 
(T323214)', diff saved to https://phabricator.wikimedia.org/P39957 and previous config saved to /var/cache/conftool/dbconfig/20221116-162746-ladsgroup.json [16:27:51] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [16:28:21] (CR) Ssingh: [C: +1] hieradata: Disable THP for jemalloc/varnish@cp2042 [puppet] - https://gerrit.wikimedia.org/r/857686 (https://phabricator.wikimedia.org/T322903) (owner: Vgutierrez) [16:28:53] RECOVERY - Check systemd state on ms-be1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:25] (PS3) Clément Goubert: apple-search: Remove DNS records [dns] - https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) [16:30:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P39958 and previous config saved to /var/cache/conftool/dbconfig/20221116-163035-ladsgroup.json [16:30:59] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:31:30] (PS1) Jcrespo: Add abhas (atripathi) to the list of LDAP only users for WMF group [puppet] - https://gerrit.wikimedia.org/r/857689 (https://phabricator.wikimedia.org/T323207) [16:31:51] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:33:44] (PS2) Vgutierrez: hieradata: Disable THP for jemalloc/varnish globally [puppet] - https://gerrit.wikimedia.org/r/857686 (https://phabricator.wikimedia.org/T322903) [16:35:12] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:35:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:35:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T323214)', diff saved to https://phabricator.wikimedia.org/P39959 and previous config saved to /var/cache/conftool/dbconfig/20221116-163531-ladsgroup.json [16:35:36] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [16:36:40] (CR) Ssingh: [C: +1] hieradata: Disable THP for jemalloc/varnish globally [puppet] - https://gerrit.wikimedia.org/r/857686 (https://phabricator.wikimedia.org/T322903) (owner: Vgutierrez) [16:37:06] (PS3) Clément Goubert: apple-search: Switch lvs state to service_setup [puppet] - https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) [16:37:18] Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (jhathaway) > How would this be different under Ansible? > > * I could render the template live on the server before committing > changes, so I wouldn't make the mistake... 
[16:37:40] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (jcrespo) Open→In progress [16:39:26] (PS4) Clément Goubert: apple-search: Switch lvs state to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) [16:39:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P39960 and previous config saved to /var/cache/conftool/dbconfig/20221116-163951-marostegui.json [16:40:41] (CR) Vgutierrez: [C: +2] hieradata: Disable THP for jemalloc/varnish globally [puppet] - https://gerrit.wikimedia.org/r/857686 (https://phabricator.wikimedia.org/T322903) (owner: Vgutierrez) [16:42:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P39961 and previous config saved to /var/cache/conftool/dbconfig/20221116-164253-ladsgroup.json [16:43:11] (PS1) Hnowlan: profile::maps: remove chgrp_log [puppet] - https://gerrit.wikimedia.org/r/857697 [16:43:35] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:45:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P39962 and previous config saved to /var/cache/conftool/dbconfig/20221116-164542-ladsgroup.json [16:46:17] (PS2) Clément Goubert: apple-search: Remove service from lb and backend [puppet] - https://gerrit.wikimedia.org/r/857691 (https://phabricator.wikimedia.org/T316296) [16:46:23] (Abandoned) Effie Mouzeli: maps: enable postres replication slots in eqiad [puppet] - https://gerrit.wikimedia.org/r/857505 (https://phabricator.wikimedia.org/T290149) (owner: Effie Mouzeli) [16:47:51] RECOVERY - Host ganeti2013 is UP: PING OK - Packet loss = 0%, RTA = 33.29 ms [16:48:00] (PS1) Effie 
Mouzeli: maps: enable postgres replication slots in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/857704 (https://phabricator.wikimedia.org/T290149) [16:48:44] (03PS2) 10Jcrespo: Add abhas (atripathi) to the list of LDAP only users for WMF group [puppet] - 10https://gerrit.wikimedia.org/r/857689 (https://phabricator.wikimedia.org/T323207) [16:49:37] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms [16:50:00] (03PS4) 10Andrew Bogott: Patch cinder volume_type api to allow non-uuid project ids. [puppet] - 10https://gerrit.wikimedia.org/r/857073 (https://phabricator.wikimedia.org/T301949) [16:50:27] RECOVERY - Host kubestagetcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.32 ms [16:50:30] (03CR) 10Andrew Bogott: Patch cinder volume_type api to allow non-uuid project ids. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857073 (https://phabricator.wikimedia.org/T301949) (owner: 10Andrew Bogott) [16:50:31] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:50:44] (03PS3) 10Jcrespo: admin: Add abhas (atripathi) to the list of LDAP only users for WMF group [puppet] - 10https://gerrit.wikimedia.org/r/857689 (https://phabricator.wikimedia.org/T323207) [16:51:13] (03PS4) 10Jcrespo: admin: Add abhas (atripathi) to the list of LDAP only users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/857689 (https://phabricator.wikimedia.org/T323207) [16:51:21] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:53:41] PROBLEM - MD RAID on ganeti2013 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:53:43] ACKNOWLEDGEMENT - MD RAID on ganeti2013 is CRITICAL: CRITICAL: 
State: degraded, Active: 10, Working: 10, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T323222 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:53:47] 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (10ops-monitoring-bot) [16:53:55] (JobUnavailable) resolved: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:53:57] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/857704/38250/" [puppet] - 10https://gerrit.wikimedia.org/r/857704 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli) [16:54:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P39963 and previous config saved to /var/cache/conftool/dbconfig/20221116-165457-marostegui.json [16:55:05] (03CR) 10Hnowlan: [C: 03+1] maps: enable postgres replication slots in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/857704 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli) [16:57:55] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1042 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:58:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P39964 and previous config saved to /var/cache/conftool/dbconfig/20221116-165759-ladsgroup.json [17:00:33] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38253/console" [puppet] - 10https://gerrit.wikimedia.org/r/857070 
(https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [17:00:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T318605)', diff saved to https://phabricator.wikimedia.org/P39965 and previous config saved to /var/cache/conftool/dbconfig/20221116-170048-ladsgroup.json [17:00:53] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:06:17] 10SRE, 10Traffic, 10Patch-For-Review: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (10BCornwall) @Vgutierrez, while this doesn't have strict support for multiple ATS instances, bblack suggested that by simplifying all this it would... [17:07:05] (03CR) 10Vgutierrez: prometheus: Refactor ATS config monitoring (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [17:07:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [17:07:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [17:07:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T318605)', diff saved to https://phabricator.wikimedia.org/P39966 and previous config saved to /var/cache/conftool/dbconfig/20221116-170749-ladsgroup.json [17:07:54] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:07:58] (03CR) 10Effie Mouzeli: [C: 03+2] maps: enable postgres replication slots in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/857704 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli) [17:09:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T323214)', diff saved to https://phabricator.wikimedia.org/P39967 and previous config saved to 
/var/cache/conftool/dbconfig/20221116-170915-ladsgroup.json [17:09:20] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [17:10:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321130)', diff saved to https://phabricator.wikimedia.org/P39968 and previous config saved to /var/cache/conftool/dbconfig/20221116-171003-marostegui.json [17:10:08] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [17:12:40] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/857689 (https://phabricator.wikimedia.org/T323207) (owner: 10Jcrespo) [17:13:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T323214)', diff saved to https://phabricator.wikimedia.org/P39969 and previous config saved to /var/cache/conftool/dbconfig/20221116-171306-ladsgroup.json [17:13:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [17:13:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [17:13:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T323214)', diff saved to https://phabricator.wikimedia.org/P39970 and previous config saved to /var/cache/conftool/dbconfig/20221116-171316-ladsgroup.json [17:14:16] (03CR) 10Jcrespo: [C: 03+2] admin: Add abhas (atripathi) to the list of LDAP only users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/857689 (https://phabricator.wikimedia.org/T323207) (owner: 10Jcrespo) [17:17:39] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:17] (03PS1) 10Eevans: sessionstore: bump container version to v1.0.10 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/857711 (https://phabricator.wikimedia.org/T253244) [17:19:16] (03CR) 10Eevans: [C: 04-1] "Not yet; Scheduled for deployment on 2022-11-21" [deployment-charts] - 10https://gerrit.wikimedia.org/r/857711 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans) [17:24:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P39971 and previous config saved to /var/cache/conftool/dbconfig/20221116-172421-ladsgroup.json [17:25:21] (03PS6) 10Arturo Borrero Gonzalez: ceph: osd: introduce support for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) [17:26:09] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2008.codfw.wmnet [17:26:22] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [17:26:36] !log resyncing maps2008 postgres [17:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:46] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38254/console" [puppet] - 10https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [17:26:58] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [17:27:08] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [17:27:57] (03PS9) 10Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [17:28:47] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] apple-search: Switch lvs state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [17:29:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1166 (T323214)', diff saved to https://phabricator.wikimedia.org/P39972 and previous config saved to /var/cache/conftool/dbconfig/20221116-172924-ladsgroup.json [17:29:29] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [17:30:15] (03PS10) 10Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [17:30:19] (03CR) 10Vgutierrez: [C: 03+1] "this should be merged after I05460d5633b9143c07d009cfe5273d24b5675058, you can flag that dependency on the commit message with a Depends-O" [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [17:30:47] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38255/console" [puppet] - 10https://gerrit.wikimedia.org/r/857691 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [17:31:28] (03PS8) 10BCornwall: prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) [17:31:30] (03CR) 10BCornwall: "Vgutierrez, while this doesn't have strict support for multiple ATS instances, bblack suggested that by simplifying all this it would make" [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [17:32:06] (03CR) 10CI reject: [V: 04-1] prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [17:32:29] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:03] (03CR) 10Vgutierrez: "looking good, almost ready to be merged" 
[puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [17:34:39] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) 05In progress→03Resolved @Abhas: [[ https://ldap.toolforge.org/user/abhas | you have been added to the WMF ldap group ]]- which should provide you access to superset. **Please check acce... [17:36:48] (03PS2) 10Clément Goubert: apple-search: Remove service from service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/857706 (https://phabricator.wikimedia.org/T316296) [17:38:19] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:28] (03PS4) 10Clément Goubert: apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) [17:39:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P39973 and previous config saved to /var/cache/conftool/dbconfig/20221116-173928-ladsgroup.json [17:39:59] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P39975 and previous config saved to /var/cache/conftool/dbconfig/20221116-174430-ladsgroup.json [17:44:46] (03PS9) 10BCornwall: prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) [17:45:59] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [17:46:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (POST events) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:47:03] (03PS5) 10Clément Goubert: apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) [17:47:10] (03CR) 10Vgutierrez: prometheus: Refactor ATS config monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [17:48:49] (03PS6) 10Clément Goubert: apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) [17:49:25] (03PS7) 10Clément Goubert: apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) [17:49:41] (03CR) 10Dzahn: "Thank you! Would you like me to wait for testing? Or can it be merged and the test is that there is no error? From my side what I can and " [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn) [17:50:50] (03PS5) 10Cathal Mooney: Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) [17:51:13] (03PS2) 10Cathal Mooney: Unify routing-intstance config across JunOS devices [homer/public] - 10https://gerrit.wikimedia.org/r/857598 (https://phabricator.wikimedia.org/T312635) [17:51:27] (03PS3) 10Sergio Gimeno: GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 [17:51:51] (03PS4) 10Sergio Gimeno: GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 [17:51:56] (03CR) 10Sergio Gimeno: GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 (owner: 10Sergio Gimeno) [17:53:02] !log rolling restart of varnish to pick up 
changes in T322903 [17:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:07] T322903: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 [17:53:12] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38259/console" [puppet] - 10https://gerrit.wikimedia.org/r/857706 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [17:54:19] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:54:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T323214)', diff saved to https://phabricator.wikimedia.org/P39976 and previous config saved to /var/cache/conftool/dbconfig/20221116-175434-ladsgroup.json [17:54:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:54:40] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [17:54:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:54:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [17:55:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [17:55:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T323214)', diff saved to https://phabricator.wikimedia.org/P39977 and previous config saved to /var/cache/conftool/dbconfig/20221116-175511-ladsgroup.json [17:56:23] PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service,wmf_auto_restart_prometheus-blazegraph-exporter-wcqs-blazegraph.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:57:01] (03PS10) 10BCornwall: prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) [17:59:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P39978 and previous config saved to /var/cache/conftool/dbconfig/20221116-175937-ladsgroup.json [18:00:19] (03CR) 10BCornwall: prometheus: Refactor ATS config monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [18:00:21] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [18:01:07] (03CR) 10BCornwall: prometheus: Refactor ATS config monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [18:01:56] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38260/console" [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [18:10:20] !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=frwiki` at mwmaint1002 (T318457) [18:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:25] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [18:14:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T323214)', diff saved to https://phabricator.wikimedia.org/P39979 and previous config saved to /var/cache/conftool/dbconfig/20221116-181443-ladsgroup.json [18:14:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [18:14:49] 
T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [18:14:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [18:15:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T323214)', diff saved to https://phabricator.wikimedia.org/P39980 and previous config saved to /var/cache/conftool/dbconfig/20221116-181505-ladsgroup.json [18:20:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T323214)', diff saved to https://phabricator.wikimedia.org/P39981 and previous config saved to /var/cache/conftool/dbconfig/20221116-182059-ladsgroup.json [18:21:06] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [18:22:47] 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Clement_Goubert) **Service removal plan:** From https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service 1. Silence probes : `instance... [18:25:03] (03PS1) 10Volans: sre.hosts.provision: disable HostHeaderCheck [cookbooks] - 10https://gerrit.wikimedia.org/r/857725 [18:25:05] (03PS1) 10Volans: sre.hosts.provision: set iDRAC host/domain names [cookbooks] - 10https://gerrit.wikimedia.org/r/857726 [18:25:34] (03PS1) 10Dbrant: Introduce Import button for launching deeplink into app. [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857433 (https://phabricator.wikimedia.org/T313269) [18:25:46] (03CR) 10Volans: "To be tested on a host but should be ready for the eqsin refresh." [cookbooks] - 10https://gerrit.wikimedia.org/r/857726 (owner: 10Volans) [18:26:05] (03PS1) 10Dbrant: Don't make unnecessary API call(s) for anonymized reading list preview. 
[extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857434 [18:26:08] (03CR) 10Volans: "To be tested on a host but should be ready for the eqsin refresh." [cookbooks] - 10https://gerrit.wikimedia.org/r/857725 (owner: 10Volans) [18:26:36] (03CR) 10Volans: [C: 04-1] "Ignore my previous message, was for the other CR. This one should *not* be merged before the eqsin refresh is completed!" [cookbooks] - 10https://gerrit.wikimedia.org/r/857726 (owner: 10Volans) [18:33:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T323214)', diff saved to https://phabricator.wikimedia.org/P39982 and previous config saved to /var/cache/conftool/dbconfig/20221116-183336-ladsgroup.json [18:33:42] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [18:36:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P39983 and previous config saved to /var/cache/conftool/dbconfig/20221116-183605-ladsgroup.json [18:37:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T318605)', diff saved to https://phabricator.wikimedia.org/P39984 and previous config saved to /var/cache/conftool/dbconfig/20221116-183714-ladsgroup.json [18:37:19] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:45:54] (03PS1) 10Brennen Bearnes: local settings: add mysql.port [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/857734 (https://phabricator.wikimedia.org/T280597) [18:46:21] (03PS1) 10Dzahn: phabricator: pass missing mysql.port paramater to local settings [puppet] - 10https://gerrit.wikimedia.org/r/857736 (https://phabricator.wikimedia.org/T280597) [18:46:36] (03CR) 10Dzahn: [C: 03+1] local settings: add mysql.port [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/857734 
(https://phabricator.wikimedia.org/T280597) (owner: 10Brennen Bearnes) [18:47:38] (03CR) 10Dzahn: [C: 03+2] "should go together with https://gerrit.wikimedia.org/r/c/phabricator/deployment/+/857734" [puppet] - 10https://gerrit.wikimedia.org/r/857736 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:48:22] (03CR) 10Brennen Bearnes: [C: 03+1] "Paired here. This should effectively be a no-op until scap changes are applied and a deploy is run." [puppet] - 10https://gerrit.wikimedia.org/r/857736 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:48:27] (03PS1) 10Andrew Bogott: upgrade_openstack_node: Add db backups on cloudcontrols [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857737 [18:48:35] (03PS1) 10Jbond: redfish: Add reboot message id for new idrac versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740 [18:48:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P39986 and previous config saved to /var/cache/conftool/dbconfig/20221116-184843-ladsgroup.json [18:51:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P39987 and previous config saved to /var/cache/conftool/dbconfig/20221116-185112-ladsgroup.json [18:52:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P39988 and previous config saved to /var/cache/conftool/dbconfig/20221116-185220-ladsgroup.json [18:52:34] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] local settings: add mysql.port [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/857734 (https://phabricator.wikimedia.org/T280597) (owner: 10Brennen Bearnes) [18:56:18] !log brennen@deploy1002 Started deploy [phabricator/deployment@f68dc24]: deploy mysql.port value to local config (hopefully) [18:56:52] !log 
brennen@deploy1002 Finished deploy [phabricator/deployment@f68dc24]: deploy mysql.port value to local config (hopefully) (duration: 00m 34s) [18:58:54] (PS11) BCornwall: prometheus: Refactor ATS config monitoring [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) [19:00:04] brennen and jeena: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T1900). [19:00:05] brennen and jeena: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T1900). [19:00:10] (CR) CI reject: [V: -1] redfish: Add reboot message id for new idrac versions [software/spicerack] - https://gerrit.wikimedia.org/r/857740 (owner: Jbond) [19:00:24] o/ [19:02:12] !log train 1.40.0-wmf.10 (T320515) - no current blockers, rolling to group1. 
[19:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:17] T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 [19:03:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P39989 and previous config saved to /var/cache/conftool/dbconfig/20221116-190349-ladsgroup.json [19:03:56] (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/857742 (https://phabricator.wikimedia.org/T320515) [19:03:58] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/857742 (https://phabricator.wikimedia.org/T320515) (owner: TrainBranchBot) [19:05:19] (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38261/console" [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall) [19:06:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T323214)', diff saved to https://phabricator.wikimedia.org/P39990 and previous config saved to /var/cache/conftool/dbconfig/20221116-190618-ladsgroup.json [19:06:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [19:06:24] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [19:06:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [19:06:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T323214)', diff saved to https://phabricator.wikimedia.org/P39991 and previous config saved to /var/cache/conftool/dbconfig/20221116-190640-ladsgroup.json [19:07:26] (Merged) jenkins-bot: 
group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857742 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot) [19:07:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P39992 and previous config saved to /var/cache/conftool/dbconfig/20221116-190727-ladsgroup.json [19:11:11] !log Imported jwt-authorizer 1.1.0-1 to bullseye-wikimedia - T322691 [19:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:16] T322691: Build and import new release of jwt-authorizer (1.1.0) - https://phabricator.wikimedia.org/T322691 [19:11:45] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.10 refs T320515 [19:11:50] T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 [19:15:28] (03CR) 10Jdlrobson: [C: 03+1] "LGTM with one slight cautionary note." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (owner: 10Dbrant) [19:15:31] (03PS4) 10Jdlrobson: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (owner: 10Dbrant) [19:16:01] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.10 refs T320515 (duration: 04m 16s) [19:18:17] warnings here are higher than i'm really comfortable with and some canaries failed, i think i'm rolling this back to group0. 
[19:18:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T323214)', diff saved to https://phabricator.wikimedia.org/P39993 and previous config saved to /var/cache/conftool/dbconfig/20221116-191856-ladsgroup.json [19:18:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1179.eqiad.wmnet with reason: Maintenance [19:19:02] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [19:19:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1179.eqiad.wmnet with reason: Maintenance [19:19:23] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857746 (https://phabricator.wikimedia.org/T320515) [19:19:27] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857746 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot) [19:19:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T323214)', diff saved to https://phabricator.wikimedia.org/P39994 and previous config saved to /var/cache/conftool/dbconfig/20221116-191928-ladsgroup.json [19:20:37] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857746 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot) [19:21:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:24] (03PS1) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [19:22:05] (03CR) 10CI reject: [V: 04-1] varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 
(https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [19:22:20] (03PS2) 10Andrew Bogott: upgrade_openstack_node: Add db backups on cloudcontrols [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857737 [19:22:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T318605)', diff saved to https://phabricator.wikimedia.org/P39995 and previous config saved to /var/cache/conftool/dbconfig/20221116-192233-ladsgroup.json [19:22:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [19:22:38] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [19:22:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [19:22:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T318605)', diff saved to https://phabricator.wikimedia.org/P39996 and previous config saved to /var/cache/conftool/dbconfig/20221116-192254-ladsgroup.json [19:23:33] (03PS1) 10Vgutierrez: secret: Add empty varnish/dp.master.key [labs/private] - 10https://gerrit.wikimedia.org/r/857751 [19:24:47] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.8 refs T320515 [19:24:52] T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 [19:25:47] (03PS2) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [19:26:48] (03CR) 10CI reject: [V: 04-1] varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [19:28:31] (03CR) 10Andrew Bogott: [C: 03+2] upgrade_openstack_node: Add db backups on cloudcontrols [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857737 (owner: 10Andrew 
Bogott) [19:28:34] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.8 refs T320515 (duration: 03m 46s) [19:31:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:23] (03Merged) 10jenkins-bot: upgrade_openstack_node: Add db backups on cloudcontrols [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857737 (owner: 10Andrew Bogott) [19:32:40] (03PS3) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [19:33:16] (03CR) 10CI reject: [V: 04-1] varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [19:33:40] (03PS1) 10Slyngshede: If bug in configuration parser. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/857756 [19:34:19] (03PS4) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [19:35:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T323214)', diff saved to https://phabricator.wikimedia.org/P39997 and previous config saved to /var/cache/conftool/dbconfig/20221116-193540-ladsgroup.json [19:35:46] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [19:36:42] (03PS5) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 [19:38:49] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:40:35] (03PS5) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [19:40:40] (03CR) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (owner: 10Dbrant) [19:40:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T323214)', diff saved to https://phabricator.wikimedia.org/P39998 and previous config saved to /var/cache/conftool/dbconfig/20221116-194040-ladsgroup.json [19:42:53] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] secret: Add empty varnish/dp.master.key [labs/private] - 10https://gerrit.wikimedia.org/r/857751 (owner: 10Vgutierrez) [19:44:29] (03PS2) 10Jbond: redfish: Add reboot message id for new idrac versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740 [19:44:45] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:39] 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (10bking) >>! In T321874#8399960, @jhathaway wrote: >> How would this be different under Ansible? >> >> * I could render the template live on the server before committing >>... 
[19:48:00] (03PS3) 10Jbond: redfish: Add reboot message id for new idrac versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740 (https://phabricator.wikimedia.org/T322419) [19:48:33] (03PS6) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [19:49:31] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [19:49:45] (03PS4) 10Jbond: redfish: Add reboot message id for new idrac versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740 (https://phabricator.wikimedia.org/T322419) [19:50:25] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: postgresql@11-main.service,prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:32] (03PS7) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [19:50:44] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [19:50:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P39999 and previous config saved to /var/cache/conftool/dbconfig/20221116-195046-ladsgroup.json [19:51:53] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38264/console" [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [19:52:41] (03PS1) 10Andrew Bogott: upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 [19:54:28] (03PS8) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 
10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [19:55:07] (03CR) 10CI reject: [V: 04-1] varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [19:55:16] sigh [19:55:20] time to stop working I guess [19:55:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P40000 and previous config saved to /var/cache/conftool/dbconfig/20221116-195546-ladsgroup.json [19:56:10] (03PS9) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [19:58:49] (03CR) 10CI reject: [V: 04-1] upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 (owner: 10Andrew Bogott) [19:59:22] (03PS2) 10Andrew Bogott: upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 [20:02:44] (03CR) 10CI reject: [V: 04-1] upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 (owner: 10Andrew Bogott) [20:02:46] (03PS1) 10Urbanecm: updateIsActiveFlagForMentees: Treat "no edits" user correctly [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/857437 (https://phabricator.wikimedia.org/T318457) [20:03:02] (03PS1) 10Urbanecm: updateIsActiveFlagForMentees: Treat "no edits" user correctly [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857438 (https://phabricator.wikimedia.org/T318457) [20:03:20] (03CR) 10Volans: "Couple of optional suggestions inline" [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [20:03:34] (03CR) 10Volans: [C: 
03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740 (https://phabricator.wikimedia.org/T322419) (owner: 10Jbond) [20:05:29] (03PS3) 10Andrew Bogott: upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 [20:05:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P40001 and previous config saved to /var/cache/conftool/dbconfig/20221116-200553-ladsgroup.json [20:08:58] (03PS1) 10Jforrester: [Beta Cluster] Point statsd service to prometheus-labmon, cloudmetrics1001 decom'ed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857763 (https://phabricator.wikimedia.org/T297712) [20:09:11] (03CR) 10CI reject: [V: 04-1] upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 (owner: 10Andrew Bogott) [20:10:45] (03PS1) 10Jforrester: changeprop: Point Beta Cluster metrics to prometheus-labmon, cloudmetrics1002 is gone [deployment-charts] - 10https://gerrit.wikimedia.org/r/857765 (https://phabricator.wikimedia.org/T297712) [20:10:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P40002 and previous config saved to /var/cache/conftool/dbconfig/20221116-201053-ladsgroup.json [20:12:54] (03Abandoned) 10BCornwall: prometheus: Handle inactive trafficserver service [puppet] - 10https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [20:14:28] (03PS4) 10Andrew Bogott: upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 [20:15:48] (03CR) 10Vgutierrez: varnish: Generate a DP subkey daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) 
(owner: 10Vgutierrez) [20:18:16] (03CR) 10Andrew Bogott: [C: 03+2] upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 (owner: 10Andrew Bogott) [20:21:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T323214)', diff saved to https://phabricator.wikimedia.org/P40003 and previous config saved to /var/cache/conftool/dbconfig/20221116-202100-ladsgroup.json [20:21:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [20:21:06] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [20:21:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [20:21:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T323214)', diff saved to https://phabricator.wikimedia.org/P40004 and previous config saved to /var/cache/conftool/dbconfig/20221116-202121-ladsgroup.json [20:21:54] (03Merged) 10jenkins-bot: upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 (owner: 10Andrew Bogott) [20:22:41] (03PS10) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [20:24:23] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:35] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:26:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T323214)', diff saved to https://phabricator.wikimedia.org/P40005 and previous config saved to /var/cache/conftool/dbconfig/20221116-202602-ladsgroup.json [20:26:09] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [20:30:15] (03CR) 10Jbond: [C: 03+2] redfish: Add reboot message id for new idrac versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740 (https://phabricator.wikimedia.org/T322419) (owner: 10Jbond) [20:30:17] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:31] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:37:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T323214)', diff saved to https://phabricator.wikimedia.org/P40006 and previous config saved to /var/cache/conftool/dbconfig/20221116-203749-ladsgroup.json [20:37:57] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [20:41:27] !log [finished] rolling restart of varnish to pick up changes in T322903 [20:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:31] T322903: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 [20:44:26] (03Merged) 10jenkins-bot: redfish: Add reboot message id for new idrac versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740 (https://phabricator.wikimedia.org/T322419) (owner: 10Jbond) [20:48:10] jouncebot: now [20:48:10] For the next 0 hour(s) and 11 minute(s): MediaWiki train - Utc-7 Version 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T1900) [20:52:27] brennen: am I interfering with train if I kick jenkins real quick? [20:52:45] thcipriani: go for it. [20:52:49] * thcipriani does [20:52:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P40007 and previous config saved to /var/cache/conftool/dbconfig/20221116-205255-ladsgroup.json [20:53:18] !log restarting jenkins for update [20:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T318605)', diff saved to https://phabricator.wikimedia.org/P40008 and previous config saved to /var/cache/conftool/dbconfig/20221116-205347-ladsgroup.json [20:53:52] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:55:35] (03PS1) 10Urbanecm: GrowthExperiments: Run updateIsActiveFlagForMentees weekly [puppet] - 10https://gerrit.wikimedia.org/r/857776 (https://phabricator.wikimedia.org/T318457) [20:56:13] (03CR) 10Urbanecm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/857776 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [20:57:57] (03PS2) 10Urbanecm: [Growth] Do not override wgGEMentorshipUseIsActiveFlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853482 (https://phabricator.wikimedia.org/T318457) [20:59:17] (03PS6) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) [20:59:42] (03CR) 10Andrea Denisse: Lower the TTL for netbox for the migration. 
(031 comment) [dns] - 10https://gerrit.wikimedia.org/r/856065 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T2100). [21:00:04] dbrant and Urbanecm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:06] (03Abandoned) 10Andrea Denisse: Lower the TTL for netbox for the migration. [dns] - 10https://gerrit.wikimedia.org/r/856065 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [21:00:08] (03CR) 10Urbanecm: Enable Reading Lists landing page on a few smaller wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [21:00:17] I can deploy today [21:00:21] hi dbrant, are you around? [21:00:34] * dbrant is present [21:00:43] great! [21:00:48] (03CR) 10Urbanecm: [C: 03+2] Don't make unnecessary API call(s) for anonymized reading list preview. [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857434 (owner: 10Dbrant) [21:00:54] (03CR) 10Urbanecm: [C: 03+2] Introduce Import button for launching deeplink into app. [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857433 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [21:00:59] dbrant: I posted a quick question in the config patch, can you have a look please? 
[21:01:15] urbanecm: yep, looking [21:02:06] (03CR) 10Urbanecm: [C: 03+2] updateIsActiveFlagForMentees: Treat "no edits" user correctly [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/857437 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [21:02:12] (03CR) 10Urbanecm: [C: 03+2] updateIsActiveFlagForMentees: Treat "no edits" user correctly [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857438 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [21:03:12] (03Merged) 10jenkins-bot: Don't make unnecessary API call(s) for anonymized reading list preview. [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857434 (owner: 10Dbrant) [21:03:18] (03Merged) 10jenkins-bot: Introduce Import button for launching deeplink into app. [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857433 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [21:03:33] (03CR) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [21:03:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857434 (owner: 10Dbrant) [21:03:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857433 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [21:04:09] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:857434|Don't make unnecessary API call(s) for anonymized reading list preview.]], [[gerrit:857433|Introduce Import button for launching deeplink into app. 
(T313269)]] [21:04:14] T313269: Shareable Reading Lists - https://phabricator.wikimedia.org/T313269 [21:04:36] (03CR) 10Urbanecm: Enable Reading Lists landing page on a few smaller wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [21:08:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P40009 and previous config saved to /var/cache/conftool/dbconfig/20221116-210802-ladsgroup.json [21:08:42] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [21:08:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P40010 and previous config saved to /var/cache/conftool/dbconfig/20221116-210854-ladsgroup.json [21:09:05] !log urbanecm@deploy1002 urbanecm and dbrant: Backport for [[gerrit:857434|Don't make unnecessary API call(s) for anonymized reading list preview.]], [[gerrit:857433|Introduce Import button for launching deeplink into app. (T313269)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:09:17] dbrant: can you check the two backports at mwdebug1001 now please? [21:10:19] checking, and... [21:10:57] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [21:11:44] urbanecm: I believe it's updated, but it's also dependent on the config change. 
[21:11:56] i see, we can do that one next :) [21:16:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:18:20] (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [21:19:26] (03PS7) 10Urbanecm: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [21:19:32] (03CR) 10Urbanecm: [C: 03+2] Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [21:20:34] (03Merged) 10jenkins-bot: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [21:20:36] (03Merged) 10jenkins-bot: updateIsActiveFlagForMentees: Treat "no edits" user correctly [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/857437 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [21:20:39] (03Merged) 10jenkins-bot: updateIsActiveFlagForMentees: Treat "no edits" user correctly [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857438 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [21:21:44] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:857434|Don't make unnecessary API call(s) for anonymized reading list preview.]], [[gerrit:857433|Introduce Import button for launching deeplink into app. 
(T313269)]] (duration: 17m 34s) [21:21:49] T313269: Shareable Reading Lists - https://phabricator.wikimedia.org/T313269 [21:22:10] finally [21:22:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [21:22:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/857437 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [21:22:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857438 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [21:22:55] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:857621|Enable Reading Lists landing page on a few smaller wikis. (T313269)]], [[gerrit:857437|updateIsActiveFlagForMentees: Treat "no edits" user correctly (T318457)]], [[gerrit:857438|updateIsActiveFlagForMentees: Treat "no edits" user correctly (T318457)]] [21:23:01] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [21:23:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T323214)', diff saved to https://phabricator.wikimedia.org/P40011 and previous config saved to /var/cache/conftool/dbconfig/20221116-212309-ladsgroup.json [21:23:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [21:23:14] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [21:23:21] !log urbanecm@deploy1002 urbanecm and urbanecm and dbrant: Backport for [[gerrit:857621|Enable Reading Lists landing page on a few smaller wikis. 
(T313269)]], [[gerrit:857437|updateIsActiveFlagForMentees: Treat "no edits" user correctly (T318457)]], [[gerrit:857438|updateIsActiveFlagForMentees: Treat "no edits" user correctly (T318457)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:23:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [21:23:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T323214)', diff saved to https://phabricator.wikimedia.org/P40012 and previous config saved to /var/cache/conftool/dbconfig/20221116-212330-ladsgroup.json [21:23:35] dbrant: config patch's at mwdebug1001 now, can you check? [21:24:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P40013 and previous config saved to /var/cache/conftool/dbconfig/20221116-212400-ladsgroup.json [21:24:26] urbanecm: yay! looks good [21:24:52] great, syncing! [21:29:01] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:857621|Enable Reading Lists landing page on a few smaller wikis. 
(T313269)]], [[gerrit:857437|updateIsActiveFlagForMentees: Treat "no edits" user correctly (T318457)]], [[gerrit:857438|updateIsActiveFlagForMentees: Treat "no edits" user correctly (T318457)]] (duration: 06m 05s) [21:29:02] (03PS3) 10Urbanecm: [Growth] Do not override wgGEMentorshipUseIsActiveFlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853482 (https://phabricator.wikimedia.org/T318457) [21:29:05] (03CR) 10Urbanecm: [C: 03+2] [Growth] Do not override wgGEMentorshipUseIsActiveFlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853482 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [21:29:07] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [21:29:07] T313269: Shareable Reading Lists - https://phabricator.wikimedia.org/T313269 [21:29:11] dbrant: and all live! [21:29:13] anything else? [21:29:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853482 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [21:29:36] urbanecm: awesome, thanks as always! [21:29:42] no worries :) [21:30:05] (03PS8) 10Andrea Denisse: netmon: Open LibreNMS port for netmon2002. 
[puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) [21:30:19] (03Merged) 10jenkins-bot: [Growth] Do not override wgGEMentorshipUseIsActiveFlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853482 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [21:30:42] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:853482|[Growth] Do not override wgGEMentorshipUseIsActiveFlag (T318457)]] [21:31:06] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:853482|[Growth] Do not override wgGEMentorshipUseIsActiveFlag (T318457)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [21:31:55] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38266/console" [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [21:33:12] (03CR) 10Jforrester: [C: 03+1] Add w/api/index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856030 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [21:35:43] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [21:37:26] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:853482|[Growth] Do not override wgGEMentorshipUseIsActiveFlag (T318457)]] (duration: 06m 43s) [21:37:32] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [21:37:41] RECOVERY - High average POST 
latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [21:37:44] that should be all from me [21:38:03] !log Late UTC backport window done [21:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T318605)', diff saved to https://phabricator.wikimedia.org/P40014 and previous config saved to /var/cache/conftool/dbconfig/20221116-213907-ladsgroup.json [21:39:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [21:39:12] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [21:39:22] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@e08e32e]: (no justification provided) [21:39:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [21:39:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T318605)', diff saved to https://phabricator.wikimedia.org/P40015 and previous config saved to /var/cache/conftool/dbconfig/20221116-213928-ladsgroup.json [21:39:43] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@e08e32e]: (no justification provided) (duration: 00m 20s) [21:41:39] !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php` for all wikis in growthexperiments.dblist (T318457) [21:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:57] (03PS9) 10Andrea Denisse: netmon: Open LibreNMS port for netmon2002. 
[puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) [21:43:02] (03PS1) 10Herron: dispatch: upgrade to 20221110 and build with local config.js [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857781 (https://phabricator.wikimedia.org/T313229) [21:43:30] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38267/console" [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [21:46:12] (03CR) 10Herron: "Approaching these at the same time since config.js changed significantly between versions" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857781 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [21:47:20] (03PS1) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 [21:55:39] (03CR) 10CI reject: [V: 04-1] redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [21:55:59] (03PS1) 10Brennen Bearnes: specialpage: Silence known violation unsafe RequestContext changes [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857439 (https://phabricator.wikimedia.org/T323184) [21:56:23] jouncebot: nowandnext [21:56:23] For the next 0 hour(s) and 3 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T2100) [21:56:23] In 9 hour(s) and 3 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T0700) [21:58:49] 10SRE, 10ops-codfw: Broken disk on ganeti2013 - https://phabricator.wikimedia.org/T323220 (10Dzahn) possibly duplicate of automatically generated T323222 [21:59:00] (03CR) 10Brennen Bearnes: [C: 03+2] specialpage: Silence known violation unsafe RequestContext changes [core] 
(wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857439 (https://phabricator.wikimedia.org/T323184) (owner: 10Brennen Bearnes) [22:03:39] (03PS2) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 [22:04:31] (03PS1) 10Urbanecm: GrowthExperiments: Enable unstarred mentorship filters at all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857785 (https://phabricator.wikimedia.org/T318457) [22:07:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T323214)', diff saved to https://phabricator.wikimedia.org/P40016 and previous config saved to /var/cache/conftool/dbconfig/20221116-220710-ladsgroup.json [22:07:15] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [22:11:35] (03PS1) 10JHathaway: aux-k8s: fix pod ips for network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/857786 (https://phabricator.wikimedia.org/T321120) [22:14:05] (03Merged) 10jenkins-bot: specialpage: Silence known violation unsafe RequestContext changes [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857439 (https://phabricator.wikimedia.org/T323184) (owner: 10Brennen Bearnes) [22:15:47] 10SRE, 10Traffic-Icebox: Create dashboard showing aggregate data transfer rates per DC/cluster - https://phabricator.wikimedia.org/T284304 (10BCornwall) Thanks for all the feedback @Vgutierrez and @BBlack! Hopefully I've addressed all of your concerns. The dashboard at https://grafana.wikimedia.org/d/oMIu2XI4z... 
[22:16:13] 10SRE, 10Traffic-Icebox: Create dashboard showing aggregate data transfer rates per DC/cluster - https://phabricator.wikimedia.org/T284304 (10BCornwall) 05Open→03In progress [22:17:16] (03PS3) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 [22:18:02] (03CR) 10JHathaway: [C: 03+2] aux-k8s: fix pod ips for network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/857786 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [22:18:41] (03PS1) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857788 (https://phabricator.wikimedia.org/T273179) [22:20:38] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [22:20:41] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [22:20:48] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [22:20:52] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[22:22:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P40017 and previous config saved to /var/cache/conftool/dbconfig/20221116-222216-ladsgroup.json [22:24:14] (03PS1) 10Ladsgroup: wikimedia.org portal: Make portal assets also visible in the vhost [puppet] - 10https://gerrit.wikimedia.org/r/857789 (https://phabricator.wikimedia.org/T273179) [22:27:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857439 (https://phabricator.wikimedia.org/T323184) (owner: 10Brennen Bearnes) [22:27:45] !log brennen@deploy1002 Started scap: Backport for [[gerrit:857439|specialpage: Silence known violation unsafe RequestContext changes (T323184)]] [22:27:50] T323184: Special page transclusion: PHP Notice: Unexpected clearActionName after getActionName already called - https://phabricator.wikimedia.org/T323184 [22:28:11] !log brennen@deploy1002 brennen and brennen: Backport for [[gerrit:857439|specialpage: Silence known violation unsafe RequestContext changes (T323184)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [22:28:30] (03CR) 10Ladsgroup: [C: 03+2] wikimedia.org portal: Make portal assets also visible in the vhost [puppet] - 10https://gerrit.wikimedia.org/r/857789 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [22:32:06] (03PS4) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 [22:33:35] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:857439|specialpage: Silence known violation unsafe RequestContext changes (T323184)]] (duration: 05m 50s) [22:33:41] T323184: Special page transclusion: PHP Notice: Unexpected clearActionName after getActionName already called - https://phabricator.wikimedia.org/T323184 
[22:35:11] (03PS5) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 [22:36:15] !log train 1.40.0-wmf.10 (T320515) - blocker seems resolved, making one attempt to roll to group1 again. [22:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:20] T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 [22:36:38] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857792 (https://phabricator.wikimedia.org/T320515) [22:36:39] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857792 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot) [22:37:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [22:37:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P40018 and previous config saved to /var/cache/conftool/dbconfig/20221116-223722-ladsgroup.json [22:37:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [22:37:42] (03CR) 10Jbond: redfish: add update commands using the patch method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [22:38:10] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857792 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot) [22:41:18] (03PS1) 10Ladsgroup: mediawiki: Get rid of extract2.php module [puppet] - 10https://gerrit.wikimedia.org/r/857793 [22:42:13] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.10 refs T320515 [22:42:18] T320515: 
1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 [22:43:30] (03PS1) 10Ladsgroup: Get rid of extract2.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857794 [22:43:57] jouncebot: nowandnext [22:43:57] No deployments scheduled for the next 8 hour(s) and 16 minute(s) [22:43:57] In 8 hour(s) and 16 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T0700) [22:44:09] oh noicio [22:44:18] brennen: can I make some fire? [22:44:27] (03CR) 10CI reject: [V: 04-1] redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [22:45:03] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:45:35] (03PS2) 10Ladsgroup: mediawiki: Get rid of extract2.php redirect [puppet] - 10https://gerrit.wikimedia.org/r/857793 [22:46:08] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.10 refs T320515 (duration: 03m 54s) [22:46:41] (03PS6) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 [22:46:57] Amir1: i'm trying to decide whether to roll back again based on number of notices at the moment [22:47:25] let me know once you're done. 
I have no rush, I have to wait for puppet to take effect anyway [22:52:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T323214)', diff saved to https://phabricator.wikimedia.org/P40019 and previous config saved to /var/cache/conftool/dbconfig/20221116-225229-ladsgroup.json [22:52:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:52:36] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [22:52:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:52:52] Amir1: go ahead [22:53:02] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [22:53:46] awesome [22:54:45] (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:54:58] (03CR) 10Ladsgroup: [C: 03+2] Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857788 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [22:55:46] (03Merged) 10jenkins-bot: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857788 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [22:57:09] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:58:06] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [22:58:27] works fine in mwdebug1001, moving forward [22:58:42] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:59:40] Amir1: holler when you're done. i might roll this back out of an abundance of caution before i step afk for the day. [22:59:49] sure [23:01:30] thanks. 
:) [23:02:07] meanwhile: tea. [23:03:52] !log ladsgroup@deploy1002 Synchronized portals/wikipedia.org/assets: (no justification provided) (duration: 03m 49s) [23:04:40] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [23:05:26] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [23:07:12] I'm not done yet but if it's something I can fix to avoid train being stuck, can you tell me? is it the same blocker? [23:07:29] https://phabricator.wikimedia.org/T323184#8401081 [23:07:41] !log ladsgroup@deploy1002 Synchronized portals: (no justification provided) (duration: 03m 48s) [23:07:46] just a lot of noise, i think [23:08:03] but tends to make us nervous about canaries and other things getting lost in error rates. [23:08:17] you makes sense [23:08:20] *yeah [23:08:33] (03PS4) 10Ladsgroup: Add w/api/index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856030 (https://phabricator.wikimedia.org/T273179) [23:08:37] (03CR) 10Ladsgroup: [C: 03+2] Add w/api/index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856030 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [23:09:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856030 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [23:09:31] (03Merged) 10jenkins-bot: Add w/api/index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856030 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [23:09:58] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:856030|Add w/api/index.html (T273179)]] [23:09:59] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:10:03] T273179: Update the front-page of Wikimedia projects - https://phabricator.wikimedia.org/T273179 [23:10:22] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:856030|Add w/api/index.html (T273179)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [23:11:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:12:37] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [23:12:39] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:12:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:13:09] PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:15:24] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:856030|Add w/api/index.html (T273179)]] (duration: 05m 26s) [23:15:29] T273179: Update the front-page of Wikimedia projects - https://phabricator.wikimedia.org/T273179 [23:15:56] brennen: I'm good for now :) [23:16:30] Amir1: cool, thanks. rolling train back to group0 for the moment. 
[23:16:49] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857799 (https://phabricator.wikimedia.org/T320515) [23:16:51] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857799 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot) [23:16:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T318605)', diff saved to https://phabricator.wikimedia.org/P40020 and previous config saved to /var/cache/conftool/dbconfig/20221116-231654-ladsgroup.json [23:16:59] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [23:17:29] (03PS3) 10Ladsgroup: mediawiki: Get rid of extract2.php rewrites [puppet] - 10https://gerrit.wikimedia.org/r/857793 [23:17:34] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857799 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot) [23:20:09] (03PS2) 10Ladsgroup: Get rid of extract2.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857794 (https://phabricator.wikimedia.org/T273179) [23:21:42] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.8 refs T320515 [23:21:47] T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 [23:24:45] (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:25:26] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.8 refs T320515 (duration: 03m 43s) [23:26:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance 
[23:26:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [23:32:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P40021 and previous config saved to /var/cache/conftool/dbconfig/20221116-233200-ladsgroup.json [23:38:41] (03CR) 10Andrew Bogott: [C: 03+2] changeprop: Point Beta Cluster metrics to prometheus-labmon, cloudmetrics1002 is gone [deployment-charts] - 10https://gerrit.wikimedia.org/r/857765 (https://phabricator.wikimedia.org/T297712) (owner: 10Jforrester) [23:39:43] (03CR) 10Andrew Bogott: [C: 03+2] [Beta Cluster] Point statsd service to prometheus-labmon, cloudmetrics1001 decom'ed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857763 (https://phabricator.wikimedia.org/T297712) (owner: 10Jforrester) [23:42:53] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:43:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [23:43:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [23:43:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T323214)', diff saved to https://phabricator.wikimedia.org/P40022 and previous config saved to /var/cache/conftool/dbconfig/20221116-234323-ladsgroup.json [23:43:28] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [23:47:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P40023 and previous config saved to /var/cache/conftool/dbconfig/20221116-234708-ladsgroup.json