[00:03:09] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[00:03:46] <wikibugs>	 SRE, ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (Papaul) Open→Resolved It turned out that the issue that was making the R450 fail during provisioning was 1 - The BIOS was set to UEFI 2 - The Serial communication settings were differen...
[00:05:19] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:05:20] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudmetrics[1001-1002].eqiad.wmnet
[00:05:20] <wikibugs>	 SRE, ops-codfw: Troubleshoot why latest idrac version is not working on Dell servers - https://phabricator.wikimedia.org/T322419 (Papaul) I had a chat with @jbond on IRC; he is looking into this.
[00:06:02] <wikibugs>	 ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Hardware): decommission cloudmetrics100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T297444 (Andrew) a:Andrew→Jclark-ctr
[00:06:26] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2149.codfw.wmnet with reason: Maintenance
[00:06:39] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2149.codfw.wmnet with reason: Maintenance
[00:06:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T321130)', diff saved to https://phabricator.wikimedia.org/P39861 and previous config saved to /var/cache/conftool/dbconfig/20221116-000645-marostegui.json
[00:06:50] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[00:07:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:11:54] <wikibugs>	 (PS4) BCornwall: prometheus: Refactor ATS config monitoring [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815)
[00:14:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2004.codfw.wmnet with OS bullseye
[00:14:12] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2004.codfw.wmnet with OS bullseye
[00:18:42] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:19:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:20:44] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 7 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:21:00] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:26:52] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:29:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2004.codfw.wmnet with reason: host reimage
[00:31:43] <wikibugs>	 (CR) Cwhite: [C: +1] netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[00:32:37] <wikibugs>	 (CR) Cwhite: [C: +1] netmon: Add netmon2002 to the alertmanager rw api [puppet] - https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[00:33:22] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2004.codfw.wmnet with reason: host reimage
[00:40:38] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 131 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:40:41] <wikibugs>	 (PS5) BCornwall: prometheus: Refactor ATS config monitoring [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815)
[00:41:17] <wikibugs>	 (CR) CI reject: [V: -1] prometheus: Refactor ATS config monitoring [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[00:41:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T321130)', diff saved to https://phabricator.wikimedia.org/P39862 and previous config saved to /var/cache/conftool/dbconfig/20221116-004130-marostegui.json
[00:41:36] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[00:42:56] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:43:07] <wikibugs>	 (PS6) BCornwall: prometheus: Refactor ATS config monitoring [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815)
[00:43:43] <wikibugs>	 (CR) CI reject: [V: -1] prometheus: Refactor ATS config monitoring [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[00:45:26] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 151 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:46:17] <wikibugs>	 (PS7) BCornwall: prometheus: Refactor ATS config monitoring [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815)
[00:46:50] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:53:44] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2004.codfw.wmnet with OS bullseye
[00:53:48] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov2004.codfw.wmnet with OS bullseye completed: - dbprov2004 (**WARN**)...
[00:54:08] <icinga-wm>	 RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:56:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P39863 and previous config saved to /var/cache/conftool/dbconfig/20221116-005636-marostegui.json
[00:59:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T318605)', diff saved to https://phabricator.wikimedia.org/P39864 and previous config saved to /var/cache/conftool/dbconfig/20221116-005921-ladsgroup.json
[00:59:26] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[01:03:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance
[01:03:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance
[01:03:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T318605)', diff saved to https://phabricator.wikimedia.org/P39865 and previous config saved to /var/cache/conftool/dbconfig/20221116-010330-ladsgroup.json
[01:06:38] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:06:42] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (Papaul) Open→Resolved The R650 is working fine, no issues to report on my end. The only problem, and I think we already know about it, is that the server has 1 power...
[01:10:30] <wikibugs>	 SRE, ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (Papaul)
[01:11:24] <wikibugs>	 SRE, ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (Papaul)
[01:11:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P39866 and previous config saved to /var/cache/conftool/dbconfig/20221116-011143-marostegui.json
[01:14:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P39867 and previous config saved to /var/cache/conftool/dbconfig/20221116-011427-ladsgroup.json
[01:15:48] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS bullseye
[01:19:10] <icinga-wm>	 PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:23:42] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:23:52] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:26:26] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:26:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T321130)', diff saved to https://phabricator.wikimedia.org/P39869 and previous config saved to /var/cache/conftool/dbconfig/20221116-012649-marostegui.json
[01:26:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2156.codfw.wmnet with reason: Maintenance
[01:26:55] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[01:27:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2156.codfw.wmnet with reason: Maintenance
[01:27:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2094.codfw.wmnet with reason: Maintenance
[01:27:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2094.codfw.wmnet with reason: Maintenance
[01:27:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T321130)', diff saved to https://phabricator.wikimedia.org/P39870 and previous config saved to /var/cache/conftool/dbconfig/20221116-012726-marostegui.json
[01:28:56] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.216 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:29:08] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48975 bytes in 0.225 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:29:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P39871 and previous config saved to /var/cache/conftool/dbconfig/20221116-012934-ladsgroup.json
[01:30:22] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:36:52] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:44] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:27] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2042.codfw.wmnet with OS bullseye
[01:43:37] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS bullseye
[01:44:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T318605)', diff saved to https://phabricator.wikimedia.org/P39872 and previous config saved to /var/cache/conftool/dbconfig/20221116-014441-ladsgroup.json
[01:44:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[01:44:46] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[01:44:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[01:45:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T318605)', diff saved to https://phabricator.wikimedia.org/P39873 and previous config saved to /var/cache/conftool/dbconfig/20221116-014502-ladsgroup.json
[01:47:50] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:48:02] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:52:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T321130)', diff saved to https://phabricator.wikimedia.org/P39874 and previous config saved to /var/cache/conftool/dbconfig/20221116-015223-marostegui.json
[01:52:29] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:02:14] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P39875 and previous config saved to /var/cache/conftool/dbconfig/20221116-020730-marostegui.json
[02:07:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:06] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:26] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2042.codfw.wmnet with OS bullseye
[02:17:45] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:57] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS bullseye
[02:19:58] <icinga-wm>	 RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:22:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P39876 and previous config saved to /var/cache/conftool/dbconfig/20221116-022236-marostegui.json
[02:31:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T318605)', diff saved to https://phabricator.wikimedia.org/P39877 and previous config saved to /var/cache/conftool/dbconfig/20221116-023101-ladsgroup.json
[02:31:07] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[02:37:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T321130)', diff saved to https://phabricator.wikimedia.org/P39878 and previous config saved to /var/cache/conftool/dbconfig/20221116-023743-marostegui.json
[02:37:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2177.codfw.wmnet with reason: Maintenance
[02:37:48] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[02:38:09] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2177.codfw.wmnet with reason: Maintenance
[02:38:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T321130)', diff saved to https://phabricator.wikimedia.org/P39879 and previous config saved to /var/cache/conftool/dbconfig/20221116-023815-marostegui.json
[02:46:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P39880 and previous config saved to /var/cache/conftool/dbconfig/20221116-024608-ladsgroup.json
[02:57:38] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2042.codfw.wmnet with OS bullseye
[03:01:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P39881 and previous config saved to /var/cache/conftool/dbconfig/20221116-030115-ladsgroup.json
[03:12:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T321130)', diff saved to https://phabricator.wikimedia.org/P39882 and previous config saved to /var/cache/conftool/dbconfig/20221116-031230-marostegui.json
[03:12:36] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[03:16:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T318605)', diff saved to https://phabricator.wikimedia.org/P39883 and previous config saved to /var/cache/conftool/dbconfig/20221116-031621-ladsgroup.json
[03:16:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance
[03:16:27] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[03:16:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance
[03:16:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T318605)', diff saved to https://phabricator.wikimedia.org/P39884 and previous config saved to /var/cache/conftool/dbconfig/20221116-031642-ladsgroup.json
[03:21:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T318605)', diff saved to https://phabricator.wikimedia.org/P39885 and previous config saved to /var/cache/conftool/dbconfig/20221116-032111-ladsgroup.json
[03:27:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P39886 and previous config saved to /var/cache/conftool/dbconfig/20221116-032737-marostegui.json
[03:36:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P39887 and previous config saved to /var/cache/conftool/dbconfig/20221116-033617-ladsgroup.json
[03:42:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P39888 and previous config saved to /var/cache/conftool/dbconfig/20221116-034243-marostegui.json
[03:51:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P39889 and previous config saved to /var/cache/conftool/dbconfig/20221116-035124-ladsgroup.json
[03:57:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T321130)', diff saved to https://phabricator.wikimedia.org/P39890 and previous config saved to /var/cache/conftool/dbconfig/20221116-035750-marostegui.json
[03:57:55] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[04:06:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T318605)', diff saved to https://phabricator.wikimedia.org/P39891 and previous config saved to /var/cache/conftool/dbconfig/20221116-040630-ladsgroup.json
[04:06:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[04:06:36] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[04:06:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[04:06:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T318605)', diff saved to https://phabricator.wikimedia.org/P39892 and previous config saved to /var/cache/conftool/dbconfig/20221116-040652-ladsgroup.json
[04:07:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:18:42] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:19:08] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:49:39] <TimStarling>	 !log on mwmaint1002: running storageTypeStats.php on dewiki
[04:49:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T318605)', diff saved to https://phabricator.wikimedia.org/P39893 and previous config saved to /var/cache/conftool/dbconfig/20221116-050354-ladsgroup.json
[05:04:00] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[05:19:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P39894 and previous config saved to /var/cache/conftool/dbconfig/20221116-051901-ladsgroup.json
[05:34:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P39895 and previous config saved to /var/cache/conftool/dbconfig/20221116-053407-ladsgroup.json
[05:37:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T318605)', diff saved to https://phabricator.wikimedia.org/P39896 and previous config saved to /var/cache/conftool/dbconfig/20221116-053734-ladsgroup.json
[05:37:40] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[05:49:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T318605)', diff saved to https://phabricator.wikimedia.org/P39897 and previous config saved to /var/cache/conftool/dbconfig/20221116-054914-ladsgroup.json
[05:49:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance
[05:49:19] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[05:49:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance
[05:49:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T318605)', diff saved to https://phabricator.wikimedia.org/P39898 and previous config saved to /var/cache/conftool/dbconfig/20221116-054935-ladsgroup.json
[05:52:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P39899 and previous config saved to /var/cache/conftool/dbconfig/20221116-055241-ladsgroup.json
[06:07:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P39900 and previous config saved to /var/cache/conftool/dbconfig/20221116-060747-ladsgroup.json
[06:19:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:22:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T318605)', diff saved to https://phabricator.wikimedia.org/P39901 and previous config saved to /var/cache/conftool/dbconfig/20221116-062253-ladsgroup.json
[06:22:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[06:22:59] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[06:23:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[06:23:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T318605)', diff saved to https://phabricator.wikimedia.org/P39902 and previous config saved to /var/cache/conftool/dbconfig/20221116-062315-ladsgroup.json
[06:31:42] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:37] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2107.codfw.wmnet with reason: Maintenance
[06:34:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2107.codfw.wmnet with reason: Maintenance
[06:35:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[06:35:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[06:36:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2127.codfw.wmnet with reason: Maintenance
[06:36:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2127.codfw.wmnet with reason: Maintenance
[06:46:46] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[06:48:42] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[06:55:56] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:01:42] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T318605)', diff saved to https://phabricator.wikimedia.org/P39903 and previous config saved to /var/cache/conftool/dbconfig/20221116-071420-ladsgroup.json
[07:14:26] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[07:29:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P39904 and previous config saved to /var/cache/conftool/dbconfig/20221116-072926-ladsgroup.json
[07:44:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P39905 and previous config saved to /var/cache/conftool/dbconfig/20221116-074433-ladsgroup.json
[07:52:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T318605)', diff saved to https://phabricator.wikimedia.org/P39906 and previous config saved to /var/cache/conftool/dbconfig/20221116-075204-ladsgroup.json
[07:52:09] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[07:57:58] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: connect to address 208.80.154.31 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:58:12] <icinga-wm>	 PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,wmf_auto_restart_apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:58:34] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: connect to address 208.80.154.31 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:58:42] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: connect to address 208.80.154.31 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:58:48] <icinga-wm>	 PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:59:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T318605)', diff saved to https://phabricator.wikimedia.org/P39907 and previous config saved to /var/cache/conftool/dbconfig/20221116-075940-ladsgroup.json
[07:59:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[07:59:45] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[07:59:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[08:00:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39908 and previous config saved to /var/cache/conftool/dbconfig/20221116-080001-ladsgroup.json
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T0800). Please do the needful.
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:00:23] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [debs/prometheus-logstash-exporter] - https://gerrit.wikimedia.org/r/857049 (https://phabricator.wikimedia.org/T321410) (owner: Cwhite)
[08:01:33] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/856612 (https://phabricator.wikimedia.org/T313229) (owner: Herron)
[08:02:38] <wikibugs>	 (CR) Filippo Giunchedi: "I'll let Cole vote but LGTM from a quick look" [puppet] - https://gerrit.wikimedia.org/r/855719 (https://phabricator.wikimedia.org/T319020) (owner: Ryan Kemper)
[08:07:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P39909 and previous config saved to /var/cache/conftool/dbconfig/20221116-080710-ladsgroup.json
[08:07:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:19:05] <wikibugs>	 (PS1) Matthias Mullie: Ensure array is passed to getProperties [extensions/PageImages] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/857426 (https://phabricator.wikimedia.org/T323152)
[08:21:56] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:22:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P39910 and previous config saved to /var/cache/conftool/dbconfig/20221116-082217-ladsgroup.json
[08:27:48] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:28] <wikibugs>	 (CR) CI reject: [V: -1] Ensure array is passed to getProperties [extensions/PageImages] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/857426 (https://phabricator.wikimedia.org/T323152) (owner: Matthias Mullie)
[08:35:59] <wikibugs>	 (PS2) Matthias Mullie: Ensure array is passed to getProperties [extensions/PageImages] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/857426 (https://phabricator.wikimedia.org/T323152)
[08:36:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1022.eqiad.wmnet
[08:37:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T318605)', diff saved to https://phabricator.wikimedia.org/P39911 and previous config saved to /var/cache/conftool/dbconfig/20221116-083723-ladsgroup.json
[08:37:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[08:37:28] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[08:37:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[08:45:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1022.eqiad.wmnet
[08:45:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1022.eqiad.wmnet to cluster eqiad and group D
[08:47:10] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1022.eqiad.wmnet to cluster eqiad and group D
[08:51:04] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.48.122:9042 on aqs1019 is OK: TCP OK - 0.001 second response time on 10.64.48.122 port 9042 https://phabricator.wikimedia.org/T93886
[09:13:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:13:07] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45899
[09:13:56] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45899
[09:16:00] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 30844
[09:16:31] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 30844
[09:16:53] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff)
[09:17:26] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 293
[09:18:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:18:24] <wikibugs>	 (PS3) Filippo Giunchedi: pki: move root common settings to profile [puppet] - https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163)
[09:18:26] <wikibugs>	 (PS3) Filippo Giunchedi: pontoon: copy out the root pki ca [puppet] - https://gerrit.wikimedia.org/r/857006 (https://phabricator.wikimedia.org/T319163)
[09:18:28] <wikibugs>	 (PS3) Filippo Giunchedi: pontoon: install Puppet and PKI CAs as certificates [puppet] - https://gerrit.wikimedia.org/r/857007 (https://phabricator.wikimedia.org/T319163)
[09:18:30] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: serve public pki certs via fileserver [puppet] - https://gerrit.wikimedia.org/r/857475 (https://phabricator.wikimedia.org/T319163)
[09:18:47] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 293
[09:18:49] <wikibugs>	 (PS2) Muehlenhoff: Add Cumin alias for dispatch [puppet] - https://gerrit.wikimedia.org/r/857015
[09:22:15] <wikibugs>	 (PS1) Elukey: turnilo: add cache_status to webrequest_live_sampled [puppet] - https://gerrit.wikimedia.org/r/857476 (https://phabricator.wikimedia.org/T314981)
[09:22:36] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:07] <wikibugs>	 (PS1) Cathal Mooney: Change get_underlay_ints() to use Netbox VRF field for filtering [software/homer/deploy] - https://gerrit.wikimedia.org/r/857477 (https://phabricator.wikimedia.org/T312635)
[09:26:38] <wikibugs>	 (PS1) Cathal Mooney: Add OSPF automation template for EVPN switches [homer/public] - https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635)
[09:28:26] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39912 and previous config saved to /var/cache/conftool/dbconfig/20221116-093112-ladsgroup.json
[09:31:17] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[09:34:25] <wikibugs>	 (CR) FNegri: [C: +1] "LGTM, which command was failing before this patch?" [puppet] - https://gerrit.wikimedia.org/r/857073 (https://phabricator.wikimedia.org/T301949) (owner: Andrew Bogott)
[09:34:48] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (cmooney) Just to update in terms of the LVS connections.  After discussing with Brandon I thought it best if the links from all 4 LVS terminate on diff...
[09:37:36] <wikibugs>	 (PS2) Cathal Mooney: Add OSPF automation template for EVPN switches [homer/public] - https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635)
[09:38:16] <wikibugs>	 (PS3) Cathal Mooney: Add OSPF automation template for EVPN switches [homer/public] - https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635)
[09:46:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P39913 and previous config saved to /var/cache/conftool/dbconfig/20221116-094618-ladsgroup.json
[09:46:42] <wikibugs>	 (CR) Vgutierrez: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[09:47:09] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (jcrespo) > I think we know already about it is that the server has 1 power supply on the left and the other one on the right  Please be sure to comment it with @RobH so h...
[09:48:45] <wikibugs>	 (PS1) Hashar: gerrit: remove Gerrit 3.5 obsolete @apply css statement [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/857499 (https://phabricator.wikimedia.org/T315445)
[09:48:56] <wikibugs>	 (CR) JMeybohm: [C: +2] k8s: Add a central ipv6dualstack flag to enable dual stack [puppet] - https://gerrit.wikimedia.org/r/856589 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:48:59] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] k8s: Fix duplicate definition of --service-account-key-file [puppet] - https://gerrit.wikimedia.org/r/857004 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:49:18] <wikibugs>	 (Abandoned) Hashar: gerrit: remove Gerrit 3.5 obsolete @apply css statement [puppet] - https://gerrit.wikimedia.org/r/824222 (https://phabricator.wikimedia.org/T315445) (owner: Hashar)
[09:49:26] <wikibugs>	 (CR) Hashar: [C: +2] gerrit: remove Gerrit 3.5 obsolete @apply css statement [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/857499 (https://phabricator.wikimedia.org/T315445) (owner: Hashar)
[09:49:54] <wikibugs>	 (Merged) jenkins-bot: gerrit: remove Gerrit 3.5 obsolete @apply css statement [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/857499 (https://phabricator.wikimedia.org/T315445) (owner: Hashar)
[09:50:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] turnilo: add cache_status to webrequest_live_sampled [puppet] - https://gerrit.wikimedia.org/r/857476 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[09:56:11] <wikibugs>	 (PS1) Effie Mouzeli: maps: enable postres replication slots in eqiad [puppet] - https://gerrit.wikimedia.org/r/857505 (https://phabricator.wikimedia.org/T290149)
[09:56:24] <wikibugs>	 (CR) Muehlenhoff: "Looks good! A few remaining nits/typos and one suggestion for an additional test case (but we can also simply add that to a subsequent rel" [software/bitu-ldap] - https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: Slyngshede)
[09:59:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[09:59:59] <taavi>	 MatmaRex: fyi, the script is still running, currently on commonswiki (Processed 4012200 (updated 230542) of 118789703 rows)
[10:00:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[10:00:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T318605)', diff saved to https://phabricator.wikimedia.org/P39914 and previous config saved to /var/cache/conftool/dbconfig/20221116-100027-ladsgroup.json
[10:00:32] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[10:00:33] <MatmaRex>	 taavi: yep, thank you
[10:01:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P39915 and previous config saved to /var/cache/conftool/dbconfig/20221116-100125-ladsgroup.json
[10:02:19] <wikibugs>	 Puppet, Cloud-VPS, Infrastructure-Foundations, puppet-compiler, and 3 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (jbond)
[10:03:09] <wikibugs>	 Puppet, Cloud Services Proposals, Cloud-VPS, Infrastructure-Foundations, and 3 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (jbond)
[10:03:39] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/857014 (owner: Muehlenhoff)
[10:04:12] <wikibugs>	 Puppet, Cloud-VPS, Infrastructure-Foundations, puppet-compiler, and 3 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (jbond) In progress→Resolved With the basic selector announced yesterday i think we have all actions complete so will re...
[10:04:29] <wikibugs>	 (PS2) Effie Mouzeli: maps: enable postres replication slots in eqiad [puppet] - https://gerrit.wikimedia.org/r/857505 (https://phabricator.wikimedia.org/T290149)
[10:05:49] <logmsgbot>	 !log kevinbazira@deploy1002 Started deploy [ores/deploy@0114799]: T319373
[10:05:54] <stashbot>	 T319373: Deploy new fawiki articlequality model to ORES and LiftWing - https://phabricator.wikimedia.org/T319373
[10:06:22] <wikibugs>	 (PS2) Muehlenhoff: Extend cloudbackup Cumin alias [puppet] - https://gerrit.wikimedia.org/r/857014
[10:11:56] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Extend cloudbackup Cumin alias [puppet] - https://gerrit.wikimedia.org/r/857014 (owner: Muehlenhoff)
[10:14:54] <wikibugs>	 (CR) Effie Mouzeli: [V: -1] "PCC Fails https://puppet-compiler.wmflabs.org/output/857505/38220/" [puppet] - https://gerrit.wikimedia.org/r/857505 (https://phabricator.wikimedia.org/T290149) (owner: Effie Mouzeli)
[10:16:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39916 and previous config saved to /var/cache/conftool/dbconfig/20221116-101631-ladsgroup.json
[10:16:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[10:16:37] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[10:16:41] <logmsgbot>	 !log kevinbazira@deploy1002 Finished deploy [ores/deploy@0114799]: T319373 (duration: 10m 51s)
[10:16:45] <stashbot>	 T319373: Deploy new fawiki articlequality model to ORES and LiftWing - https://phabricator.wikimedia.org/T319373
[10:16:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[10:16:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39917 and previous config saved to /var/cache/conftool/dbconfig/20221116-101653-ladsgroup.json
[10:17:57] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:30] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: add benthos jobs [puppet] - https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214)
[10:22:06] <wikibugs>	 (CR) CI reject: [V: -1] prometheus: add benthos jobs [puppet] - https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi)
[10:23:49] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:23:59] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: add benthos jobs [puppet] - https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214)
[10:24:35] <wikibugs>	 (CR) CI reject: [V: -1] prometheus: add benthos jobs [puppet] - https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi)
[10:25:25] <wikibugs>	 (PS7) Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[10:28:49] <wikibugs>	 (CR) Hnowlan: "lgtm, one nit" [puppet] - https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) (owner: Effie Mouzeli)
[10:29:34] <jynus>	 !log restarting apache on lists.wm.o
[10:29:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:06] <urbanecm>	 !log Run `mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php` for all wikis in growthexperiments.dblist at mwmaint1002 (T318457)
[10:30:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:10] <stashbot>	 T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457
[10:30:14] <wikibugs>	 (CR) Hnowlan: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) (owner: Effie Mouzeli)
[10:30:29] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38221/console" [puppet] - https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi)
[10:31:09] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.335 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:31:43] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:32:14] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: default to valid external url [puppet] - https://gerrit.wikimedia.org/r/857522 (https://phabricator.wikimedia.org/T301944)
[10:32:23] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:32:31] <icinga-wm>	 RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2022-12-22 06:15:55 +0000 (expires in 35 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:33:15] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi)
[10:36:25] <icinga-wm>	 RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:36:32] <wikibugs>	 (CR) Ladsgroup: Add Cumin alias for orchestrator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/857017 (owner: Muehlenhoff)
[10:43:33] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (Abhas)
[10:47:52] <wikibugs>	 (CR) JMeybohm: [C: +1] pontoon: copy out the root pki ca [puppet] - https://gerrit.wikimedia.org/r/857006 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi)
[10:49:08] <wikibugs>	 (CR) JMeybohm: [C: +1] pontoon: install Puppet and PKI CAs as certificates [puppet] - https://gerrit.wikimedia.org/r/857007 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi)
[10:49:40] <wikibugs>	 (CR) JMeybohm: [C: +1] pontoon: serve public pki certs via fileserver [puppet] - https://gerrit.wikimedia.org/r/857475 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi)
[10:51:07] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: copy out the root pki ca [puppet] - https://gerrit.wikimedia.org/r/857006 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi)
[10:51:16] <wikibugs>	 (PS4) Filippo Giunchedi: pontoon: copy out the root pki ca [puppet] - https://gerrit.wikimedia.org/r/857006 (https://phabricator.wikimedia.org/T319163)
[10:51:21] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2] pontoon: copy out the root pki ca [puppet] - https://gerrit.wikimedia.org/r/857006 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi)
[10:51:31] <wikibugs>	 (CR) JMeybohm: [C: +1] pki: move root common settings to profile [puppet] - https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi)
[10:51:44] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: install Puppet and PKI CAs as certificates [puppet] - https://gerrit.wikimedia.org/r/857007 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi)
[10:51:54] <wikibugs>	 (PS4) Filippo Giunchedi: pontoon: install Puppet and PKI CAs as certificates [puppet] - https://gerrit.wikimedia.org/r/857007 (https://phabricator.wikimedia.org/T319163)
[10:51:56] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2] pontoon: install Puppet and PKI CAs as certificates [puppet] - https://gerrit.wikimedia.org/r/857007 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi)
[10:52:11] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: serve public pki certs via fileserver [puppet] - https://gerrit.wikimedia.org/r/857475 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi)
[10:53:15] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:19] <wikibugs>	 SRE, Wikimedia-Mailing-lists: lists apache config change should trigger an apache restart - https://phabricator.wikimedia.org/T323208 (Ladsgroup)
[10:57:01] <wikibugs>	 (PS1) Phedenskog: Update phedenskogs keys. [puppet] - https://gerrit.wikimedia.org/r/857529
[10:59:11] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:59:57] <wikibugs>	 SRE, Wikimedia-Mailing-lists: lists apache config change should trigger an apache restart - https://phabricator.wikimedia.org/T323208 (Ladsgroup) p:Triage→High
[11:03:21] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Wikimedia-Incident: lists apache config change should trigger an apache restart - https://phabricator.wikimedia.org/T323208 (jcrespo)
[11:06:05] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Wikimedia-Incident: lists apache config change should trigger an apache restart - https://phabricator.wikimedia.org/T323208 (jcrespo) I am marking this as an incident, as lists were down for around 2.5h. Although it could also be considered an #wikimedia-incident-actiona...
[11:10:03] <wikibugs>	 (CR) Elukey: [C: +1] prometheus: add benthos jobs [puppet] - https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi)
[11:11:38] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Wikimedia-Incident: lists apache config change should trigger an apache restart - https://phabricator.wikimedia.org/T323208 (Vgutierrez) hmmm that would trigger a few seconds of downtime every time that Apache is restarted automatically by puppet
[11:12:05] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:13:34] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/856969/38222/" [puppet] - https://gerrit.wikimedia.org/r/856969 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[11:13:54] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/856969 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[11:14:05] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (jcrespo)
[11:14:15] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1001.eqiad.wmnet with OS bullseye
[11:14:25] <wikibugs>	 (CR) JMeybohm: [C: +1] istio: change configs to adapt for 1.15.3 [deployment-charts] - https://gerrit.wikimedia.org/r/855967 (https://phabricator.wikimedia.org/T322193) (owner: Elukey)
[11:14:36] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (jcrespo) > hmmm that would trigger a few seconds of downtime every time that Apache is restarted automatically by puppet  I believe the updated tit...
[11:14:51] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] cloudgw1001: prepare for reimage into the new vlan NIC name with a single NIC [puppet] - https://gerrit.wikimedia.org/r/856969 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[11:17:50] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: add benthos jobs [puppet] - https://gerrit.wikimedia.org/r/857519 (https://phabricator.wikimedia.org/T319214) (owner: Filippo Giunchedi)
[11:26:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:26:57] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage
[11:27:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[11:27:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[11:29:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[11:29:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[11:31:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T318605)', diff saved to https://phabricator.wikimedia.org/P39918 and previous config saved to /var/cache/conftool/dbconfig/20221116-113108-ladsgroup.json
[11:31:14] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[11:31:23] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage
[11:31:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:33:02] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:34:59] <wikibugs>	 (03PS1) 10Filippo Giunchedi: benthos: fix service name [puppet] - 10https://gerrit.wikimedia.org/r/857544 (https://phabricator.wikimedia.org/T319214)
[11:35:01] <wikibugs>	 (03PS1) 10Filippo Giunchedi: benthos: reload on config changes [puppet] - 10https://gerrit.wikimedia.org/r/857545 (https://phabricator.wikimedia.org/T319214)
[11:36:43] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] ceph.roll_restart_*daemons: allow ignoring current health issues (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro)
[11:39:27] <wikibugs>	 (03PS5) 10Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149)
[11:40:06] <wikibugs>	 (03PS1) 10Muehlenhoff: buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/857547
[11:40:54] <wikibugs>	 (03PS2) 10Muehlenhoff: buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/857547
[11:40:58] <wikibugs>	 (03PS6) 10Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149)
[11:45:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/857547 (owner: 10Muehlenhoff)
[11:45:51] <wikibugs>	 (03CR) 10Effie Mouzeli: maps: add support for replication slots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[11:46:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P39919 and previous config saved to /var/cache/conftool/dbconfig/20221116-114615-ladsgroup.json
[11:46:38] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1001.eqiad.wmnet with OS bullseye
[11:49:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39920 and previous config saved to /var/cache/conftool/dbconfig/20221116-114921-ladsgroup.json
[11:49:27] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[11:53:19] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw1002: move to the single-NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/857557 (https://phabricator.wikimedia.org/T319184)
[11:57:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero)
[12:00:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected https://puppet-compiler.wmflabs.org/output/857557/38223/" [puppet] - 10https://gerrit.wikimedia.org/r/857557 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[12:01:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P39921 and previous config saved to /var/cache/conftool/dbconfig/20221116-120122-ladsgroup.json
[12:04:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P39922 and previous config saved to /var/cache/conftool/dbconfig/20221116-120428-ladsgroup.json
[12:06:44] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: cleanup unused code to support multiple NICs [puppet] - 10https://gerrit.wikimedia.org/r/857560 (https://phabricator.wikimedia.org/T319184)
[12:06:57] <wikibugs>	 (03PS5) 10Slyngshede: Initial checkin [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595)
[12:07:08] <wikibugs>	 (03CR) 10Slyngshede: Initial checkin (033 comments) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede)
[12:07:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:07:34] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS bullseye
[12:10:51] <wikibugs>	 (03PS1) 10Effie Mouzeli: C:postgres::master: add support for multiple replicas [puppet] - 10https://gerrit.wikimedia.org/r/857561
[12:11:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Pull in the fdisk-udeb in d-i [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309)
[12:11:15] <wikibugs>	 (03PS2) 10Muehlenhoff: Pull in the fdisk-udeb in d-i [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309)
[12:13:51] <wikibugs>	 (03CR) 10Ssingh: Pull in the fdisk-udeb in d-i (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309) (owner: 10Muehlenhoff)
[12:14:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover idp.w.p to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/857563 (https://phabricator.wikimedia.org/T311235)
[12:16:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T318605)', diff saved to https://phabricator.wikimedia.org/P39923 and previous config saved to /var/cache/conftool/dbconfig/20221116-121628-ladsgroup.json
[12:16:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[12:16:34] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[12:16:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[12:17:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T318605)', diff saved to https://phabricator.wikimedia.org/P39924 and previous config saved to /var/cache/conftool/dbconfig/20221116-121701-ladsgroup.json
[12:18:16] <wikibugs>	 (03PS15) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[12:18:37] <wikibugs>	 (03CR) 10Hokwelum: "Thank you, It looks good. But we haven’t tested!" [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn)
[12:19:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10ssingh) Adding to this task in case it helps someone else; thanks to @fgiunchedi and @jcrespo for documenting the original findings.  We ran into the same issue (PXE boot works f...
[12:19:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P39925 and previous config saved to /var/cache/conftool/dbconfig/20221116-121934-ladsgroup.json
[12:20:28] <wikibugs>	 (03CR) 10Ssingh: "Adding that I did anna-install fdisk-udeb on the cp host cp2042 and earlier I was getting:" [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309) (owner: 10Muehlenhoff)
[12:21:30] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/857557 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[12:21:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] cloudgw1002: move to the single-NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/857557 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[12:22:53] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2042.codfw.wmnet with reason: host reimage
[12:23:16] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1002.eqiad.wmnet with OS bullseye
[12:23:39] <wikibugs>	 (03CR) 10Muehlenhoff: Pull in the fdisk-udeb in d-i (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309) (owner: 10Muehlenhoff)
[12:23:51] <wikibugs>	 (03PS3) 10Muehlenhoff: Pull in the fdisk-udeb in d-i [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309)
[12:25:10] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Looks good and thanks for submitting the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309) (owner: 10Muehlenhoff)
[12:25:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero)
[12:26:18] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2042.codfw.wmnet with reason: host reimage
[12:26:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:27:36] <Lucas_WMDE>	 sukhe: that last message didn’t get logged because stashbot quit, fyi
[12:27:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10Volans) @ssingh there is the [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/hardware/upgrade-firmware.py | sre.hardware....
[12:27:43] <Lucas_WMDE>	 (see also #wikimedia-cloud)
[12:27:45] <wikibugs>	 (03CR) 10Jbond: "LGTM but see nit/suggestion" [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff)
[12:29:11] <wikibugs>	 (03PS1) 10Vgutierrez: varnish::tests: Update PCC URL regex [puppet] - 10https://gerrit.wikimedia.org/r/857572
[12:29:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected https://puppet-compiler.wmflabs.org/output/857560/38229/" [puppet] - 10https://gerrit.wikimedia.org/r/857560 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[12:31:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] varnish::tests: Update PCC URL regex [puppet] - 10https://gerrit.wikimedia.org/r/857572 (owner: 10Vgutierrez)
[12:31:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-jbond: puppet-catalog-compiler: compilation result randomly places servers in the wrong section - https://phabricator.wikimedia.org/T224977 (10jbond) 05Open→03Resolved a:03jbond I'm hoping this is resolved with the 2.5.0 release please re...
[12:31:08] <wikibugs>	 (03CR) 10Volans: "Question and nit inline, LGTM otherwise" [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff)
[12:34:16] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10Infrastructure-Foundations, 10puppet-compiler, 10Jenkins: compiler1002.puppet-diffs.eqiad.wmflabs disk is full - https://phabricator.wikimedia.org/T222072 (10jbond)
[12:34:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-herron, 10User-jbond: Prevent puppet catalog compiler workers from running out of disk space - https://phabricator.wikimedia.org/T222075 (10jbond) 05Open→03Resolved a:03jbond I have now performed the following actions  * move all reports...
[12:34:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39926 and previous config saved to /var/cache/conftool/dbconfig/20221116-123441-ladsgroup.json
[12:34:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[12:34:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[12:35:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T318605)', diff saved to https://phabricator.wikimedia.org/P39927 and previous config saved to /var/cache/conftool/dbconfig/20221116-123502-ladsgroup.json
[12:35:42] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: host reimage
[12:38:25] <wikibugs>	 (03PS2) 10Effie Mouzeli: C:postgresql::master: add support for multiple replicas [puppet] - 10https://gerrit.wikimedia.org/r/857561
[12:38:38] <wikibugs>	 (03CR) 10Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff)
[12:39:02] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: host reimage
[12:39:50] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-jbond: jenkins-bot puppet-compiler-test may report SUCCESS though compiling failed - https://phabricator.wikimedia.org/T214629 (10jbond) 05Open→03Resolved a:03jbond I believe this is now fixed but please re-open if you are still seeing th...
[12:39:59] <wikibugs>	 (03CR) 10Effie Mouzeli: "PCC is NOOP except puppetdb1002 https://puppet-compiler.wmflabs.org/output/857561/38228/" [puppet] - 10https://gerrit.wikimedia.org/r/857561 (owner: 10Effie Mouzeli)
[12:40:18] <wikibugs>	 (03Abandoned) 10Jbond: DO NOt MEREGE: change to demon new reporting in pcc [puppet] - 10https://gerrit.wikimedia.org/r/857031 (owner: 10Jbond)
[12:42:17] <wikibugs>	 (03PS2) 10Majavah: P:pontoon: include firewall rules to allow metricsinfra scraping [puppet] - 10https://gerrit.wikimedia.org/r/857023
[12:42:55] <wikibugs>	 (03CR) 10Majavah: P:pontoon: include firewall rules to allow metricsinfra scraping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857023 (owner: 10Majavah)
[12:43:57] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38234/console" [puppet] - 10https://gerrit.wikimedia.org/r/857023 (owner: 10Majavah)
[12:45:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] P:pontoon: include firewall rules to allow metricsinfra scraping [puppet] - 10https://gerrit.wikimedia.org/r/857023 (owner: 10Majavah)
[12:46:23] <wikibugs>	 (03PS4) 10Jbond: pki: move root common settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[12:47:09] <wikibugs>	 (03PS5) 10Jbond: pki: move root common settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[12:47:11] <Lucas_WMDE>	 stashbot’s back, sukhe Amir1 and arturo might want to re-log a few messages if I’m reading the channel log correctly
[12:47:12] <stashbot>	 See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[12:47:37] <Amir1>	 mine is fully automated, I have no clue what's happening
[12:47:43] <sukhe>	 Lucas_WMDE: thanks, mine were not important but I appreciate the ping
[12:47:51] <Lucas_WMDE>	 ok
[12:48:00] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38236/console" [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[12:48:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[12:49:16] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2042.codfw.wmnet with OS bullseye
[12:49:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] pki: move root common settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[12:50:33] <wikibugs>	 (03PS2) 10Filippo Giunchedi: pontoon: serve public pki certs via fileserver [puppet] - 10https://gerrit.wikimedia.org/r/857475 (https://phabricator.wikimedia.org/T319163)
[12:50:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2] pontoon: serve public pki certs via fileserver [puppet] - 10https://gerrit.wikimedia.org/r/857475 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[12:52:13] <wikibugs>	 (03PS6) 10Slyngshede: Initial checkin [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595)
[12:52:49] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10jcrespo) Hi, Moritz,  I am seeing a couple of non-fatal errors on ganeti. I wonder if they could be artifacts of the bullseye upgrade (in particular, of a ganeti upgrade), as I don't...
[12:54:14] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1002.eqiad.wmnet with OS bullseye
[12:55:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/857563 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff)
[12:56:21] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) >>! In T311687#8399383, @jcrespo wrote: > Hi, Moritz, >  > I am seeing a couple of non-fatal errors on ganeti. I wonder if they could be artifacts of the bullseye...
[12:57:47] <wikibugs>	 (03CR) 10Jbond: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff)
[12:58:32] <wikibugs>	 (03CR) 10Muehlenhoff: "One more comment inline" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede)
[12:58:53] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10jcrespo) Ah, so you mean they are temporary during the maintenance, and won't happen once all migrations are done? Then please keep the good work :-P
[12:59:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM this is also a noop on puppetdb as its only a title change" [puppet] - 10https://gerrit.wikimedia.org/r/857561 (owner: 10Effie Mouzeli)
[12:59:26] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/857560 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[12:59:41] <wikibugs>	 (03PS7) 10Slyngshede: Initial checkin [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595)
[12:59:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] cloudgw: cleanup unused code to support multiple NICs [puppet] - 10https://gerrit.wikimedia.org/r/857560 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[12:59:56] <wikibugs>	 (03CR) 10Slyngshede: Initial checkin (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede)
[13:00:06] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+1] turnilo: add cache_status to webrequest_live_sampled [puppet] - 10https://gerrit.wikimedia.org/r/857476 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey)
[13:01:15] <wikibugs>	 (03PS2) 10Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996
[13:01:17] <wikibugs>	 (03PS2) 10Vgutierrez: varnish::tests: Update PCC URL regex [puppet] - 10https://gerrit.wikimedia.org/r/857572
[13:01:25] <wikibugs>	 (03CR) 10Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff)
[13:05:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff)
[13:05:50] <wikibugs>	 (03PS1) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645)
[13:06:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "curious why you didn't also propose a similar change for hieradata/role/common/pki/multirootca.yaml?" [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[13:06:07] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:06:57] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] varnish::tests: Update PCC URL regex [puppet] - 10https://gerrit.wikimedia.org/r/857572 (owner: 10Vgutierrez)
[13:08:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, ship it :-)" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede)
[13:10:11] <wikibugs>	 (03PS3) 10Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996
[13:12:34] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) >>! In T311687#8399396, @jcrespo wrote: > Ah, so you mean they are temporary during the maintenance, and won't happen once all migrations are done?   Indeed, those...
[13:12:40] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) Hello, @Atripathi. Privileged access to LDAP is provided to people according to certain rules and needs. I hope this doesn't sound disrespectful, but I am not sure who is the requester (this...
[13:12:59] <wikibugs>	 10SRE-Access-Requests, 10Data-Engineering: Add shell username ntsako to archiva-deployers - https://phabricator.wikimedia.org/T323213 (10BTullis) p:05Triage→03Medium a:03BTullis I'm adding the #sre-access-requests tag for visibility, but I'll carry out this work
[13:13:38] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2] Initial checkin [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede)
[13:13:43] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Initial checkin [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede)
[13:14:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Identity Management System for Wikimedia developer accounts - https://phabricator.wikimedia.org/T315867 (10SLyngshede-WMF)
[13:14:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10LDAP, 10Patch-For-Review: New Python base layer to manage users/groups in LDAP - https://phabricator.wikimedia.org/T313595 (10SLyngshede-WMF) 05Open→03Resolved
[13:14:55] <wikibugs>	 (03PS1) 10Cathal Mooney: Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635)
[13:15:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Identity Management System for Wikimedia developer accounts - https://phabricator.wikimedia.org/T315867 (10SLyngshede-WMF) a:03SLyngshede-WMF
[13:16:18] <wikibugs>	 (03PS2) 10Cathal Mooney: Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635)
[13:16:41] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:17:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Pull in the fdisk-udeb in d-i [puppet] - 10https://gerrit.wikimedia.org/r/857562 (https://phabricator.wikimedia.org/T321309) (owner: 10Muehlenhoff)
[13:19:28] <wikibugs>	 10SRE-Access-Requests, 10Data-Engineering: Add shell username ntsako to archiva-deployers - https://phabricator.wikimedia.org/T323213 (10BTullis) Hi @ntsako - I've added you to that group now. You should be able to deploy to archiva and verify your group membership here: https://ldap.toolforge.org/group/archiv...
[13:19:44] <wikibugs>	 10SRE-Access-Requests, 10Data-Engineering: Add shell username ntsako to archiva-deployers - https://phabricator.wikimedia.org/T323213 (10BTullis) 05Open→03Resolved
[13:20:37] <wikibugs>	 (03PS1) 10Ladsgroup: Add 2022/fix_flaggedrevs_unsigned_T323214.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/857594 (https://phabricator.wikimedia.org/T323214)
[13:21:01] <wikibugs>	 10SRE-Access-Requests, 10Data-Engineering: Add shell username ntsako to archiva-deployers - https://phabricator.wikimedia.org/T323213 (10ntsako) Thank you for the prompt assistance @BTullis
[13:21:53] <wikibugs>	 (03CR) 10Muehlenhoff: Add Cumin alias for orchestrator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857017 (owner: 10Muehlenhoff)
[13:24:05] <wikibugs>	 (03PS1) 10Cathal Mooney: Unify routing-intstance config across JunOS devices [homer/public] - 10https://gerrit.wikimedia.org/r/857598 (https://phabricator.wikimedia.org/T312635)
[13:25:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:25:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:25:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T323214)', diff saved to https://phabricator.wikimedia.org/P39928 and previous config saved to /var/cache/conftool/dbconfig/20221116-132531-ladsgroup.json
[13:25:36] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[13:28:41] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) p:05Triage→03High
[13:29:41] <icinga-wm>	 PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:33:52] <wikibugs>	 (03PS3) 10Cathal Mooney: Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635)
[13:33:57] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:35:06] <wikibugs>	 (03PS2) 10Ladsgroup: Add 2022/fix_flaggedrevs_unsigned_T323214.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/857594 (https://phabricator.wikimedia.org/T323214)
[13:36:27] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10jbond) Any changes to apache config files [[ https://github.com/wikimedia/puppet/blob/production/modules/httpd/manifests/conf.pp#L83 | should cause...
[13:39:08] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10Abhas) Hi Jaime,  I'm the Disinformation Manager in the Trust & Safety team, and my team consumes data from a dashboard built on Superset. It is for access to the dashboard that I'm requesting LDAP a...
[13:39:54] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10MoritzMuehlenhoff) We could hook in a call "apachectl configtest" and alert if that fails (e.g. by sending a root mail or similar)?
[13:45:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T318605)', diff saved to https://phabricator.wikimedia.org/P39929 and previous config saved to /var/cache/conftool/dbconfig/20221116-134543-ladsgroup.json
[13:45:49] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[13:52:00] <wikibugs>	 (03PS1) 10Filippo Giunchedi: benthos: apply batching to webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214)
[13:55:02] <Emperor>	 !log set thanos ring replicas to 3.20 T311690
[13:55:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:07] <stashbot>	 T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690
[13:56:15] <wikibugs>	 (03PS1) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621
[13:57:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (owner: 10Dbrant)
[13:59:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T323214)', diff saved to https://phabricator.wikimedia.org/P39930 and previous config saved to /var/cache/conftool/dbconfig/20221116-135929-ladsgroup.json
[13:59:34] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T1400).
[14:00:04] <jouncebot>	 matthiasmullie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:09] <Lucas_WMDE>	 o/
[14:00:09] <matthiasmullie>	 o/
[14:00:25] <urbanecm>	 o/
[14:00:32] <Lucas_WMDE>	 matthiasmullie: do you want to self-service?
[14:00:38] <matthiasmullie>	 yeah sure!
[14:00:42] <Lucas_WMDE>	 ok!
[14:00:50] <Lucas_WMDE>	 (I looked at the patch earlier and it looked good to me)
[14:00:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P39931 and previous config saved to /var/cache/conftool/dbconfig/20221116-140050-ladsgroup.json
[14:01:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [extensions/PageImages] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857426 (https://phabricator.wikimedia.org/T323152) (owner: 10Matthias Mullie)
[14:02:20] <matthiasmullie>	 Thanks for that!
[14:02:23] <matthiasmullie>	 Starting
[14:03:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T318605)', diff saved to https://phabricator.wikimedia.org/P39932 and previous config saved to /var/cache/conftool/dbconfig/20221116-140345-ladsgroup.json
[14:03:50] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[14:04:18] <wikibugs>	 (03PS1) 10BBlack: Update check_fresh_files_in_dir for python3 [puppet] - 10https://gerrit.wikimedia.org/r/857623 (https://phabricator.wikimedia.org/T321309)
[14:04:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[14:05:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[14:05:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1003.eqiad.wmnet to drbd
[14:06:51] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff)
[14:07:21] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) I contacted Abhas in private, proving the request was legitimate. Thank you and apologies for any problem caused!
[14:07:57] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) a:03jcrespo
[14:09:11] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo)
[14:11:03] <wikibugs>	 (03PS2) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621
[14:12:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (owner: 10Dbrant)
[14:14:27] <wikibugs>	 (03PS3) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621
[14:14:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P39933 and previous config saved to /var/cache/conftool/dbconfig/20221116-141435-ladsgroup.json
[14:14:53] <wikibugs>	 (03Merged) 10jenkins-bot: Ensure array is passed to getProperties [extensions/PageImages] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857426 (https://phabricator.wikimedia.org/T323152) (owner: 10Matthias Mullie)
[14:15:22] <logmsgbot>	 !log mlitn@deploy1002 Started scap: Backport for [[gerrit:857426|Ensure array is passed to getProperties (T323152)]]
[14:15:27] <stashbot>	 T323152: Thumbnails not appearing in search on the beta cluster - https://phabricator.wikimedia.org/T323152
[14:15:50] <logmsgbot>	 !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:857426|Ensure array is passed to getProperties (T323152)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[14:15:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P39934 and previous config saved to /var/cache/conftool/dbconfig/20221116-141556-ladsgroup.json
[14:16:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1003.eqiad.wmnet to drbd
[14:16:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:17:30] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10jbond) Tempted to mark this as a duplicate of T255124,  As [[ https://phabricator.wikimedia.org/T255124#6215459 | mentioned there ]] i think the be...
[14:18:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P39935 and previous config saved to /var/cache/conftool/dbconfig/20221116-141851-ladsgroup.json
[14:22:32] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10MoritzMuehlenhoff) Alternatively we could simply add an Icinga alert? Something which cats the entire Apache config to one file, feeds it to apache...
[14:24:57] <logmsgbot>	 !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:857426|Ensure array is passed to getProperties (T323152)]] (duration: 09m 34s)
[14:25:03] <stashbot>	 T323152: Thumbnails not appearing in search on the beta cluster - https://phabricator.wikimedia.org/T323152
[14:25:13] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10jcrespo) > What was the specific change that was deployed. What was the specific change that caused the issue?  f76e73e6a (gitpuppet for pri...
[14:25:55] <matthiasmullie>	 !log UTC afternoon backport done
[14:25:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1003.eqiad.wmnet to plain
[14:27:07] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.changedisk (exit_code=99) for changing disk type of ml-etcd1003.eqiad.wmnet to plain
[14:27:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1003.eqiad.wmnet to plain
[14:27:16] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208 (10jcrespo) > Tempted to mark this as a duplicate of T255124  That, up to you, but the IMHO most important part mentioned at T323208#8399531 are not a...
[14:27:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1003.eqiad.wmnet to plain
[14:28:16] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] benthos: apply batching to webrequest_live (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi)
[14:29:15] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] turnilo: add cache_status to webrequest_live_sampled [puppet] - 10https://gerrit.wikimedia.org/r/857476 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey)
[14:29:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P39936 and previous config saved to /var/cache/conftool/dbconfig/20221116-142942-ladsgroup.json
[14:30:17] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:30:35] <icinga-wm>	 RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:30:39] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:30:40] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] istio: change configs to adapt for 1.15.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/855967 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey)
[14:30:49] <wikibugs>	 (03PS16) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[14:31:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T318605)', diff saved to https://phabricator.wikimedia.org/P39937 and previous config saved to /var/cache/conftool/dbconfig/20221116-143103-ladsgroup.json
[14:31:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[14:31:08] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[14:31:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[14:33:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[14:33:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P39938 and previous config saved to /var/cache/conftool/dbconfig/20221116-143358-ladsgroup.json
[14:34:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[14:34:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:34:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:34:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T323214)', diff saved to https://phabricator.wikimedia.org/P39939 and previous config saved to /var/cache/conftool/dbconfig/20221116-143432-ladsgroup.json
[14:34:37] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[14:34:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] benthos: apply batching to webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi)
[14:34:54] <wikibugs>	 (03PS2) 10Filippo Giunchedi: benthos: apply batching to webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214)
[14:35:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2] benthos: apply batching to webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi)
[14:36:28] <wikibugs>	 (03PS3) 10Filippo Giunchedi: benthos: apply batching to webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214)
[14:36:30] <wikibugs>	 (03PS2) 10Filippo Giunchedi: benthos: fix service name [puppet] - 10https://gerrit.wikimedia.org/r/857544 (https://phabricator.wikimedia.org/T319214)
[14:36:31] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:36:32] <wikibugs>	 (03PS2) 10Filippo Giunchedi: benthos: reload on config changes [puppet] - 10https://gerrit.wikimedia.org/r/857545 (https://phabricator.wikimedia.org/T319214)
[14:37:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] benthos: fix service name [puppet] - 10https://gerrit.wikimedia.org/r/857544 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi)
[14:37:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2] benthos: apply batching to webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/857619 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi)
[14:38:15] <wikibugs>	 10Puppet, 10SRE, 10SRE-tools, 10Infrastructure-Foundations, and 4 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10MoritzMuehlenhoff) 05Open→03Declined This task was opened 2.5 years ago as part of work to systematically port scripts acro...
[14:38:43] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] benthos: reload on config changes [puppet] - 10https://gerrit.wikimedia.org/r/857545 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi)
[14:39:44] <moritzm>	 !log draining ganeti1019 for eventual reimage T311687
[14:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:49] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[14:40:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] benthos: reload on config changes [puppet] - 10https://gerrit.wikimedia.org/r/857545 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi)
[14:40:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] benthos: fix service name [puppet] - 10https://gerrit.wikimedia.org/r/857544 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi)
[14:40:40] <moritzm>	 !log upgrade idp1002 to CAS 6.6 T311235
[14:40:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:44] <stashbot>	 T311235: Update CAS to 6.6 - https://phabricator.wikimedia.org/T311235
[14:40:51] <logmsgbot>	 !log krinkle@deploy1002 Started deploy [performance/navtiming@25691da]: (no justification provided)
[14:40:58] <logmsgbot>	 !log krinkle@deploy1002 Finished deploy [performance/navtiming@25691da]: (no justification provided) (duration: 00m 07s)
[14:43:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Ottomata) Hi, checking in, any updates here?   Thank you!  Also CC @BTullis and @Stevemunene
[14:43:28] <wikibugs>	 (03PS1) 10Filippo Giunchedi: benthos: fix required 'content' for absented systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/857648 (https://phabricator.wikimedia.org/T319214)
[14:43:59] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:44:03] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:44:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T323214)', diff saved to https://phabricator.wikimedia.org/P39940 and previous config saved to /var/cache/conftool/dbconfig/20221116-144448-ladsgroup.json
[14:44:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance
[14:44:54] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[14:45:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance
[14:45:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T323214)', diff saved to https://phabricator.wikimedia.org/P39941 and previous config saved to /var/cache/conftool/dbconfig/20221116-144510-ladsgroup.json
[14:45:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] benthos: fix required 'content' for absented systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/857648 (https://phabricator.wikimedia.org/T319214) (owner: 10Filippo Giunchedi)
[14:48:33] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms
[14:49:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T318605)', diff saved to https://phabricator.wikimedia.org/P39942 and previous config saved to /var/cache/conftool/dbconfig/20221116-144904-ladsgroup.json
[14:49:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[14:49:10] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[14:49:16] <wikibugs>	 (03PS17) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[14:49:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[14:49:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T318605)', diff saved to https://phabricator.wikimedia.org/P39943 and previous config saved to /var/cache/conftool/dbconfig/20221116-144926-ladsgroup.json
[14:52:32] <wikibugs>	 (03PS4) 10Cathal Mooney: Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635)
[14:57:20] <wikibugs>	 (03CR) 10Herron: [C: 03+2] dispatch: add apache redirect from default org to wikimedia org [puppet] - 10https://gerrit.wikimedia.org/r/856612 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron)
[14:58:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T323214)', diff saved to https://phabricator.wikimedia.org/P39944 and previous config saved to /var/cache/conftool/dbconfig/20221116-145826-ladsgroup.json
[14:58:32] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[14:59:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet
[15:04:53] <logmsgbot>	 !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'.
[15:04:54] <logmsgbot>	 !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[15:07:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet
[15:08:07] <wikibugs>	 (03PS8) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[15:12:41] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "Manuel is enjoying the sun of Helsinki (or lack thereof), +2ing." [software/schema-changes] - 10https://gerrit.wikimedia.org/r/857594 (https://phabricator.wikimedia.org/T323214) (owner: 10Ladsgroup)
[15:13:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P39945 and previous config saved to /var/cache/conftool/dbconfig/20221116-151333-ladsgroup.json
[15:15:03] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:15:04] <wikibugs>	 (03Merged) 10jenkins-bot: Add 2022/fix_flaggedrevs_unsigned_T323214.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/857594 (https://phabricator.wikimedia.org/T323214) (owner: 10Ladsgroup)
[15:15:43] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] C:postgresql::master: add support for multiple replicas [puppet] - 10https://gerrit.wikimedia.org/r/857561 (owner: 10Effie Mouzeli)
[15:16:00] <wikibugs>	 (03PS3) 10Effie Mouzeli: C:postgresql::master: add support for multiple replicas [puppet] - 10https://gerrit.wikimedia.org/r/857561
[15:16:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet
[15:18:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T323214)', diff saved to https://phabricator.wikimedia.org/P39946 and previous config saved to /var/cache/conftool/dbconfig/20221116-151849-ladsgroup.json
[15:18:54] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[15:19:41] <wikibugs>	 (03PS6) 10Effie Mouzeli: maps: add support for replication slots [puppet] - 10https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149)
[15:19:55] <wikibugs>	 10SRE, 10ops-codfw: Troubleshoot why latest idrac version is not working on Dell servers - https://phabricator.wikimedia.org/T322419 (10jbond) notes to self we can set the DNSRacName with   ` pp(r.request('patch', '/redfish/v1/Managers/iDRAC.Embedded.1/EthernetInterfaces/NIC.1', json={'HostName' : 'sretest1001...
[15:23:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet
[15:24:13] <moritzm>	 !log installing pixman security updates on bullseye
[15:24:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:26] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] maps: add support for replication slots [puppet] - 10https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[15:26:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet
[15:27:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] pki: move root common settings to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[15:27:32] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 03+2] "After merging I7d8fe42921149240e4a04b25a229a220055a97de, PCC is ok https://puppet-compiler.wmflabs.org/output/857505/38240/" [puppet] - 10https://gerrit.wikimedia.org/r/857505 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[15:28:07] <wikibugs>	 (03CR) 10Hnowlan: [C: 04-1] maps: enable replication slots on maps1009 and maps1008 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[15:28:23] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: move multirootca standard settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163)
[15:28:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P39947 and previous config saved to /var/cache/conftool/dbconfig/20221116-152839-ladsgroup.json
[15:29:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] hieradata: move multirootca standard settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[15:29:40] <wikibugs>	 (03PS18) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[15:31:05] <moritzm>	 !log installing vim security updates on buster
[15:31:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:19] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:33:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P39948 and previous config saved to /var/cache/conftool/dbconfig/20221116-153355-ladsgroup.json
[15:35:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet
[15:36:16] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: move multirootca standard settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163)
[15:36:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] hieradata: move multirootca standard settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[15:37:05] <wikibugs>	 (03PS1) 10JHathaway: aux-k8s: allow kubepods to talk to pki [puppet] - 10https://gerrit.wikimedia.org/r/857668 (https://phabricator.wikimedia.org/T321120)
[15:38:03] <wikibugs>	 (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38243/console" [puppet] - 10https://gerrit.wikimedia.org/r/857668 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway)
[15:38:51] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/857505 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[15:39:33] <urandom>	 !log initiating Cassandra bootstrap, aqs1017-a -- T307802
[15:39:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:38] <stashbot>	 T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802
[15:40:58] <wikibugs>	 (03PS7) 10Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149)
[15:41:09] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:41:11] <wikibugs>	 (03PS8) 10Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149)
[15:41:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet
[15:41:33] <icinga-wm>	 RECOVERY - cassandra-a service on aqs1017 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:42:13] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.16.74:7001 on aqs1017 is OK: SSL OK - Certificate aqs1017-a valid until 2024-11-08 15:06:20 +0000 (expires in 722 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:42:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[15:42:36] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] maps: add support for replication slots [puppet] - 10https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[15:42:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[15:43:13] <icinga-wm>	 PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:43:15] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] Add bullseye support. [debs/prometheus-logstash-exporter] - 10https://gerrit.wikimedia.org/r/857049 (https://phabricator.wikimedia.org/T321410) (owner: 10Cwhite)
[15:43:17] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] maps: add support for replication slots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[15:43:32] <moritzm>	 ^ expected due to ganeti2012 reboot
[15:43:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T323214)', diff saved to https://phabricator.wikimedia.org/P39950 and previous config saved to /var/cache/conftool/dbconfig/20221116-154346-ladsgroup.json
[15:43:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[15:43:51] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[15:44:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[15:44:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] pki: move root common settings to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[15:44:25] <wikibugs>	 (03CR) 10JHathaway: [V: 03+1 C: 03+2] aux-k8s: allow kubepods to talk to pki [puppet] - 10https://gerrit.wikimedia.org/r/857668 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway)
[15:44:33] <icinga-wm>	 PROBLEM - Host kubetcd2005 is DOWN: PING CRITICAL - Packet loss = 100%
[15:44:55] <wikibugs>	 (03PS3) 10Jbond: hieradata: move multirootca standard settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[15:45:33] <icinga-wm>	 RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms
[15:45:49] <icinga-wm>	 RECOVERY - Host kubetcd2005 is UP: PING OK - Packet loss = 0%, RTA = 33.37 ms
[15:46:04] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38245/console" [puppet] - 10https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[15:46:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[15:47:11] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[15:47:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet
[15:47:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[15:48:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[15:49:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P39951 and previous config saved to /var/cache/conftool/dbconfig/20221116-154902-ladsgroup.json
[15:50:35] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:51:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: move multirootca standard settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi)
[15:51:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet
[15:51:59] <wikibugs>	 (03CR) 10Effie Mouzeli: "PCC ok https://puppet-compiler.wmflabs.org/output/857077/38244/" [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[15:53:07] <icinga-wm>	 PROBLEM - Host kubestagetcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:53:20] <moritzm>	 ^ expected due to ganeti2013 reboot
[15:53:37] <wikibugs>	 (03PS4) 10Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996
[15:53:43] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38246/console" [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[15:53:54] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (10bking) >>! In T321874#8373186, @MoritzMuehlenhoff wrote: > The problems of deployment-prep are a matter of resourcing, (lack of) team ownership, processes and prioritizati...
[15:55:27] <wikibugs>	 (03PS19) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[15:55:35] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[15:58:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:59:04] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh)
[15:59:33] <wikibugs>	 (03PS20) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[15:59:44] <wikibugs>	 (03PS9) 10Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149)
[16:01:17] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1042 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:29] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] maps: enable replication slots on maps1009 and maps1008 [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[16:03:21] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:04:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T323214)', diff saved to https://phabricator.wikimedia.org/P39952 and previous config saved to /var/cache/conftool/dbconfig/20221116-160408-ladsgroup.json
[16:04:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[16:04:15] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[16:04:19] <moritzm>	 !log powercycling ganeti2013, stuck on reboot
[16:04:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[16:05:57] <wikibugs>	 (03PS21) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[16:07:28] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ganeti2013.codfw.wmnet
[16:11:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[16:11:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[16:11:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T323214)', diff saved to https://phabricator.wikimedia.org/P39953 and previous config saved to /var/cache/conftool/dbconfig/20221116-161132-ladsgroup.json
[16:11:37] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[16:11:52] <wikibugs>	 10ops-codfw: Broken disk on ganeti2013 - https://phabricator.wikimedia.org/T323220 (10MoritzMuehlenhoff)
[16:12:03] <icinga-wm>	 PROBLEM - Host ganeti2013 is DOWN: PING CRITICAL - Packet loss = 100%
[16:12:37] <wikibugs>	 (03PS22) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[16:15:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[16:15:13] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo)
[16:15:16] <wikibugs>	 (03CR) 10Vgutierrez: Varnish analytics: support differential privacy (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[16:15:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[16:15:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T321130)', diff saved to https://phabricator.wikimedia.org/P39954 and previous config saved to /var/cache/conftool/dbconfig/20221116-161522-marostegui.json
[16:15:27] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[16:15:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T318605)', diff saved to https://phabricator.wikimedia.org/P39955 and previous config saved to /var/cache/conftool/dbconfig/20221116-161529-ladsgroup.json
[16:15:34] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[16:16:24] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) For the record, the UID/CN on LDAP associated with the corporate LDAP/email is: Abhas, I updated it on the request.
[16:16:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:17:19] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[16:19:05] <wikibugs>	 10ops-codfw: Broken disk on ganeti2013 - https://phabricator.wikimedia.org/T323220 (10MoritzMuehlenhoff) The server first needs to be fully drained, before it can be shut down for maintenance, will update the task when ready.
[16:21:13] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:23:43] <wikibugs>	 (03CR) 10Vgutierrez: Varnish analytics: support differential privacy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[16:24:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321130)', diff saved to https://phabricator.wikimedia.org/P39956 and previous config saved to /var/cache/conftool/dbconfig/20221116-162444-marostegui.json
[16:24:50] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[16:25:22] <wikibugs>	 (03PS1) 10Vgutierrez: hieradata: Disable THP for jemalloc/varnish@cp2042 [puppet] - 10https://gerrit.wikimedia.org/r/857686 (https://phabricator.wikimedia.org/T322903)
[16:26:57] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1042 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:27:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T323214)', diff saved to https://phabricator.wikimedia.org/P39957 and previous config saved to /var/cache/conftool/dbconfig/20221116-162746-ladsgroup.json
[16:27:51] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[16:28:21] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] hieradata: Disable THP for jemalloc/varnish@cp2042 [puppet] - 10https://gerrit.wikimedia.org/r/857686 (https://phabricator.wikimedia.org/T322903) (owner: 10Vgutierrez)
[16:28:53] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:29:25] <wikibugs>	 (03PS3) 10Clément Goubert: apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296)
[16:30:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P39958 and previous config saved to /var/cache/conftool/dbconfig/20221116-163035-ladsgroup.json
[16:30:59] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[16:31:30] <wikibugs>	 (03PS1) 10Jcrespo: Add abhas (atripathi) to the list of LDAP only users for WMF group [puppet] - 10https://gerrit.wikimedia.org/r/857689 (https://phabricator.wikimedia.org/T323207)
[16:31:51] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[16:33:44] <wikibugs>	 (03PS2) 10Vgutierrez: hieradata: Disable THP for jemalloc/varnish globally [puppet] - 10https://gerrit.wikimedia.org/r/857686 (https://phabricator.wikimedia.org/T322903)
[16:35:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance
[16:35:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance
[16:35:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T323214)', diff saved to https://phabricator.wikimedia.org/P39959 and previous config saved to /var/cache/conftool/dbconfig/20221116-163531-ladsgroup.json
[16:35:36] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[16:36:40] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] hieradata: Disable THP for jemalloc/varnish globally [puppet] - 10https://gerrit.wikimedia.org/r/857686 (https://phabricator.wikimedia.org/T322903) (owner: 10Vgutierrez)
[16:37:06] <wikibugs>	 (03PS3) 10Clément Goubert: apple-search: Switch lvs state to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296)
[16:37:18] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (10jhathaway) > How would this be different under Ansible? > > * I could render the template live on the server before committing >   changes, so I wouldn't make the mistake...
[16:37:40] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) 05Open→03In progress
[16:39:26] <wikibugs>	 (03PS4) 10Clément Goubert: apple-search: Switch lvs state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296)
[16:39:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P39960 and previous config saved to /var/cache/conftool/dbconfig/20221116-163951-marostegui.json
[16:40:41] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] hieradata: Disable THP for jemalloc/varnish globally [puppet] - 10https://gerrit.wikimedia.org/r/857686 (https://phabricator.wikimedia.org/T322903) (owner: 10Vgutierrez)
[16:42:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P39961 and previous config saved to /var/cache/conftool/dbconfig/20221116-164253-ladsgroup.json
[16:43:11] <wikibugs>	 (03PS1) 10Hnowlan: profile::maps: remove chgrp_log [puppet] - 10https://gerrit.wikimedia.org/r/857697
[16:43:35] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:45:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P39962 and previous config saved to /var/cache/conftool/dbconfig/20221116-164542-ladsgroup.json
[16:46:17] <wikibugs>	 (03PS2) 10Clément Goubert: apple-search: Remove service from lb and backend [puppet] - 10https://gerrit.wikimedia.org/r/857691 (https://phabricator.wikimedia.org/T316296)
[16:46:23] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: maps: enable postres replication slots in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/857505 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[16:47:51] <icinga-wm>	 RECOVERY - Host ganeti2013 is UP: PING OK - Packet loss = 0%, RTA = 33.29 ms
[16:48:00] <wikibugs>	 (03PS1) 10Effie Mouzeli: maps: enable postgres replication slots in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/857704 (https://phabricator.wikimedia.org/T290149)
[16:48:44] <wikibugs>	 (03PS2) 10Jcrespo: Add abhas (atripathi) to the list of LDAP only users for WMF group [puppet] - 10https://gerrit.wikimedia.org/r/857689 (https://phabricator.wikimedia.org/T323207)
[16:49:37] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms
[16:50:00] <wikibugs>	 (03PS4) 10Andrew Bogott: Patch cinder volume_type api to allow non-uuid project ids. [puppet] - 10https://gerrit.wikimedia.org/r/857073 (https://phabricator.wikimedia.org/T301949)
[16:50:27] <icinga-wm>	 RECOVERY - Host kubestagetcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.32 ms
[16:50:30] <wikibugs>	 (03CR) 10Andrew Bogott: Patch cinder volume_type api to allow non-uuid project ids. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857073 (https://phabricator.wikimedia.org/T301949) (owner: 10Andrew Bogott)
[16:50:31] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[16:50:44] <wikibugs>	 (03PS3) 10Jcrespo: admin: Add abhas (atripathi) to the list of LDAP only users for WMF group [puppet] - 10https://gerrit.wikimedia.org/r/857689 (https://phabricator.wikimedia.org/T323207)
[16:51:13] <wikibugs>	 (03PS4) 10Jcrespo: admin: Add abhas (atripathi) to the list of LDAP only users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/857689 (https://phabricator.wikimedia.org/T323207)
[16:51:21] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[16:53:41] <icinga-wm>	 PROBLEM - MD RAID on ganeti2013 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:53:43] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ganeti2013 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T323222 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:53:47] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (10ops-monitoring-bot)
[16:53:55] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:53:57] <wikibugs>	 (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/857704/38250/" [puppet] - 10https://gerrit.wikimedia.org/r/857704 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[16:54:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P39963 and previous config saved to /var/cache/conftool/dbconfig/20221116-165457-marostegui.json
[16:55:05] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] maps: enable postgres replication slots in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/857704 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[16:57:55] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1042 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:58:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P39964 and previous config saved to /var/cache/conftool/dbconfig/20221116-165759-ladsgroup.json
[17:00:33] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38253/console" [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[17:00:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T318605)', diff saved to https://phabricator.wikimedia.org/P39965 and previous config saved to /var/cache/conftool/dbconfig/20221116-170048-ladsgroup.json
[17:00:53] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[17:06:17] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (10BCornwall) @Vgutierrez, while this doesn't have strict support for multiple ATS instances, bblack suggested that by simplifying all this it would...
[17:07:05] <wikibugs>	 (03CR) 10Vgutierrez: prometheus: Refactor ATS config monitoring (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[17:07:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[17:07:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[17:07:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T318605)', diff saved to https://phabricator.wikimedia.org/P39966 and previous config saved to /var/cache/conftool/dbconfig/20221116-170749-ladsgroup.json
[17:07:54] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[17:07:58] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] maps: enable postgres replication slots in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/857704 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli)
[17:09:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T323214)', diff saved to https://phabricator.wikimedia.org/P39967 and previous config saved to /var/cache/conftool/dbconfig/20221116-170915-ladsgroup.json
[17:09:20] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[17:10:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321130)', diff saved to https://phabricator.wikimedia.org/P39968 and previous config saved to /var/cache/conftool/dbconfig/20221116-171003-marostegui.json
[17:10:08] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[17:12:40] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/857689 (https://phabricator.wikimedia.org/T323207) (owner: 10Jcrespo)
[17:13:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T323214)', diff saved to https://phabricator.wikimedia.org/P39969 and previous config saved to /var/cache/conftool/dbconfig/20221116-171306-ladsgroup.json
[17:13:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[17:13:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[17:13:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T323214)', diff saved to https://phabricator.wikimedia.org/P39970 and previous config saved to /var/cache/conftool/dbconfig/20221116-171316-ladsgroup.json
[17:14:16] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] admin: Add abhas (atripathi) to the list of LDAP only users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/857689 (https://phabricator.wikimedia.org/T323207) (owner: 10Jcrespo)
[17:17:39] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:18:17] <wikibugs>	 (03PS1) 10Eevans: sessionstore: bump container version to v1.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/857711 (https://phabricator.wikimedia.org/T253244)
[17:19:16] <wikibugs>	 (03CR) 10Eevans: [C: 04-1] "Not yet; Scheduled for deployment on 2022-11-21" [deployment-charts] - 10https://gerrit.wikimedia.org/r/857711 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans)
[17:24:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P39971 and previous config saved to /var/cache/conftool/dbconfig/20221116-172421-ladsgroup.json
[17:25:21] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: ceph: osd: introduce support for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184)
[17:26:09] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2008.codfw.wmnet
[17:26:22] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[17:26:36] <hnowlan>	 !log resyncing maps2008 postgres
[17:26:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:46] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38254/console" [puppet] - 10https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[17:26:58] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99)
[17:27:08] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[17:27:57] <wikibugs>	 (03PS9) 10Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[17:28:47] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] apple-search: Switch lvs state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[17:29:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T323214)', diff saved to https://phabricator.wikimedia.org/P39972 and previous config saved to /var/cache/conftool/dbconfig/20221116-172924-ladsgroup.json
[17:29:29] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[17:30:15] <wikibugs>	 (03PS10) 10Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[17:30:19] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "this should be merged after I05460d5633b9143c07d009cfe5273d24b5675058, you can flag that dependency on the commit message with a Depends-O" [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[17:30:47] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38255/console" [puppet] - 10https://gerrit.wikimedia.org/r/857691 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[17:31:28] <wikibugs>	 (03PS8) 10BCornwall: prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815)
[17:31:30] <wikibugs>	 (03CR) 10BCornwall: "Vgutierrez, while this doesn't have strict support for multiple ATS instances, bblack suggested that by simplifying all this it would make" [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[17:32:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[17:32:29] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:34:03] <wikibugs>	 (03CR) 10Vgutierrez: "looking good, almost ready to be merged" [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[17:34:39] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) 05In progress→03Resolved @Abhas: [[ https://ldap.toolforge.org/user/abhas | you have been added to the WMF ldap group ]]- which should provide you access to superset. **Please check acce...
[17:36:48] <wikibugs>	 (03PS2) 10Clément Goubert: apple-search: Remove service from service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/857706 (https://phabricator.wikimedia.org/T316296)
[17:38:19] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:39:28] <wikibugs>	 (03PS4) 10Clément Goubert: apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296)
[17:39:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P39973 and previous config saved to /var/cache/conftool/dbconfig/20221116-173928-ladsgroup.json
[17:39:59] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P39975 and previous config saved to /var/cache/conftool/dbconfig/20221116-174430-ladsgroup.json
[17:44:46] <wikibugs>	 (03PS9) 10BCornwall: prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815)
[17:45:59] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[17:46:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (POST events) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:47:03] <wikibugs>	 (03PS5) 10Clément Goubert: apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296)
[17:47:10] <wikibugs>	 (03CR) 10Vgutierrez: prometheus: Refactor ATS config monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[17:48:49] <wikibugs>	 (03PS6) 10Clément Goubert: apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296)
[17:49:25] <wikibugs>	 (03PS7) 10Clément Goubert: apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296)
[17:49:41] <wikibugs>	 (03CR) 10Dzahn: "Thank you! Would you like me to wait for testing? Or can it be merged and the test is that there is no error? From my side what I can and " [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn)
[17:50:50] <wikibugs>	 (03PS5) 10Cathal Mooney: Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635)
[17:51:13] <wikibugs>	 (03PS2) 10Cathal Mooney: Unify routing-instance config across JunOS devices [homer/public] - 10https://gerrit.wikimedia.org/r/857598 (https://phabricator.wikimedia.org/T312635)
[17:51:27] <wikibugs>	 (03PS3) 10Sergio Gimeno: GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008
[17:51:51] <wikibugs>	 (03PS4) 10Sergio Gimeno: GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008
[17:51:56] <wikibugs>	 (03CR) 10Sergio Gimeno: GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 (owner: 10Sergio Gimeno)
[17:53:02] <sukhe>	 !log rolling restart of varnish to pick up changes in T322903
[17:53:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:07] <stashbot>	 T322903: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903
[17:53:12] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38259/console" [puppet] - 10https://gerrit.wikimedia.org/r/857706 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[17:54:19] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:54:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T323214)', diff saved to https://phabricator.wikimedia.org/P39976 and previous config saved to /var/cache/conftool/dbconfig/20221116-175434-ladsgroup.json
[17:54:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
[17:54:40] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[17:54:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
[17:54:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[17:55:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[17:55:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T323214)', diff saved to https://phabricator.wikimedia.org/P39977 and previous config saved to /var/cache/conftool/dbconfig/20221116-175511-ladsgroup.json
[17:56:23] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service,wmf_auto_restart_prometheus-blazegraph-exporter-wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:57:01] <wikibugs>	 (03PS10) 10BCornwall: prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815)
[17:59:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P39978 and previous config saved to /var/cache/conftool/dbconfig/20221116-175937-ladsgroup.json
[18:00:19] <wikibugs>	 (03CR) 10BCornwall: prometheus: Refactor ATS config monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[18:00:21] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[18:01:07] <wikibugs>	 (03CR) 10BCornwall: prometheus: Refactor ATS config monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[18:01:56] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38260/console" [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[18:10:20] <urbanecm>	 !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=frwiki` at mwmaint1002 (T318457)
[18:10:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:25] <stashbot>	 T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457
[18:14:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T323214)', diff saved to https://phabricator.wikimedia.org/P39979 and previous config saved to /var/cache/conftool/dbconfig/20221116-181443-ladsgroup.json
[18:14:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[18:14:49] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[18:14:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[18:15:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T323214)', diff saved to https://phabricator.wikimedia.org/P39980 and previous config saved to /var/cache/conftool/dbconfig/20221116-181505-ladsgroup.json
[18:20:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T323214)', diff saved to https://phabricator.wikimedia.org/P39981 and previous config saved to /var/cache/conftool/dbconfig/20221116-182059-ladsgroup.json
[18:21:06] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[18:22:47] <wikibugs>	 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Clement_Goubert) **Service removal plan:** From https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service 1. Silence probes : `instance...
[18:25:03] <wikibugs>	 (03PS1) 10Volans: sre.hosts.provision: disable HostHeaderCheck [cookbooks] - 10https://gerrit.wikimedia.org/r/857725
[18:25:05] <wikibugs>	 (03PS1) 10Volans: sre.hosts.provision: set iDRAC host/domain names [cookbooks] - 10https://gerrit.wikimedia.org/r/857726
[18:25:34] <wikibugs>	 (03PS1) 10Dbrant: Introduce Import button for launching deeplink into app. [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857433 (https://phabricator.wikimedia.org/T313269)
[18:25:46] <wikibugs>	 (03CR) 10Volans: "To be tested on a host but should be ready for the eqsin refresh." [cookbooks] - 10https://gerrit.wikimedia.org/r/857726 (owner: 10Volans)
[18:26:05] <wikibugs>	 (03PS1) 10Dbrant: Don't make unnecessary API call(s) for anonymized reading list preview. [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857434
[18:26:08] <wikibugs>	 (03CR) 10Volans: "To be tested on a host but should be ready for the eqsin refresh." [cookbooks] - 10https://gerrit.wikimedia.org/r/857725 (owner: 10Volans)
[18:26:36] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Ignore my previous message, was for the other CR. This one should *not* be merged before the eqsin refresh is completed!" [cookbooks] - 10https://gerrit.wikimedia.org/r/857726 (owner: 10Volans)
[18:33:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T323214)', diff saved to https://phabricator.wikimedia.org/P39982 and previous config saved to /var/cache/conftool/dbconfig/20221116-183336-ladsgroup.json
[18:33:42] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[18:36:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P39983 and previous config saved to /var/cache/conftool/dbconfig/20221116-183605-ladsgroup.json
[18:37:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T318605)', diff saved to https://phabricator.wikimedia.org/P39984 and previous config saved to /var/cache/conftool/dbconfig/20221116-183714-ladsgroup.json
[18:37:19] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[18:45:54] <wikibugs>	 (03PS1) 10Brennen Bearnes: local settings: add mysql.port [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/857734 (https://phabricator.wikimedia.org/T280597)
[18:46:21] <wikibugs>	 (03PS1) 10Dzahn: phabricator: pass missing mysql.port parameter to local settings [puppet] - 10https://gerrit.wikimedia.org/r/857736 (https://phabricator.wikimedia.org/T280597)
[18:46:36] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] local settings: add mysql.port [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/857734 (https://phabricator.wikimedia.org/T280597) (owner: 10Brennen Bearnes)
[18:47:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "should go together with https://gerrit.wikimedia.org/r/c/phabricator/deployment/+/857734" [puppet] - 10https://gerrit.wikimedia.org/r/857736 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[18:48:22] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] "Paired here. This should effectively be a no-op until scap changes are applied and a deploy is run." [puppet] - 10https://gerrit.wikimedia.org/r/857736 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[18:48:27] <wikibugs>	 (03PS1) 10Andrew Bogott: upgrade_openstack_node: Add db backups on cloudcontrols [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857737
[18:48:35] <wikibugs>	 (03PS1) 10Jbond: redfish: Add reboot message id for new idrac versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740
[18:48:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P39986 and previous config saved to /var/cache/conftool/dbconfig/20221116-184843-ladsgroup.json
[18:51:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P39987 and previous config saved to /var/cache/conftool/dbconfig/20221116-185112-ladsgroup.json
[18:52:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P39988 and previous config saved to /var/cache/conftool/dbconfig/20221116-185220-ladsgroup.json
[18:52:34] <wikibugs>	 (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] local settings: add mysql.port [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/857734 (https://phabricator.wikimedia.org/T280597) (owner: 10Brennen Bearnes)
[18:56:18] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@f68dc24]: deploy mysql.port value to local config (hopefully)
[18:56:52] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@f68dc24]: deploy mysql.port value to local config (hopefully) (duration: 00m 34s)
[18:58:54] <wikibugs>	 (03PS11) 10BCornwall: prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815)
[19:00:04] <jouncebot>	 brennen and jeena: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T1900).
[19:00:05] <jouncebot>	 brennen and jeena: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T1900).
[19:00:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] redfish: Add reboot message id for new idrac versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740 (owner: 10Jbond)
[19:00:24] <brennen>	 o/
[19:02:12] <brennen>	 !log train 1.40.0-wmf.10 (T320515) - no current blockers, rolling to group1.
[19:02:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:17] <stashbot>	 T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515
[19:03:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P39989 and previous config saved to /var/cache/conftool/dbconfig/20221116-190349-ladsgroup.json
[19:03:56] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857742 (https://phabricator.wikimedia.org/T320515)
[19:03:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857742 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot)
[19:05:19] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38261/console" [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[19:06:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T323214)', diff saved to https://phabricator.wikimedia.org/P39990 and previous config saved to /var/cache/conftool/dbconfig/20221116-190618-ladsgroup.json
[19:06:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance
[19:06:24] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[19:06:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance
[19:06:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T323214)', diff saved to https://phabricator.wikimedia.org/P39991 and previous config saved to /var/cache/conftool/dbconfig/20221116-190640-ladsgroup.json
[19:07:26] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857742 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot)
[19:07:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P39992 and previous config saved to /var/cache/conftool/dbconfig/20221116-190727-ladsgroup.json
[19:11:11] <jelto>	 !log Imported jwt-authorizer 1.1.0-1 to bullseye-wikimedia - T322691
[19:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:16] <stashbot>	 T322691: Build and import new release of jwt-authorizer (1.1.0) - https://phabricator.wikimedia.org/T322691
[19:11:45] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.10  refs T320515
[19:11:50] <stashbot>	 T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515
[19:15:28] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] "LGTM with one slight cautionary note." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (owner: 10Dbrant)
[19:15:31] <wikibugs>	 (03PS4) 10Jdlrobson: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (owner: 10Dbrant)
[19:16:01] <logmsgbot>	 !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.10  refs T320515 (duration: 04m 16s)
[19:18:17] <brennen>	 warnings here are higher than i'm really comfortable with and some canaries failed, i think i'm rolling this back to group0.
[19:18:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T323214)', diff saved to https://phabricator.wikimedia.org/P39993 and previous config saved to /var/cache/conftool/dbconfig/20221116-191856-ladsgroup.json
[19:18:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[19:19:02] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[19:19:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[19:19:23] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857746 (https://phabricator.wikimedia.org/T320515)
[19:19:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857746 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot)
[19:19:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T323214)', diff saved to https://phabricator.wikimedia.org/P39994 and previous config saved to /var/cache/conftool/dbconfig/20221116-191928-ladsgroup.json
[19:20:37] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857746 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot)
[19:21:09] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:21:24] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676)
[19:22:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez)
[19:22:20] <wikibugs>	 (03PS2) 10Andrew Bogott: upgrade_openstack_node: Add db backups on cloudcontrols [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857737
[19:22:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T318605)', diff saved to https://phabricator.wikimedia.org/P39995 and previous config saved to /var/cache/conftool/dbconfig/20221116-192233-ladsgroup.json
[19:22:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[19:22:38] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[19:22:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[19:22:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T318605)', diff saved to https://phabricator.wikimedia.org/P39996 and previous config saved to /var/cache/conftool/dbconfig/20221116-192254-ladsgroup.json
[19:23:33] <wikibugs>	 (03PS1) 10Vgutierrez: secret: Add empty varnish/dp.master.key [labs/private] - 10https://gerrit.wikimedia.org/r/857751
[19:24:47] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.8  refs T320515
[19:24:52] <stashbot>	 T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515
[19:25:47] <wikibugs>	 (03PS2) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676)
[19:26:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez)
[19:28:31] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] upgrade_openstack_node: Add db backups on cloudcontrols [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857737 (owner: 10Andrew Bogott)
[19:28:34] <logmsgbot>	 !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.8  refs T320515 (duration: 03m 46s)
[19:31:03] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:32:23] <wikibugs>	 (03Merged) 10jenkins-bot: upgrade_openstack_node: Add db backups on cloudcontrols [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857737 (owner: 10Andrew Bogott)
[19:32:40] <wikibugs>	 (03PS3) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676)
[19:33:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez)
[19:33:40] <wikibugs>	 (03PS1) 10Slyngshede: If bug in configuration parser. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/857756
[19:34:19] <wikibugs>	 (03PS4) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676)
[19:35:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T323214)', diff saved to https://phabricator.wikimedia.org/P39997 and previous config saved to /var/cache/conftool/dbconfig/20221116-193540-ladsgroup.json
[19:35:46] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[19:36:42] <wikibugs>	 (03PS5) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621
[19:38:49] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:40:35] <wikibugs>	 (03PS5) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676)
[19:40:40] <wikibugs>	 (03CR) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (owner: 10Dbrant)
[19:40:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T323214)', diff saved to https://phabricator.wikimedia.org/P39998 and previous config saved to /var/cache/conftool/dbconfig/20221116-194040-ladsgroup.json
[19:42:53] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] secret: Add empty varnish/dp.master.key [labs/private] - 10https://gerrit.wikimedia.org/r/857751 (owner: 10Vgutierrez)
[19:44:29] <wikibugs>	 (03PS2) 10Jbond: redfish: Add reboot message id for new idrac versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740
[19:44:45] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:47:39] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration management tooling - https://phabricator.wikimedia.org/T321874 (10bking) >>! In T321874#8399960, @jhathaway wrote: >> How would this be different under Ansible? >> >> * I could render the template live on the server before committing >>...
[19:48:00] <wikibugs>	 (03PS3) 10Jbond: redfish: Add reboot message id for new idrac versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740 (https://phabricator.wikimedia.org/T322419)
[19:48:33] <wikibugs>	 (03PS6) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676)
[19:49:31] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
[19:49:45] <wikibugs>	 (03PS4) 10Jbond: redfish: Add reboot message id for new idrac versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740 (https://phabricator.wikimedia.org/T322419)
[19:50:25] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: postgresql@11-main.service,prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:50:32] <wikibugs>	 (03PS7) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676)
[19:50:44] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney)
[19:50:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P39999 and previous config saved to /var/cache/conftool/dbconfig/20221116-195046-ladsgroup.json
[19:51:53] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38264/console" [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez)
[19:52:41] <wikibugs>	 (03PS1) 10Andrew Bogott: upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761
[19:54:28] <wikibugs>	 (03PS8) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676)
[19:55:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez)
[19:55:16] <vgutierrez>	 sigh
[19:55:20] <vgutierrez>	 time to stop working I guess
[19:55:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P40000 and previous config saved to /var/cache/conftool/dbconfig/20221116-195546-ladsgroup.json
[19:56:10] <wikibugs>	 (03PS9) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676)
[19:58:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 (owner: 10Andrew Bogott)
[19:59:22] <wikibugs>	 (03PS2) 10Andrew Bogott: upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761
[20:02:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 (owner: 10Andrew Bogott)
[20:02:46] <wikibugs>	 (03PS1) 10Urbanecm: updateIsActiveFlagForMentees: Treat "no edits" user correctly [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/857437 (https://phabricator.wikimedia.org/T318457)
[20:03:02] <wikibugs>	 (03PS1) 10Urbanecm: updateIsActiveFlagForMentees: Treat "no edits" user correctly [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857438 (https://phabricator.wikimedia.org/T318457)
[20:03:20] <wikibugs>	 (03CR) 10Volans: "Couple of optional suggestions inline" [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez)
[20:03:34] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740 (https://phabricator.wikimedia.org/T322419) (owner: 10Jbond)
[20:05:29] <wikibugs>	 (03PS3) 10Andrew Bogott: upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761
[20:05:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P40001 and previous config saved to /var/cache/conftool/dbconfig/20221116-200553-ladsgroup.json
[20:08:58] <wikibugs>	 (03PS1) 10Jforrester: [Beta Cluster] Point statsd service to prometheus-labmon, cloudmetrics1001 decom'ed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857763 (https://phabricator.wikimedia.org/T297712)
[20:09:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 (owner: 10Andrew Bogott)
[20:10:45] <wikibugs>	 (03PS1) 10Jforrester: changeprop: Point Beta Cluster metrics to prometheus-labmon, cloudmetrics1002 is gone [deployment-charts] - 10https://gerrit.wikimedia.org/r/857765 (https://phabricator.wikimedia.org/T297712)
[20:10:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P40002 and previous config saved to /var/cache/conftool/dbconfig/20221116-201053-ladsgroup.json
[20:12:54] <wikibugs>	 (03Abandoned) 10BCornwall: prometheus: Handle inactive trafficserver service [puppet] - 10https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[20:14:28] <wikibugs>	 (03PS4) 10Andrew Bogott: upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761
[20:15:48] <wikibugs>	 (03CR) 10Vgutierrez: varnish: Generate a DP subkey daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez)
[20:18:16] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 (owner: 10Andrew Bogott)
[20:21:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T323214)', diff saved to https://phabricator.wikimedia.org/P40003 and previous config saved to /var/cache/conftool/dbconfig/20221116-202100-ladsgroup.json
[20:21:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[20:21:06] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[20:21:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[20:21:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T323214)', diff saved to https://phabricator.wikimedia.org/P40004 and previous config saved to /var/cache/conftool/dbconfig/20221116-202121-ladsgroup.json
[20:21:54] <wikibugs>	 (03Merged) 10jenkins-bot: upgrade_openstack_node: Backup databases regardless of what node is upgraded [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/857761 (owner: 10Andrew Bogott)
[20:22:41] <wikibugs>	 (03PS10) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676)
[20:24:23] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:25:35] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:26:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T323214)', diff saved to https://phabricator.wikimedia.org/P40005 and previous config saved to /var/cache/conftool/dbconfig/20221116-202602-ladsgroup.json
[20:26:09] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[20:30:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] redfish: Add reboot message id for new idrac versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740 (https://phabricator.wikimedia.org/T322419) (owner: 10Jbond)
[20:30:17] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:36:31] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:37:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T323214)', diff saved to https://phabricator.wikimedia.org/P40006 and previous config saved to /var/cache/conftool/dbconfig/20221116-203749-ladsgroup.json
[20:37:57] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[20:41:27] <sukhe>	 !log [finished] rolling restart of varnish to pick up changes in T322903
[20:41:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:31] <stashbot>	 T322903: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903
[20:44:26] <wikibugs>	 (03Merged) 10jenkins-bot: redfish: Add reboot message id for new idrac versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/857740 (https://phabricator.wikimedia.org/T322419) (owner: 10Jbond)
[20:48:10] <thcipriani>	 jouncebot: now
[20:48:10] <jouncebot>	 For the next 0 hour(s) and 11 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T1900)
[20:52:27] <thcipriani>	 brennen: am I interfering with train if I kick jenkins real quick?
[20:52:45] <brennen>	 thcipriani: go for it.
[20:52:49] * thcipriani does
[20:52:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P40007 and previous config saved to /var/cache/conftool/dbconfig/20221116-205255-ladsgroup.json
[20:53:18] <thcipriani>	 !log restarting jenkins for update
[20:53:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T318605)', diff saved to https://phabricator.wikimedia.org/P40008 and previous config saved to /var/cache/conftool/dbconfig/20221116-205347-ladsgroup.json
[20:53:52] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[20:55:35] <wikibugs>	 (03PS1) 10Urbanecm: GrowthExperiments: Run updateIsActiveFlagForMentees weekly [puppet] - 10https://gerrit.wikimedia.org/r/857776 (https://phabricator.wikimedia.org/T318457)
[20:56:13] <wikibugs>	 (03CR) 10Urbanecm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/857776 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm)
[20:57:57] <wikibugs>	 (03PS2) 10Urbanecm: [Growth] Do not override wgGEMentorshipUseIsActiveFlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853482 (https://phabricator.wikimedia.org/T318457)
[20:59:17] <wikibugs>	 (03PS6) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269)
[20:59:42] <wikibugs>	 (03CR) 10Andrea Denisse: Lower the TTL for netbox for the migration. (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/856065 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T2100).
[21:00:04] <jouncebot>	 dbrant and Urbanecm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:06] <wikibugs>	 (03Abandoned) 10Andrea Denisse: Lower the TTL for netbox for the migration. [dns] - 10https://gerrit.wikimedia.org/r/856065 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[21:00:08] <wikibugs>	 (03CR) 10Urbanecm: Enable Reading Lists landing page on a few smaller wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant)
[21:00:17] <urbanecm>	 I can deploy today
[21:00:21] <urbanecm>	 hi dbrant, are you around?
[21:00:34] * dbrant is present
[21:00:43] <urbanecm>	 great!
[21:00:48] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Don't make unnecessary API call(s) for anonymized reading list preview. [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857434 (owner: 10Dbrant)
[21:00:54] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Introduce Import button for launching deeplink into app. [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857433 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant)
[21:00:59] <urbanecm>	 dbrant: I posted a quick question in the config patch, can you have a look please?
[21:01:15] <dbrant>	 urbanecm: yep, looking
[21:02:06] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] updateIsActiveFlagForMentees: Treat "no edits" user correctly [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/857437 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm)
[21:02:12] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] updateIsActiveFlagForMentees: Treat "no edits" user correctly [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857438 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm)
[21:03:12] <wikibugs>	 (03Merged) 10jenkins-bot: Don't make unnecessary API call(s) for anonymized reading list preview. [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857434 (owner: 10Dbrant)
[21:03:18] <wikibugs>	 (03Merged) 10jenkins-bot: Introduce Import button for launching deeplink into app. [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857433 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant)
[21:03:33] <wikibugs>	 (03CR) 10Dbrant: Enable Reading Lists landing page on a few smaller wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant)
[21:03:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857434 (owner: 10Dbrant)
[21:03:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/ReadingLists] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857433 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant)
[21:04:09] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:857434|Don't make unnecessary API call(s) for anonymized reading list preview.]], [[gerrit:857433|Introduce Import button for launching deeplink into app. (T313269)]]
[21:04:14] <stashbot>	 T313269: Shareable Reading Lists - https://phabricator.wikimedia.org/T313269
[21:04:36] <wikibugs>	 (03CR) 10Urbanecm: Enable Reading Lists landing page on a few smaller wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant)
[21:08:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P40009 and previous config saved to /var/cache/conftool/dbconfig/20221116-210802-ladsgroup.json
[21:08:42] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[21:08:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P40010 and previous config saved to /var/cache/conftool/dbconfig/20221116-210854-ladsgroup.json
[21:09:05] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and dbrant: Backport for [[gerrit:857434|Don't make unnecessary API call(s) for anonymized reading list preview.]], [[gerrit:857433|Introduce Import button for launching deeplink into app. (T313269)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[21:09:17] <urbanecm>	 dbrant: can you check the two backports at mwdebug1001 now please?
[21:10:19] <dbrant>	 checking, and...
[21:10:57] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[21:11:44] <dbrant>	 urbanecm: I believe it's updated, but it's also dependent on the config change.
[21:11:56] <urbanecm>	 i see, we can do that one next :)
[21:16:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:18:20] <wikibugs>	 (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez)
[21:19:26] <wikibugs>	 (03PS7) 10Urbanecm: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant)
[21:19:32] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant)
[21:20:34] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Reading Lists landing page on a few smaller wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant)
[21:20:36] <wikibugs>	 (03Merged) 10jenkins-bot: updateIsActiveFlagForMentees: Treat "no edits" user correctly [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/857437 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm)
[21:20:39] <wikibugs>	 (03Merged) 10jenkins-bot: updateIsActiveFlagForMentees: Treat "no edits" user correctly [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857438 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm)
[21:21:44] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:857434|Don't make unnecessary API call(s) for anonymized reading list preview.]], [[gerrit:857433|Introduce Import button for launching deeplink into app. (T313269)]] (duration: 17m 34s)
[21:21:49] <stashbot>	 T313269: Shareable Reading Lists - https://phabricator.wikimedia.org/T313269
[21:22:10] <urbanecm>	 finally
[21:22:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857621 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant)
[21:22:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/857437 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm)
[21:22:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857438 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm)
[21:22:55] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:857621|Enable Reading Lists landing page on a few smaller wikis. (T313269)]], [[gerrit:857437|updateIsActiveFlagForMentees: Treat "no edits" user correctly (T318457)]], [[gerrit:857438|updateIsActiveFlagForMentees: Treat "no edits" user correctly (T318457)]]
[21:23:01] <stashbot>	 T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457
[21:23:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T323214)', diff saved to https://phabricator.wikimedia.org/P40011 and previous config saved to /var/cache/conftool/dbconfig/20221116-212309-ladsgroup.json
[21:23:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[21:23:14] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[21:23:21] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and urbanecm and dbrant: Backport for [[gerrit:857621|Enable Reading Lists landing page on a few smaller wikis. (T313269)]], [[gerrit:857437|updateIsActiveFlagForMentees: Treat "no edits" user correctly (T318457)]], [[gerrit:857438|updateIsActiveFlagForMentees: Treat "no edits" user correctly (T318457)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[21:23:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[21:23:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T323214)', diff saved to https://phabricator.wikimedia.org/P40012 and previous config saved to /var/cache/conftool/dbconfig/20221116-212330-ladsgroup.json
[21:23:35] <urbanecm>	 dbrant: config patch's at mwdebug1001 now, can you check?
[21:24:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P40013 and previous config saved to /var/cache/conftool/dbconfig/20221116-212400-ladsgroup.json
[21:24:26] <dbrant>	 urbanecm: yay! looks good
[21:24:52] <urbanecm>	 great, syncing!
[21:29:01] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:857621|Enable Reading Lists landing page on a few smaller wikis. (T313269)]], [[gerrit:857437|updateIsActiveFlagForMentees: Treat "no edits" user correctly (T318457)]], [[gerrit:857438|updateIsActiveFlagForMentees: Treat "no edits" user correctly (T318457)]] (duration: 06m 05s)
[21:29:02] <wikibugs>	 (03PS3) 10Urbanecm: [Growth] Do not override wgGEMentorshipUseIsActiveFlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853482 (https://phabricator.wikimedia.org/T318457)
[21:29:05] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] [Growth] Do not override wgGEMentorshipUseIsActiveFlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853482 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm)
[21:29:07] <stashbot>	 T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457
[21:29:07] <stashbot>	 T313269: Shareable Reading Lists - https://phabricator.wikimedia.org/T313269
[21:29:11] <urbanecm>	 dbrant: and all live!
[21:29:13] <urbanecm>	 anything else?
[21:29:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853482 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm)
[21:29:36] <dbrant>	 urbanecm: awesome, thanks as always!
[21:29:42] <urbanecm>	 no worries :)
[21:30:05] <wikibugs>	 (03PS8) 10Andrea Denisse: netmon: Open LibreNMS port for netmon2002. [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523)
[21:30:19] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] Do not override wgGEMentorshipUseIsActiveFlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853482 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm)
[21:30:42] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:853482|[Growth] Do not override wgGEMentorshipUseIsActiveFlag (T318457)]]
[21:31:06] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:853482|[Growth] Do not override wgGEMentorshipUseIsActiveFlag (T318457)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[21:31:55] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38266/console" [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[21:33:12] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] Add w/api/index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856030 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup)
[21:35:43] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[21:37:26] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:853482|[Growth] Do not override wgGEMentorshipUseIsActiveFlag (T318457)]] (duration: 06m 43s)
[21:37:32] <stashbot>	 T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457
[21:37:41] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[21:37:44] <urbanecm>	 that should be all from me
[21:38:03] <urbanecm>	 !log Late UTC backport window done
[21:38:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T318605)', diff saved to https://phabricator.wikimedia.org/P40014 and previous config saved to /var/cache/conftool/dbconfig/20221116-213907-ladsgroup.json
[21:39:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[21:39:12] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[21:39:22] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics@e08e32e]: (no justification provided)
[21:39:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[21:39:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T318605)', diff saved to https://phabricator.wikimedia.org/P40015 and previous config saved to /var/cache/conftool/dbconfig/20221116-213928-ladsgroup.json
[21:39:43] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@e08e32e]: (no justification provided) (duration: 00m 20s)
[21:41:39] <urbanecm>	 !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php` for all wikis in growthexperiments.dblist (T318457)
[21:41:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:57] <wikibugs>	 (03PS9) 10Andrea Denisse: netmon: Open LibreNMS port for netmon2002. [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523)
[21:43:02] <wikibugs>	 (03PS1) 10Herron: dispatch: upgrade to 20221110 and build with local config.js [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857781 (https://phabricator.wikimedia.org/T313229)
[21:43:30] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38267/console" [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[21:46:12] <wikibugs>	 (03CR) 10Herron: "Approaching these at the same time since config.js changed significantly between versions" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857781 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron)
[21:47:20] <wikibugs>	 (03PS1) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783
[21:55:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond)
[21:55:59] <wikibugs>	 (03PS1) 10Brennen Bearnes: specialpage: Silence known violation unsafe RequestContext changes [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857439 (https://phabricator.wikimedia.org/T323184)
[21:56:23] <brennen>	 jouncebot: nowandnext
[21:56:23] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T2100)
[21:56:23] <jouncebot>	 In 9 hour(s) and 3 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T0700)
[21:58:49] <wikibugs>	 10SRE, 10ops-codfw: Broken disk on ganeti2013 - https://phabricator.wikimedia.org/T323220 (10Dzahn) possibly duplicate of automatically generated T323222
[21:59:00] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] specialpage: Silence known violation unsafe RequestContext changes [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857439 (https://phabricator.wikimedia.org/T323184) (owner: 10Brennen Bearnes)
[22:03:39] <wikibugs>	 (03PS2) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783
[22:04:31] <wikibugs>	 (03PS1) 10Urbanecm: GrowthExperiments: Enable unstarred mentorship filters at all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857785 (https://phabricator.wikimedia.org/T318457)
[22:07:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T323214)', diff saved to https://phabricator.wikimedia.org/P40016 and previous config saved to /var/cache/conftool/dbconfig/20221116-220710-ladsgroup.json
[22:07:15] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[22:11:35] <wikibugs>	 (03PS1) 10JHathaway: aux-k8s: fix pod ips for network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/857786 (https://phabricator.wikimedia.org/T321120)
[22:14:05] <wikibugs>	 (03Merged) 10jenkins-bot: specialpage: Silence known violation unsafe RequestContext changes [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857439 (https://phabricator.wikimedia.org/T323184) (owner: 10Brennen Bearnes)
[22:15:47] <wikibugs>	 10SRE, 10Traffic-Icebox: Create dashboard showing aggregate data transfer rates per DC/cluster - https://phabricator.wikimedia.org/T284304 (10BCornwall) Thanks for all the feedback @Vgutierrez and @BBlack! Hopefully I've addressed all of your concerns. The dashboard at https://grafana.wikimedia.org/d/oMIu2XI4z...
[22:16:13] <wikibugs>	 10SRE, 10Traffic-Icebox: Create dashboard showing aggregate data transfer rates per DC/cluster - https://phabricator.wikimedia.org/T284304 (10BCornwall) 05Open→03In progress
[22:17:16] <wikibugs>	 (03PS3) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783
[22:18:02] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] aux-k8s: fix pod ips for network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/857786 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway)
[22:18:41] <wikibugs>	 (03PS1) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857788 (https://phabricator.wikimedia.org/T273179)
[22:20:38] <logmsgbot>	 !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'.
[22:20:41] <logmsgbot>	 !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[22:20:48] <logmsgbot>	 !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'.
[22:20:52] <logmsgbot>	 !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[22:22:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P40017 and previous config saved to /var/cache/conftool/dbconfig/20221116-222216-ladsgroup.json
[22:24:14] <wikibugs>	 (03PS1) 10Ladsgroup: wikimedia.org portal: Make portal assets also visible in the vhost [puppet] - 10https://gerrit.wikimedia.org/r/857789 (https://phabricator.wikimedia.org/T273179)
[22:27:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/857439 (https://phabricator.wikimedia.org/T323184) (owner: 10Brennen Bearnes)
[22:27:45] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:857439|specialpage: Silence known violation unsafe RequestContext changes (T323184)]]
[22:27:50] <stashbot>	 T323184: Special page transclusion: PHP Notice: Unexpected clearActionName after getActionName already called - https://phabricator.wikimedia.org/T323184
[22:28:11] <logmsgbot>	 !log brennen@deploy1002 brennen and brennen: Backport for [[gerrit:857439|specialpage: Silence known violation unsafe RequestContext changes (T323184)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[22:28:30] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] wikimedia.org portal: Make portal assets also visible in the vhost [puppet] - 10https://gerrit.wikimedia.org/r/857789 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup)
[22:32:06] <wikibugs>	 (03PS4) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783
[22:33:35] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:857439|specialpage: Silence known violation unsafe RequestContext changes (T323184)]] (duration: 05m 50s)
[22:33:41] <stashbot>	 T323184: Special page transclusion: PHP Notice: Unexpected clearActionName after getActionName already called - https://phabricator.wikimedia.org/T323184
[22:35:11] <wikibugs>	 (03PS5) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783
[22:36:15] <brennen>	 !log train 1.40.0-wmf.10 (T320515) - blocker seems resolved, making one attempt to roll to group1 again.
[22:36:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:36:20] <stashbot>	 T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515
[22:36:38] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857792 (https://phabricator.wikimedia.org/T320515)
[22:36:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857792 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot)
[22:37:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[22:37:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P40018 and previous config saved to /var/cache/conftool/dbconfig/20221116-223722-ladsgroup.json
[22:37:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[22:37:42] <wikibugs>	 (03CR) 10Jbond: redfish: add update commands using the patch method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond)
[22:38:10] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857792 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot)
[22:41:18] <wikibugs>	 (03PS1) 10Ladsgroup: mediawiki: Get rid of extract2.php module [puppet] - 10https://gerrit.wikimedia.org/r/857793
[22:42:13] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.10  refs T320515
[22:42:18] <stashbot>	 T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515
[22:43:30] <wikibugs>	 (03PS1) 10Ladsgroup: Get rid of extract2.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857794
[22:43:57] <Amir1>	 jouncebot: nowandnext
[22:43:57] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 16 minute(s)
[22:43:57] <jouncebot>	 In 8 hour(s) and 16 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T0700)
[22:44:09] <Amir1>	 oh noicio
[22:44:18] <Amir1>	 brennen: can I make some fire?
[22:44:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond)
[22:45:03] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[22:45:35] <wikibugs>	 (PS2) Ladsgroup: mediawiki: Get rid of extract2.php redirect [puppet] - https://gerrit.wikimedia.org/r/857793
[22:46:08] <logmsgbot>	 !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.10  refs T320515 (duration: 03m 54s)
[22:46:41] <wikibugs>	 (PS6) Jbond: redfish: add update commands using the patch method [software/spicerack] - https://gerrit.wikimedia.org/r/857783
[22:46:57] <brennen>	 Amir1: i'm trying to decide whether to roll back again based on number of notices at the moment
[22:47:25] <Amir1>	 let me know once you're done. I have no rush, I have to wait for puppet to take effect anyway
[22:52:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T323214)', diff saved to https://phabricator.wikimedia.org/P40019 and previous config saved to /var/cache/conftool/dbconfig/20221116-225229-ladsgroup.json
[22:52:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[22:52:36] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[22:52:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[22:52:52] <brennen>	 Amir1: go ahead
[22:53:02] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[22:53:46] <Amir1>	 awesome
[22:54:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:54:58] <wikibugs>	 (CR) Ladsgroup: [C: +2] Bump portals to HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/857788 (https://phabricator.wikimedia.org/T273179) (owner: Ladsgroup)
[22:55:46] <wikibugs>	 (Merged) jenkins-bot: Bump portals to HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/857788 (https://phabricator.wikimedia.org/T273179) (owner: Ladsgroup)
[22:57:09] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[22:58:06] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[22:58:27] <Amir1>	 works fine in mwdebug1001, moving forward
[22:58:42] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[22:59:40] <brennen>	 Amir1: holler when you're done.  i might roll this back out of an abundance of caution before i step afk for the day.
[22:59:49] <Amir1>	 sure
[23:01:30] <brennen>	 thanks. :)
[23:02:07] <brennen>	 meanwhile: tea.
[23:03:52] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized portals/wikipedia.org/assets: (no justification provided) (duration: 03m 49s)
[23:04:40] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[23:05:26] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[23:07:12] <Amir1>	 I'm not done yet but if it's something I can fix to avoid train being stuck, can you tell me? is it the same blocker?
[23:07:29] <brennen>	 https://phabricator.wikimedia.org/T323184#8401081
[23:07:41] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized portals: (no justification provided) (duration: 03m 48s)
[23:07:46] <brennen>	 just a lot of noise, i think
[23:08:03] <brennen>	 but tends to make us nervous about canaries and other things getting lost in error rates.
[23:08:17] <Amir1>	 you makes sense
[23:08:20] <Amir1>	 *yeah
[23:08:33] <wikibugs>	 (PS4) Ladsgroup: Add w/api/index.html [mediawiki-config] - https://gerrit.wikimedia.org/r/856030 (https://phabricator.wikimedia.org/T273179)
[23:08:37] <wikibugs>	 (CR) Ladsgroup: [C: +2] Add w/api/index.html [mediawiki-config] - https://gerrit.wikimedia.org/r/856030 (https://phabricator.wikimedia.org/T273179) (owner: Ladsgroup)
[23:09:05] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/856030 (https://phabricator.wikimedia.org/T273179) (owner: Ladsgroup)
[23:09:31] <wikibugs>	 (Merged) jenkins-bot: Add w/api/index.html [mediawiki-config] - https://gerrit.wikimedia.org/r/856030 (https://phabricator.wikimedia.org/T273179) (owner: Ladsgroup)
[23:09:58] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:856030|Add w/api/index.html (T273179)]]
[23:09:59] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:10:03] <stashbot>	 T273179: Update the front-page of Wikimedia projects - https://phabricator.wikimedia.org/T273179
[23:10:22] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:856030|Add w/api/index.html (T273179)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[23:11:29] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:12:37] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[23:12:39] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:12:51] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:13:09] <icinga-wm>	 PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:15:24] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:856030|Add w/api/index.html (T273179)]] (duration: 05m 26s)
[23:15:29] <stashbot>	 T273179: Update the front-page of Wikimedia projects - https://phabricator.wikimedia.org/T273179
[23:15:56] <Amir1>	 brennen: I'm good for now :)
[23:16:30] <brennen>	 Amir1: cool, thanks.  rolling train back to group0 for the moment.
[23:16:49] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/857799 (https://phabricator.wikimedia.org/T320515)
[23:16:51] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/857799 (https://phabricator.wikimedia.org/T320515) (owner: TrainBranchBot)
[23:16:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T318605)', diff saved to https://phabricator.wikimedia.org/P40020 and previous config saved to /var/cache/conftool/dbconfig/20221116-231654-ladsgroup.json
[23:16:59] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[23:17:29] <wikibugs>	 (PS3) Ladsgroup: mediawiki: Get rid of extract2.php rewrites [puppet] - https://gerrit.wikimedia.org/r/857793
[23:17:34] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/857799 (https://phabricator.wikimedia.org/T320515) (owner: TrainBranchBot)
[23:20:09] <wikibugs>	 (PS2) Ladsgroup: Get rid of extract2.php [mediawiki-config] - https://gerrit.wikimedia.org/r/857794 (https://phabricator.wikimedia.org/T273179)
[23:21:42] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.8  refs T320515
[23:21:47] <stashbot>	 T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515
[23:24:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:25:26] <logmsgbot>	 !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.8  refs T320515 (duration: 03m 43s)
[23:26:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[23:26:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[23:32:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P40021 and previous config saved to /var/cache/conftool/dbconfig/20221116-233200-ladsgroup.json
[23:38:41] <wikibugs>	 (CR) Andrew Bogott: [C: +2] changeprop: Point Beta Cluster metrics to prometheus-labmon, cloudmetrics1002 is gone [deployment-charts] - https://gerrit.wikimedia.org/r/857765 (https://phabricator.wikimedia.org/T297712) (owner: Jforrester)
[23:39:43] <wikibugs>	 (CR) Andrew Bogott: [C: +2] [Beta Cluster] Point statsd service to prometheus-labmon, cloudmetrics1001 decom'ed [mediawiki-config] - https://gerrit.wikimedia.org/r/857763 (https://phabricator.wikimedia.org/T297712) (owner: Jforrester)
[23:42:53] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:43:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance
[23:43:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance
[23:43:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T323214)', diff saved to https://phabricator.wikimedia.org/P40022 and previous config saved to /var/cache/conftool/dbconfig/20221116-234323-ladsgroup.json
[23:43:28] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[23:47:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P40023 and previous config saved to /var/cache/conftool/dbconfig/20221116-234708-ladsgroup.json