[00:00:55] RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:13:21] PROBLEM - Check systemd state on an-worker1127 is CRITICAL: CRITICAL - degraded: The following units failed: user-runtime-dir@116.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:53] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:26] SRE, Performance-Team, Traffic, serviceops: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[00:27:41] SRE, Performance-Team, Traffic, serviceops: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[00:34:54] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:13] SRE, Performance-Team, Traffic, serviceops: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[00:42:07] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:09] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:54:19] SRE, Performance-Team, Traffic, serviceops: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[01:14:44] some slow pageloads and 503s
[01:14:45] From #-tech:
[01:14:45] Dragonfly6-7 I'm trying to create a page on Commons and I've twice gotten the error message:
[01:14:46] [21:14:01] Dragonfly6-7 upstream connect error or disconnect/reset before
headers. reset reason: overflow
[01:14:46] [21:14:21] Dragonfly6-7 thrice now
[01:14:55] ha, beat you by a second
[01:14:59] RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:00] :(. Want to page?
[01:15:27] above my paygrade. hey TheresNoTime you're online
[01:15:34] I am
[01:15:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[01:15:39] jinxer-wm wins
[01:15:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[01:16:07] hi
[01:16:11] Call me a book, cos I got pages
[01:16:18] (ProbeDown) firing: (5) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:16:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:16:34] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 -
https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[01:17:00] Monitoring was a bit slow on that one to be honest, I noticed timeouts before jinxer piped up
[01:17:29] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[01:19:59] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[01:20:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[01:20:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[01:21:18] (ProbeDown) resolved: (5) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:21:18] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes
(http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:21:34] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[01:21:46] this should be resolved, still digging a little but speak up if you still see errors or slowness :)
[01:22:27] PROBLEM - Check systemd state on an-worker1127 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,user-runtime-dir@116.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:34:59] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:37:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:03] rzl, it's back
[01:45:27] thanks
[01:46:00] Hello, is there any kind of maintenance on MediaWiki.org?
[01:46:14] I'm asking because when I want to go there, I get this:
[01:46:15] upstream connect error or disconnect/reset before headers.
reset reason: overflow
[01:46:18] (ProbeDown) firing: (21) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:46:20] yes, it's everything
[01:46:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[01:46:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[01:46:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[01:46:55] Vermont: You think like Wikipedia and on all other projects?
[01:47:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:48:17] (PHPFPMTooBusy) firing: Not enough idle php7.2-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:48:47] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 129 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:51:03] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:51:18] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:51:35] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[01:51:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[01:51:42] sorry for the trouble :) everything should be recovered now, let us know if you're still having problems
[01:51:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[01:52:18] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:52:45] (JobUnavailable) resolved: (4) Reduced availability for job nginx in ops@eqiad -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:17] (PHPFPMTooBusy) resolved: Not enough idle php7.2-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:36:25] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:52:21] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:03:37] PROBLEM - puppet last run on an-worker1127 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:05:25] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:14:47] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:36:17] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:05:43] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:37:41] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0)
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:38:26] SRE, SRE-OnFire, Sustainability (Incident Followup): Klaxon redirects to http://klaxon.wikimedia.org (not https) - https://phabricator.wikimedia.org/T308941 (lmata)
[05:00:01] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:07:09] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:14:33] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:17:03] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:21:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:21:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:21:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2079.codfw.wmnet with reason: Maintenance
[05:21:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2079.codfw.wmnet with reason: Maintenance
[05:21:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 15 hosts with reason: Maintenance
[05:22:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 15 hosts with reason: Maintenance
[05:22:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[05:22:30] !log marostegui@cumin1001 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[05:22:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[05:22:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[05:22:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T313070)', diff saved to https://phabricator.wikimedia.org/P31257 and previous config saved to /var/cache/conftool/dbconfig/20220718-052250-marostegui.json
[05:22:54] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070
[05:23:30] (PS1) Marostegui: instances.yaml: Remove db2082 from dbctl [puppet] - https://gerrit.wikimedia.org/r/814400 (https://phabricator.wikimedia.org/T313003)
[05:25:11] (CR) Marostegui: [C: +2] instances.yaml: Remove db2082 from dbctl [puppet] - https://gerrit.wikimedia.org/r/814400 (https://phabricator.wikimedia.org/T313003) (owner: Marostegui)
[05:26:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2082 T313003', diff saved to https://phabricator.wikimedia.org/P31258 and previous config saved to /var/cache/conftool/dbconfig/20220718-052605-marostegui.json
[05:26:09] T313003: decommission db2082 - https://phabricator.wikimedia.org/T313003
[05:26:16] SRE, Observability-Metrics, User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482 (lmata) Open→Declined this is superseded by grizzly, which is already in production for SLO dashboarding
[05:26:45] (PS1) Marostegui: mariadb: Remove db2082 from puppet [puppet] - https://gerrit.wikimedia.org/r/814510 (https://phabricator.wikimedia.org/T313003)
[05:34:01] SRE, SRE-OnFire, Shellbox, serviceops, Sustainability (Incident Followup): Shellbox resource management -
https://phabricator.wikimedia.org/T310557 (Legoktm) https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SyntaxHighlight_GeSHi/+/812911 is related I believe.
[05:36:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2082.codfw.wmnet
[05:39:50] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[05:43:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:44:09] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) db1111 had issues a few hours ago and it had performance schema disabled: ` mysql:root@localhost [(none)]> show global...
[05:44:59] (CR) Marostegui: [C: +2] mariadb: Remove db2082 from puppet [puppet] - https://gerrit.wikimedia.org/r/814510 (https://phabricator.wikimedia.org/T313003) (owner: Marostegui)
[05:46:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2082.codfw.wmnet
[05:46:52] ops-codfw, decommission-hardware: decommission db2082 - https://phabricator.wikimedia.org/T313003 (Marostegui) a:Papaul
[05:46:58] ops-codfw, decommission-hardware: decommission db2082 - https://phabricator.wikimedia.org/T313003 (Marostegui) @Papaul host ready for you
[05:48:12] (PS1) Marostegui: instances.yaml: Add db2166 to dbctl [puppet] - https://gerrit.wikimedia.org/r/814572 (https://phabricator.wikimedia.org/T311493)
[05:49:11] (CR) Marostegui: [C: +2] instances.yaml: Add db2166 to dbctl [puppet] - https://gerrit.wikimedia.org/r/814572 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:50:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2166 to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P31259 and previous config saved to /var/cache/conftool/dbconfig/20220718-055051-marostegui.json
[05:50:55] T311493: Productionize
db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[05:51:30] (PS1) Marostegui: db2166: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/814573 (https://phabricator.wikimedia.org/T311493)
[05:52:37] (CR) Marostegui: [C: +2] db2166: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/814573 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[06:15:46] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:17:28] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:21:04] RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T313070)', diff saved to https://phabricator.wikimedia.org/P31260 and previous config saved to /var/cache/conftool/dbconfig/20220718-062304-marostegui.json
[06:23:12] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070
[06:24:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[06:24:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[06:26:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[06:26:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[06:26:49] !log
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T312984)', diff saved to https://phabricator.wikimedia.org/P31261 and previous config saved to /var/cache/conftool/dbconfig/20220718-062648-ladsgroup.json
[06:26:53] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984
[06:27:06] PROBLEM - Check systemd state on an-worker1127 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,user-runtime-dir@116.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:31:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T312984)', diff saved to https://phabricator.wikimedia.org/P31262 and previous config saved to /var/cache/conftool/dbconfig/20220718-063155-ladsgroup.json
[06:32:01] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984
[06:38:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P31263 and previous config saved to /var/cache/conftool/dbconfig/20220718-063809-marostegui.json
[06:47:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P31264 and previous config saved to /var/cache/conftool/dbconfig/20220718-064700-ladsgroup.json
[06:49:05] (PS6) Ladsgroup: core.pp: Make sync_binlog and trx_commit configurable [puppet] - https://gerrit.wikimedia.org/r/813917 (owner: Marostegui)
[06:49:34] (CR) Ladsgroup: core.pp: Make sync_binlog and trx_commit configurable (1 comment) [puppet] - https://gerrit.wikimedia.org/r/813917 (owner: Marostegui)
[06:53:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P31265 and previous config saved
to /var/cache/conftool/dbconfig/20220718-065315-marostegui.json
[07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220718T0700).
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:24] you can self-serve I assume?
[07:00:50] * kart_ is here.
[07:00:57] Amir1: Self deploy :)
[07:01:58] (CR) KartikMistry: [C: +2] Enable Content and Section translation on WPs with NLLB-200 MT support [mediawiki-config] - https://gerrit.wikimedia.org/r/814015 (https://phabricator.wikimedia.org/T309384) (owner: KartikMistry)
[07:02:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P31266 and previous config saved to /var/cache/conftool/dbconfig/20220718-070205-ladsgroup.json
[07:02:50] (Merged) jenkins-bot: Enable Content and Section translation on WPs with NLLB-200 MT support [mediawiki-config] - https://gerrit.wikimedia.org/r/814015 (https://phabricator.wikimedia.org/T309384) (owner: KartikMistry)
[07:03:08] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:05:02] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:07:48] Change looks good. Deploying..
[07:07:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:08:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T313070)', diff saved to https://phabricator.wikimedia.org/P31267 and previous config saved to /var/cache/conftool/dbconfig/20220718-070820-marostegui.json
[07:08:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[07:08:24] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070
[07:08:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[07:08:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T313070)', diff saved to https://phabricator.wikimedia.org/P31268 and previous config saved to /var/cache/conftool/dbconfig/20220718-070840-marostegui.json
[07:08:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:08:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:09:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T313070)', diff saved to https://phabricator.wikimedia.org/P31269 and previous config saved to /var/cache/conftool/dbconfig/20220718-070946-marostegui.json
[07:09:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:10:55] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:814015|Enable Content and Section translation on WPs with NLLB-200 MT support (T309384)]] (duration: 02m 53s)
[07:11:00] T309384: Enable Content and Section translation on wikipedias with new MT support from Flores - https://phabricator.wikimedia.org/T309384
[07:11:13] I'm done.
[07:17:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T312984)', diff saved to https://phabricator.wikimedia.org/P31270 and previous config saved to /var/cache/conftool/dbconfig/20220718-071711-ladsgroup.json
[07:17:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[07:17:17] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984
[07:17:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[07:18:44] Amir1: Help needed. I can see config change I deployed showing in mwdebug1001 but not in Production. What can be possible reason(s)?
[07:19:05] kart_: forgot rebase?
[07:19:10] Amir1: ah. Cache. No worry.
[07:19:18] cool
[07:19:32] No. Works fine now. We do rebase and then only test on mwdebug, right?
[07:19:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:19:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:20:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:21:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2028.codfw.wmnet with OS bullseye
[07:21:07] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2028.codfw.wmnet with OS bullseye
[07:22:01] (PS2) Muehlenhoff: rancid: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/809626 (https://phabricator.wikimedia.org/T308013)
[07:22:06] !log ladsgroup@cumin1001 START - Cookbook
sre.hosts.downtime for 10:00:00 on db2103.codfw.wmnet with reason: Maintenance
[07:22:08] kart_: yeah
[07:22:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2103.codfw.wmnet with reason: Maintenance
[07:22:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 13 hosts with reason: Maintenance
[07:22:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 13 hosts with reason: Maintenance
[07:24:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:24:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:24:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[07:24:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P31271 and previous config saved to /var/cache/conftool/dbconfig/20220718-072451-marostegui.json
[07:25:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[07:26:02] (CR) Muehlenhoff: [C: +2] rancid: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/809626 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:26:52] (PS1) KartikMistry: Enable ContentTranslation out of Beta for ay, ilo, kg, ln, nso, and tn Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/814706 (https://phabricator.wikimedia.org/T309384)
[07:27:00] Amir1: I've one more followup patch, adding to the calendar.
[07:27:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[07:27:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[07:28:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:29:17] (CR) Muehlenhoff: [C: +2] "Thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/813595 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:29:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[07:29:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[07:29:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T312984)', diff saved to https://phabricator.wikimedia.org/P31272 and previous config saved to /var/cache/conftool/dbconfig/20220718-072953-ladsgroup.json
[07:29:58] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984
[07:30:21] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/813596 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:30:28] (CR) Filippo Giunchedi: [C: +2] prometheus: blackbox_exporter: remove un-managed module files [puppet] - https://gerrit.wikimedia.org/r/813654 (owner: Majavah)
[07:30:58] moritzm: merged your change too
[07:32:22] (CR) KartikMistry: [C: +2] Enable ContentTranslation out of Beta for ay, ilo, kg, ln, nso, and tn Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/814706 (https://phabricator.wikimedia.org/T309384) (owner: KartikMistry)
[07:32:35] (CR) Muehlenhoff: [C: +2] "Thanks!
Merging" [puppet] - 10https://gerrit.wikimedia.org/r/813597 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:32:54] godog: ack, thx [07:33:09] (03Merged) 10jenkins-bot: Enable ContentTranslation out of Beta for ay, ilo, kg, ln, nso, and tn Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814706 (https://phabricator.wikimedia.org/T309384) (owner: 10KartikMistry) [07:33:26] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/813274 (owner: 10David Caro) [07:33:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T312984)', diff saved to https://phabricator.wikimedia.org/P31273 and previous config saved to /var/cache/conftool/dbconfig/20220718-073359-ladsgroup.json [07:34:46] (03PS1) 10David Caro: kiwix: create dest dir before rsyncing if it does not exist [puppet] - 10https://gerrit.wikimedia.org/r/814707 [07:34:50] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/813598 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:35:04] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2028.codfw.wmnet with reason: host reimage [07:36:59] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/813599 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:37:25] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:38:23] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/813600 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:38:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:38:41] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! 
Merging" [puppet] - 10https://gerrit.wikimedia.org/r/813601 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:38:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2028.codfw.wmnet with reason: host reimage [07:39:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P31274 and previous config saved to /var/cache/conftool/dbconfig/20220718-073956-marostegui.json [07:40:16] (03PS2) 10David Caro: kiwix: create dest dir before rsyncing if it does not exist [puppet] - 10https://gerrit.wikimedia.org/r/814707 [07:40:23] (03PS3) 10David Caro: kiwix: create dest dir before rsyncing if it does not exist [puppet] - 10https://gerrit.wikimedia.org/r/814707 [07:40:24] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:814706|Enable ContentTranslation out of Beta for ay, ilo, kg, ln, nso, and tn Wikipedias (T309384)]] (duration: 02m 51s) [07:40:27] T309384: Enable Content and Section translation on wikipedias with new MT support from Flores - https://phabricator.wikimedia.org/T309384 [07:40:45] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/813602 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:41:08] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: make loki data directory configurable [puppet] - 10https://gerrit.wikimedia.org/r/813715 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [07:41:10] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:41:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:41:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:41:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "Very nice!" [puppet] - 10https://gerrit.wikimedia.org/r/813724 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [07:42:12] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: enable pipeline-managed index patterns [puppet] - 10https://gerrit.wikimedia.org/r/799001 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [07:42:36] Am I too late for deploying a config patch? [07:42:52] (03CR) 10Muehlenhoff: "There's also three manifests in modules/profile/manifests/bird, could you please also include them?" [puppet] - 10https://gerrit.wikimedia.org/r/813603 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:43:40] Amir1 urbanecm ^ [07:44:07] kostajh: nope, I think kart_ is done [07:44:35] * urbanecm waves [07:45:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:45:53] (03PS1) 10Kosta Harlan: Structured task: Disable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814708 (https://phabricator.wikimedia.org/T304099) [07:45:56] cool, will add to calendar and deploy, then [07:46:20] kostajh: yes. Go ahead. 
[07:46:56] (03CR) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:47:22] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2028.codfw.wmnet with OS bullseye [07:47:25] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2028.codfw.wmnet with OS bullseye executed with errors: - ganeti2028 (**FAIL**) - D... [07:47:59] urbanecm: if you're around, can you glance at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/814708 ? [07:48:18] Sure [07:48:41] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814708 (https://phabricator.wikimedia.org/T304099) (owner: 10Kosta Harlan) [07:48:43] 10SRE-swift-storage, 10User-fgiunchedi: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (10fgiunchedi) 05Open→03Stalled Space is freed now, and we are at ~73% bytes used overall. I'll stall the task and check back in 45/50 days to assess the situation again and act accordingly [07:48:49] Patch looks good [07:49:03] (03CR) 10Kosta Harlan: [C: 03+2] Structured task: Disable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814708 (https://phabricator.wikimedia.org/T304099) (owner: 10Kosta Harlan) [07:49:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P31275 and previous config saved to /var/cache/conftool/dbconfig/20220718-074904-ladsgroup.json [07:49:07] urbanecm: thanks! 
[07:49:16] Np [07:49:59] (03Merged) 10jenkins-bot: Structured task: Disable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814708 (https://phabricator.wikimedia.org/T304099) (owner: 10Kosta Harlan) [07:50:56] (03CR) 10David Caro: wmcs: vps: remove_instance: add support for puppet deactivation (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801784 (owner: 10Majavah) [07:51:22] (03PS1) 10Ladsgroup: admin: Revoke foks' production access temporarily [puppet] - 10https://gerrit.wikimedia.org/r/814709 [07:54:23] !log kharlan@deploy1002 Synchronized wmf-config: Config: [[gerrit:814708|Structured task: Disable free text for "other" rejection reason (T304099)]] (duration: 02m 41s) [07:54:28] T304099: Structured tasks: temporary free text for "other" rejection reason - https://phabricator.wikimedia.org/T304099 [07:54:52] ok, I'm done [07:55:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T313070)', diff saved to https://phabricator.wikimedia.org/P31276 and previous config saved to /var/cache/conftool/dbconfig/20220718-075501-marostegui.json [07:55:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1167.eqiad.wmnet with reason: Maintenance [07:55:06] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070 [07:55:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1167.eqiad.wmnet with reason: Maintenance [07:55:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:55:23] (03CR) 10ArielGlenn: "This looks ok to me, though I've not tested it. Two questions: do we really want to log every time we find it still running? 
Would someone" [puppet] - 10https://gerrit.wikimedia.org/r/814707 (owner: 10David Caro) [07:55:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:55:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:55:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T313070)', diff saved to https://phabricator.wikimedia.org/P31277 and previous config saved to /var/cache/conftool/dbconfig/20220718-075527-marostegui.json [07:55:31] (03CR) 10David Caro: wmcs: toolforge: add a cookbook to remove a grid node (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801785 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [07:56:21] (03CR) 10David Caro: "LGTM, let me know if you wont to merge as is and I'll +2, thanks!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801784 (owner: 10Majavah) [07:56:32] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Revoke foks' production access temporarily [puppet] - 10https://gerrit.wikimedia.org/r/814709 (owner: 10Ladsgroup) [07:56:36] (03PS2) 10Ladsgroup: admin: Revoke foks' production access temporarily [puppet] - 10https://gerrit.wikimedia.org/r/814709 [07:56:39] (03CR) 10Ladsgroup: [V: 03+2] admin: Revoke foks' production access temporarily [puppet] - 10https://gerrit.wikimedia.org/r/814709 (owner: 10Ladsgroup) [07:57:54] huh? 
[07:58:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:58:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:59:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:00:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [08:00:35] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti2028.codfw.wmnet [08:00:49] (03CR) 10Filippo Giunchedi: "Thank you Mark for tackling this!" [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma) [08:04:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P31278 and previous config saved to /var/cache/conftool/dbconfig/20220718-080409-ladsgroup.json [08:07:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T313070)', diff saved to https://phabricator.wikimedia.org/P31279 and previous config saved to /var/cache/conftool/dbconfig/20220718-080735-marostegui.json [08:07:40] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070 [08:09:11] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:09:17] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Leila - https://phabricator.wikimedia.org/T313134 (10Joe) 05Open→03Resolved p:05Triage→03Medium a:03Joe Hi @leila I'm a bit surprised you're not in the wmf ldap group! I found two developer accounts that are linked to your email @wikimedia.org.n... 
[08:10:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [08:10:06] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti2028.codfw.wmnet [08:11:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [08:11:22] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti2028.codfw.wmnet [08:12:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [08:12:21] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti2028.codfw.wmnet [08:13:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2012.codfw.wmnet with OS bullseye [08:13:11] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2012.codfw.wmnet with OS bullseye [08:15:02] 10SRE-swift-storage: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102 (10tstarling) p:05Medium→03High Increasing priority to high since it's an accident waiting to happen. 
[08:19:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T312984)', diff saved to https://phabricator.wikimedia.org/P31280 and previous config saved to /var/cache/conftool/dbconfig/20220718-081914-ladsgroup.json [08:19:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1119.eqiad.wmnet with reason: Maintenance [08:19:20] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [08:19:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1119.eqiad.wmnet with reason: Maintenance [08:19:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T312984)', diff saved to https://phabricator.wikimedia.org/P31281 and previous config saved to /var/cache/conftool/dbconfig/20220718-081934-ladsgroup.json [08:22:33] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:22:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P31282 and previous config saved to /var/cache/conftool/dbconfig/20220718-082241-marostegui.json [08:23:29] (03PS2) 10Zabe: bird: Add SPDX headers to bird profile [puppet] - 10https://gerrit.wikimedia.org/r/813603 (https://phabricator.wikimedia.org/T308013) [08:23:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T312984)', diff saved to https://phabricator.wikimedia.org/P31283 and previous config saved to /var/cache/conftool/dbconfig/20220718-082342-ladsgroup.json [08:24:59] (03CR) 10Zabe: bird: Add SPDX headers to bird profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813603 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) 
[08:27:57] (03PS3) 10Majavah: dynamicproxy: urlproxy: add a simple rate limit [puppet] - 10https://gerrit.wikimedia.org/r/814193 (https://phabricator.wikimedia.org/T313131) [08:28:31] (03CR) 10Majavah: dynamicproxy: urlproxy: add a simple rate limit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814193 (https://phabricator.wikimedia.org/T313131) (owner: 10Majavah) [08:29:57] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2012.codfw.wmnet with reason: host reimage [08:30:19] (03CR) 10Majavah: wmcs: vps: remove_instance: add support for puppet deactivation (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801784 (owner: 10Majavah) [08:33:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2012.codfw.wmnet with reason: host reimage [08:37:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P31284 and previous config saved to /var/cache/conftool/dbconfig/20220718-083746-marostegui.json [08:38:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P31285 and previous config saved to /var/cache/conftool/dbconfig/20220718-083847-ladsgroup.json [08:41:27] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/813603 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:42:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2012.codfw.wmnet with OS bullseye [08:43:03] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2012.codfw.wmnet with OS bullseye executed with errors: - ganeti2012 (**FAIL**) - D... 
[08:45:38] (03CR) 10JMeybohm: [C: 03+2] Actually run tests on type: php scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/813843 (owner: 10JMeybohm) [08:48:27] (03PS1) 10Elukey: ml-services: update Docker image for editquality goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/814718 (https://phabricator.wikimedia.org/T301878) [08:48:43] (03CR) 10David Caro: [C: 03+2] wmcs: vps: remove_instance: add support for puppet deactivation (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801784 (owner: 10Majavah) [08:48:51] (03CR) 10David Caro: [C: 03+2] wmcs: toolforge: add a cookbook to remove a grid node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801785 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [08:49:24] (03Merged) 10jenkins-bot: Actually run tests on type: php scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/813843 (owner: 10JMeybohm) [08:52:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T313070)', diff saved to https://phabricator.wikimedia.org/P31286 and previous config saved to /var/cache/conftool/dbconfig/20220718-085251-marostegui.json [08:52:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1109.eqiad.wmnet with reason: Maintenance [08:52:57] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070 [08:53:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1109.eqiad.wmnet with reason: Maintenance [08:53:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1109 (T313070)', diff saved to https://phabricator.wikimedia.org/P31287 and previous config saved to /var/cache/conftool/dbconfig/20220718-085312-marostegui.json [08:53:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to 
https://phabricator.wikimedia.org/P31288 and previous config saved to /var/cache/conftool/dbconfig/20220718-085352-ladsgroup.json [08:55:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T313070)', diff saved to https://phabricator.wikimedia.org/P31289 and previous config saved to /var/cache/conftool/dbconfig/20220718-085518-marostegui.json [08:55:38] (03Merged) 10jenkins-bot: wmcs: vps: remove_instance: add support for puppet deactivation [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801784 (owner: 10Majavah) [08:55:40] (03Merged) 10jenkins-bot: wmcs: toolforge: add a cookbook to remove a grid node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801785 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [08:56:11] (03CR) 10Elukey: [C: 03+2] ml-services: update Docker image for editquality goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/814718 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [08:56:19] (03PS1) 10Filippo Giunchedi: pontoon: retry apt in provision.sh [puppet] - 10https://gerrit.wikimedia.org/r/814719 [08:56:21] (03PS1) 10Filippo Giunchedi: pontoon: validate host fqdn during bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/814720 [08:56:23] (03PS1) 10Filippo Giunchedi: pontoon: support to set/override domain during provisioning [puppet] - 10https://gerrit.wikimedia.org/r/814721 [08:58:30] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[08:59:59] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/814193 (https://phabricator.wikimedia.org/T313131) (owner: 10Majavah) [09:03:07] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:05:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [09:05:10] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti2028.codfw.wmnet [09:08:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T312984)', diff saved to https://phabricator.wikimedia.org/P31290 and previous config saved to /var/cache/conftool/dbconfig/20220718-090857-ladsgroup.json [09:09:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1099.eqiad.wmnet with reason: Maintenance [09:09:07] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [09:09:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1099.eqiad.wmnet with reason: Maintenance [09:09:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T312984)', diff saved to https://phabricator.wikimedia.org/P31291 and previous config saved to /var/cache/conftool/dbconfig/20220718-090919-ladsgroup.json [09:10:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P31292 and previous config saved to /var/cache/conftool/dbconfig/20220718-091023-marostegui.json [09:10:49] (03PS1) 10Majavah: P:toolforge::proxy: raise rate limit + add hiera config [puppet] - 10https://gerrit.wikimedia.org/r/814722 (https://phabricator.wikimedia.org/T313131) [09:12:21] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/814722 (https://phabricator.wikimedia.org/T313131) (owner: 10Majavah) [09:13:31] * urbanecm staging at mwdebug1001 [09:13:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T312984)', diff saved to https://phabricator.wikimedia.org/P31293 and previous config saved to /var/cache/conftool/dbconfig/20220718-091340-ladsgroup.json [09:14:09] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:15:20] * urbanecm done [09:17:56] (03CR) 10David Caro: kiwix: create dest dir before rsyncing if it does not exist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814707 (owner: 10David Caro) [09:18:51] (03PS4) 10David Caro: kiwix: create dest dir before rsyncing if it does not exist [puppet] - 10https://gerrit.wikimedia.org/r/814707 [09:19:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111 T311106', diff saved to https://phabricator.wikimedia.org/P31295 and previous config saved to /var/cache/conftool/dbconfig/20220718-091957-root.json [09:20:01] T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 [09:21:14] (03CR) 10ArielGlenn: [C: 03+1] "Giving my thumbs up (but note I have not tested it)." 
[puppet] - 10https://gerrit.wikimedia.org/r/814707 (owner: 10David Caro) [09:21:23] (03CR) 10JMeybohm: [C: 03+1] mtail: fix regexes due to changes in apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/811934 (https://phabricator.wikimedia.org/T312634) (owner: 10Giuseppe Lavagetto) [09:25:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P31297 and previous config saved to /var/cache/conftool/dbconfig/20220718-092528-marostegui.json [09:27:45] (03CR) 10Elukey: [C: 03+1] "The config looks good, maybe let's ask to Moritz if the choice of the cumin aliases is ok or not." [puppet] - 10https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: 10Btullis) [09:28:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: 10Btullis) [09:28:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P31298 and previous config saved to /var/cache/conftool/dbconfig/20220718-092845-ladsgroup.json [09:33:47] PROBLEM - grafana-next.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1687 bytes in 0.149 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [09:34:43] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10MoritzMuehlenhoff) This seems to happen again, today's reimages of ganeti2012 and ganeti2028 failed since the host key change didn't get properl... 
[09:35:10] (03PS1) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) [09:36:04] (03PS1) 10David Caro: rabbit.drain_queue: Don't fail if the queue has no messages [puppet] - 10https://gerrit.wikimedia.org/r/814726 [09:36:10] (03PS1) 10Majavah: dynamicproxy: urlproxy: enable bursting in rate limits [puppet] - 10https://gerrit.wikimedia.org/r/814727 (https://phabricator.wikimedia.org/T313131) [09:36:13] RECOVERY - grafana-next.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 115218 bytes in 0.311 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [09:37:52] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10User-jbond: fetch_external_clouds_vendors_nets.py fails to update DigitalOcean network ranges - https://phabricator.wikimedia.org/T313206 (10Vgutierrez) [09:38:12] (03CR) 10David Caro: kiwix: create dest dir before rsyncing if it does not exist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814707 (owner: 10David Caro) [09:38:38] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10User-jbond: fetch_external_clouds_vendors_nets.py fails to update DigitalOcean network ranges - https://phabricator.wikimedia.org/T313206 (10Vgutierrez) p:05Triage→03Medium [09:39:40] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/814727 (https://phabricator.wikimedia.org/T313131) (owner: 10Majavah) [09:40:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T313070)', diff saved to https://phabricator.wikimedia.org/P31299 and previous config saved to /var/cache/conftool/dbconfig/20220718-094033-marostegui.json [09:40:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1099.eqiad.wmnet with reason: Maintenance [09:40:38] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on db1099.eqiad.wmnet with reason: Maintenance [09:40:39] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070 [09:40:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T313070)', diff saved to https://phabricator.wikimedia.org/P31300 and previous config saved to /var/cache/conftool/dbconfig/20220718-094043-marostegui.json [09:41:07] ACKNOWLEDGEMENT - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service Valentin Gutierrez T313206 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:41:07] ACKNOWLEDGEMENT - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service Valentin Gutierrez T313206 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:41:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T313070)', diff saved to https://phabricator.wikimedia.org/P31301 and previous config saved to /var/cache/conftool/dbconfig/20220718-094150-marostegui.json [09:42:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mtail: fix regexes due to changes in apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/811934 (https://phabricator.wikimedia.org/T312634) (owner: 10Giuseppe Lavagetto) [09:42:18] (03CR) 10ArielGlenn: [C: 03+1] kiwix: create dest dir before rsyncing if it does not exist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814707 (owner: 10David Caro) [09:43:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P31302 and previous config saved to /var/cache/conftool/dbconfig/20220718-094351-ladsgroup.json [09:45:36] (03CR) 10David Caro: kiwix: create dest dir before rsyncing if it does not exist 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814707 (owner: 10David Caro) [09:45:47] (03CR) 10Hashar: "recheck" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [09:46:26] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10Volans) >>! In T296832#8065318, @cmooney wrote: > @volans could you point me at any existing custom_facts and the cod... [09:52:56] (03CR) 10Volans: [C: 03+1] "Code looks nicer and simpler and I like the description diff that mentions the actual name. But I'll leave it to Cathal and you to decide " [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/805898 (https://phabricator.wikimedia.org/T310591) (owner: 10Ayounsi) [09:53:17] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:53:51] (03PS1) 10Ladsgroup: Add change_templatelinks_pk.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/814729 (https://phabricator.wikimedia.org/T312863) [09:56:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P31303 and previous config saved to /var/cache/conftool/dbconfig/20220718-095656-marostegui.json [09:57:24] (03CR) 10Marostegui: [C: 03+1] Add change_templatelinks_pk.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/814729 (https://phabricator.wikimedia.org/T312863) (owner: 10Ladsgroup) [09:57:40] (03CR) 10Ladsgroup: [C: 03+2] Add change_templatelinks_pk.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/814729 (https://phabricator.wikimedia.org/T312863) (owner: 10Ladsgroup) [09:58:57] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T312984)', diff saved to https://phabricator.wikimedia.org/P31304 and previous config saved to /var/cache/conftool/dbconfig/20220718-095856-ladsgroup.json [09:58:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance [09:59:01] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [09:59:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance [09:59:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T312984)', diff saved to https://phabricator.wikimedia.org/P31305 and previous config saved to /var/cache/conftool/dbconfig/20220718-095916-ladsgroup.json [10:00:14] (03Merged) 10jenkins-bot: Add change_templatelinks_pk.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/814729 (https://phabricator.wikimedia.org/T312863) (owner: 10Ladsgroup) [10:00:19] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:03:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T312984)', diff saved to https://phabricator.wikimedia.org/P31306 and previous config saved to /var/cache/conftool/dbconfig/20220718-100329-ladsgroup.json [10:06:04] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] wmf-netbox: simplify interface description for circuits [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/805898 (https://phabricator.wikimedia.org/T310591) (owner: 10Ayounsi) [10:07:08] (03PS1) 10Majavah: P:prometheus:openstack_exporter: disable slow metrics [puppet] - 10https://gerrit.wikimedia.org/r/814738 [10:09:06] (03CR) 10Majavah: wmcs: Add 
novafullstack alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/813274 (owner: David Caro)
[10:11:55] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:12:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P31307 and previous config saved to /var/cache/conftool/dbconfig/20220718-101201-marostegui.json
[10:16:49] PROBLEM - grafana-next.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1687 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[10:17:20] uh :)
[10:18:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P31308 and previous config saved to /var/cache/conftool/dbconfig/20220718-101834-ladsgroup.json
[10:19:11] (CR) Volans: [C: +1] "LGTM, minor nit inline."
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) (owner: 10Ayounsi) [10:20:15] (03CR) 10David Caro: [C: 03+2] wmcs: Add novafullstack alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/813274 (owner: 10David Caro) [10:21:47] RECOVERY - grafana-next.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 115218 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:22:54] (03Merged) 10jenkins-bot: wmcs: Add novafullstack alerts [alerts] - 10https://gerrit.wikimedia.org/r/813274 (owner: 10David Caro) [10:23:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [10:23:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [10:23:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [10:24:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [10:25:23] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/814738 (owner: 10Majavah) [10:26:09] !log dbmaint on s5@codfw (T312863) [10:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:13] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [10:26:24] !log dbmaint on s5@eqiad (T312863) [10:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:41] I forgot cebwiki is on s5, 200GB table is being altered now [10:27:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T313070)', diff saved to https://phabricator.wikimedia.org/P31310 and previous config saved to 
/var/cache/conftool/dbconfig/20220718-102706-marostegui.json [10:27:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1114.eqiad.wmnet with reason: Maintenance [10:27:14] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070 [10:27:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1114.eqiad.wmnet with reason: Maintenance [10:27:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T313070)', diff saved to https://phabricator.wikimedia.org/P31311 and previous config saved to /var/cache/conftool/dbconfig/20220718-102726-marostegui.json [10:28:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T313070)', diff saved to https://phabricator.wikimedia.org/P31312 and previous config saved to /var/cache/conftool/dbconfig/20220718-102832-marostegui.json [10:29:11] PROBLEM - grafana-next.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1687 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:29:20] (03PS3) 10Ayounsi: Add parent support for servers interfaces creation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) [10:29:34] (03CR) 10Ayounsi: "Thanks!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) (owner: 10Ayounsi) [10:30:58] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [10:31:41] RECOVERY - grafana-next.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 115218 bytes in 0.241 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:33:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P31313 and previous config saved to /var/cache/conftool/dbconfig/20220718-103339-ladsgroup.json [10:33:42] (03PS1) 10Ayounsi: Remove test_juniper_inventory_descs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/814742 (https://phabricator.wikimedia.org/T305126) [10:34:29] (03CR) 10CI reject: [V: 04-1] Remove test_juniper_inventory_descs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/814742 (https://phabricator.wikimedia.org/T305126) (owner: 10Ayounsi) [10:35:15] (03CR) 10Ayounsi: [C: 03+2] Add parent support for servers interfaces creation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) (owner: 10Ayounsi) [10:36:19] (03Merged) 10jenkins-bot: Add parent support for servers interfaces creation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) (owner: 10Ayounsi) [10:43:26] (03PS2) 10Muehlenhoff: calico: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809624 (https://phabricator.wikimedia.org/T308013) [10:43:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P31314 and previous config saved to /var/cache/conftool/dbconfig/20220718-104337-marostegui.json [10:46:37] PROBLEM - grafana-next.wikimedia.org on grafana2001 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 500 Internal Server Error - 1687 bytes in 0.149 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:46:52] (03CR) 10Muehlenhoff: [C: 03+2] calico: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809624 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:48:40] !log disable puppet fleet wide to resync db [10:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T312984)', diff saved to https://phabricator.wikimedia.org/P31315 and previous config saved to /var/cache/conftool/dbconfig/20220718-104844-ladsgroup.json [10:48:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1106.eqiad.wmnet with reason: Maintenance [10:48:49] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [10:49:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1106.eqiad.wmnet with reason: Maintenance [10:49:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:49:07] RECOVERY - grafana-next.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 115218 bytes in 0.239 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:49:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:49:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T312984)', diff saved to https://phabricator.wikimedia.org/P31316 and previous config saved to /var/cache/conftool/dbconfig/20220718-104921-ladsgroup.json [10:51:03] (03PS2) 10Ayounsi: Remove 
test_juniper_inventory_descs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/814742 (https://phabricator.wikimedia.org/T305126) [10:51:06] (03CR) 10Muehlenhoff: [C: 03+2] build_envoy_deb.sh: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/812294 (owner: 10Muehlenhoff) [10:52:55] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809616 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:54:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312984)', diff saved to https://phabricator.wikimedia.org/P31317 and previous config saved to /var/cache/conftool/dbconfig/20220718-105411-ladsgroup.json [10:54:16] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [10:54:22] (03CR) 10Ayounsi: [C: 03+2] "Self merging to clear the Netbox report. Feel free to do a post merge review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/814742 (https://phabricator.wikimedia.org/T305126) (owner: 10Ayounsi) [10:55:08] (03Merged) 10jenkins-bot: Remove test_juniper_inventory_descs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/814742 (https://phabricator.wikimedia.org/T305126) (owner: 10Ayounsi) [10:55:17] RECOVERY - puppet last run on an-worker1127 is OK: OK: Puppet is currently disabled (re-sync postgres), not alerting. Last run 13 hours ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:56:06] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) >>! In T263578#8083691, @MoritzMuehlenhoff wrote: > This seems to happen again, today's reimages of ganeti2012 and ganeti2028 failed sinc... 
[10:56:35] PROBLEM - grafana-next.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1688 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[10:57:14] (CR) Volans: [C: +1] "post-merge +1" [software/netbox-extras] - https://gerrit.wikimedia.org/r/814742 (https://phabricator.wikimedia.org/T305126) (owner: Ayounsi)
[10:58:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P31318 and previous config saved to /var/cache/conftool/dbconfig/20220718-105843-marostegui.json
[11:00:34] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (Volans) For the record, the last state change of the Icinga alert that is alerting since then is `Last State Change: 2022-07-12 16:11:29`, re...
[11:01:35] RECOVERY - grafana-next.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 115218 bytes in 0.241 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[11:03:42] (CR) Volans: [C: +1] "Looks ok to me too."
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/813589 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [11:08:28] (03PS2) 10Ayounsi: Interface description: handle patch panels properly [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/813589 (https://phabricator.wikimedia.org/T304710) [11:09:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P31319 and previous config saved to /var/cache/conftool/dbconfig/20220718-110916-ladsgroup.json [11:10:01] (03PS3) 10Ayounsi: Interface description: handle patch panels properly [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/813589 (https://phabricator.wikimedia.org/T304710) [11:10:30] (03CR) 10Ayounsi: [C: 03+2] "Thanks!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/813589 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [11:10:47] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Interface description: handle patch panels properly [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/813589 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [11:11:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 99 probes of 675 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:11:59] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 127 probes of 675 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:12:37] (03CR) 10Volans: "Unit tests needs adapting to cover the new code." 
[software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [11:12:49] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 116 probes of 684 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:13:13] (03PS1) 10Marostegui: db2085: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/814752 (https://phabricator.wikimedia.org/T311493) [11:13:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T313070)', diff saved to https://phabricator.wikimedia.org/P31322 and previous config saved to /var/cache/conftool/dbconfig/20220718-111348-marostegui.json [11:13:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:13:55] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070 [11:14:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:14:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T313070)', diff saved to https://phabricator.wikimedia.org/P31323 and previous config saved to /var/cache/conftool/dbconfig/20220718-111409-marostegui.json [11:14:48] (03CR) 10Marostegui: [C: 03+2] db2085: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/814752 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [11:15:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T313070)', diff saved to https://phabricator.wikimedia.org/P31324 and previous config saved to /var/cache/conftool/dbconfig/20220718-111515-marostegui.json [11:15:19] PROBLEM - IPv4 ping to codfw on 
ripe-atlas-codfw is CRITICAL: CRITICAL - failed 36 probes of 768 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:16:13] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:16:27] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 275 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [11:18:08] (03PS1) 10Marostegui: install_server: Do not reimage db216[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/814760 (https://phabricator.wikimedia.org/T311493) [11:18:19] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 89 probes of 675 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:19:03] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2002 is OK: HTTP OK: HTTP/1.1 200 OK - 67741 bytes in 3.322 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [11:19:06] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db216[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/814760 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [11:19:23] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 239 probes of 682 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:21:33] PROBLEM - grafana-next.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal 
Server Error - 1687 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [11:21:47] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 15 probes of 768 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:24:03] RECOVERY - grafana-next.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 115218 bytes in 0.235 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [11:24:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P31325 and previous config saved to /var/cache/conftool/dbconfig/20220718-112422-ladsgroup.json [11:24:57] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 83 probes of 675 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:25:06] !log re-enable puppet post postgresql re-sync [11:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:49] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 68 probes of 684 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:30:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P31326 and previous config saved to /var/cache/conftool/dbconfig/20220718-113020-marostegui.json [11:32:41] PROBLEM - puppet last run on an-worker1127 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun 
[11:32:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2028.codfw.wmnet with OS bullseye [11:33:06] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2028.codfw.wmnet with OS bullseye [11:37:32] (03PS1) 10Marostegui: mariadb: Productionize db2167 [puppet] - 10https://gerrit.wikimedia.org/r/814765 (https://phabricator.wikimedia.org/T311493) [11:39:10] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) i have re-synced puppetdb, however we need to prevent this from happening again. It seems we can increase the wal_keep_size but it may b... [11:39:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312984)', diff saved to https://phabricator.wikimedia.org/P31327 and previous config saved to /var/cache/conftool/dbconfig/20220718-113927-ladsgroup.json [11:39:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1134.eqiad.wmnet with reason: Maintenance [11:39:33] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [11:39:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1134.eqiad.wmnet with reason: Maintenance [11:39:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T312984)', diff saved to https://phabricator.wikimedia.org/P31328 and previous config saved to /var/cache/conftool/dbconfig/20220718-113947-ladsgroup.json [11:39:55] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2167 [puppet] - 10https://gerrit.wikimedia.org/r/814765 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [11:44:54] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T312984)', diff saved to https://phabricator.wikimedia.org/P31329 and previous config saved to /var/cache/conftool/dbconfig/20220718-114454-ladsgroup.json [11:45:01] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [11:45:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P31330 and previous config saved to /var/cache/conftool/dbconfig/20220718-114525-marostegui.json [11:46:13] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:46:29] PROBLEM - grafana-next.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1687 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [11:47:28] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2028.codfw.wmnet with reason: host reimage [11:50:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2028.codfw.wmnet with reason: host reimage [11:54:21] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (25) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe20 [11:54:21] ://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [11:58:27] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb postgress: Improve postgress standby server - 
https://phabricator.wikimedia.org/T313217 (jbond)
[11:58:34] Puppet, Infrastructure-Foundations, User-jbond: puppetdb postgress: Improve postgress standby server - https://phabricator.wikimedia.org/T313217 (jbond) p:Triage→Medium
[12:00:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P31331 and previous config saved to /var/cache/conftool/dbconfig/20220718-115959-ladsgroup.json
[12:00:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T313070)', diff saved to https://phabricator.wikimedia.org/P31332 and previous config saved to /var/cache/conftool/dbconfig/20220718-120030-marostegui.json
[12:00:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[12:00:36] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070
[12:00:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[12:00:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T313070)', diff saved to https://phabricator.wikimedia.org/P31333 and previous config saved to /var/cache/conftool/dbconfig/20220718-120051-marostegui.json
[12:01:22] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[12:01:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T313070)', diff saved to https://phabricator.wikimedia.org/P31334 and previous config saved to /var/cache/conftool/dbconfig/20220718-120157-marostegui.json
[12:02:30] (PS1) Sergio Gimeno: Mentorship: enable the Vue version of the dashboard in test
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/814789 (https://phabricator.wikimedia.org/T300532) [12:03:38] PROBLEM - Check systemd state on thanos-be2002 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2028.codfw.wmnet with OS bullseye [12:04:46] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2028.codfw.wmnet with OS bullseye completed: - ganeti2028 (**PASS**) - Downtimed on... [12:04:51] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: retry apt in provision.sh [puppet] - 10https://gerrit.wikimedia.org/r/814719 (owner: 10Filippo Giunchedi) [12:04:57] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: validate host fqdn during bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/814720 (owner: 10Filippo Giunchedi) [12:05:00] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: support to set/override domain during provisioning [puppet] - 10https://gerrit.wikimedia.org/r/814721 (owner: 10Filippo Giunchedi) [12:05:23] (03PS2) 10Filippo Giunchedi: pontoon: validate host fqdn during bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/814720 [12:05:26] (03CR) 10Filippo Giunchedi: [V: 03+2] pontoon: validate host fqdn during bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/814720 (owner: 10Filippo Giunchedi) [12:05:41] (03PS2) 10Filippo Giunchedi: pontoon: support to set/override domain during provisioning [puppet] - 10https://gerrit.wikimedia.org/r/814721 [12:05:43] (03CR) 10Filippo Giunchedi: [V: 03+2] pontoon: support to set/override domain during provisioning [puppet] - 10https://gerrit.wikimedia.org/r/814721 (owner: 10Filippo Giunchedi) [12:09:58] (03PS1) 10Filippo Giunchedi: aptrepo: 
upgrade to grafana 8.5 [puppet] - 10https://gerrit.wikimedia.org/r/814791 [12:10:11] I'm seeking reviewers for an easy one ^ [12:10:26] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 55 probes of 682 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:11:36] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:12:35] (03PS1) 10Filippo Giunchedi: smokeping: remove sampled hosts, probed by Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/814792 (https://phabricator.wikimedia.org/T169860) [12:12:56] 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10Aklapper) @soworu: For future reference, please add project tags so someone could find this task, and please follow the instructions pointing to https://phabricator.wikimedia.org/proj... [12:13:00] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10MoritzMuehlenhoff) I retried the ganeti2028 reimage and everything works fine again, thanks! 
[12:13:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2012.codfw.wmnet with OS bullseye [12:13:20] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2012.codfw.wmnet with OS bullseye [12:14:24] 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10Aklapper) [12:14:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/814791 (owner: 10Filippo Giunchedi) [12:14:54] RECOVERY - grafana-next.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 115233 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [12:15:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P31335 and previous config saved to /var/cache/conftool/dbconfig/20220718-121504-ladsgroup.json [12:16:12] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: upgrade to grafana 8.5 [puppet] - 10https://gerrit.wikimedia.org/r/814791 (owner: 10Filippo Giunchedi) [12:17:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P31336 and previous config saved to /var/cache/conftool/dbconfig/20220718-121702-marostegui.json [12:19:39] 10SRE, 10Cloud-Services: Update Grafana on cloudmetrics* to 8.x - https://phabricator.wikimedia.org/T313219 (10MoritzMuehlenhoff) [12:20:10] (03PS1) 10Filippo Giunchedi: aptrepo: actually update to Grafana 8.5 [puppet] - 10https://gerrit.wikimedia.org/r/814794 [12:20:44] (03CR) 10CI reject: [V: 04-1] aptrepo: actually update to Grafana 8.5 [puppet] - 10https://gerrit.wikimedia.org/r/814794 (owner: 10Filippo Giunchedi) [12:21:44] (03PS2) 10Filippo Giunchedi: 
aptrepo: actually update to Grafana 8.5 [puppet] - 10https://gerrit.wikimedia.org/r/814794 [12:22:04] 10SRE, 10Cloud-Services: Update Grafana on cloudmetrics* to 8.x - https://phabricator.wikimedia.org/T313219 (10MoritzMuehlenhoff) 05Open→03Invalid Never mind, I missed that only cloudmetrics1001/1002 are running 7.x (which are only using role::spare::system, so possibly up for decom). [12:22:26] (03CR) 10Muehlenhoff: [C: 03+1] aptrepo: actually update to Grafana 8.5 [puppet] - 10https://gerrit.wikimedia.org/r/814794 (owner: 10Filippo Giunchedi) [12:23:25] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: actually update to Grafana 8.5 [puppet] - 10https://gerrit.wikimedia.org/r/814794 (owner: 10Filippo Giunchedi) [12:26:13] (03CR) 10Jbond: sre.hardware.dell: create new cookbook for updating idrac and bios (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [12:29:28] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2012.codfw.wmnet with reason: host reimage [12:30:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T312984)', diff saved to https://phabricator.wikimedia.org/P31337 and previous config saved to /var/cache/conftool/dbconfig/20220718-123009-ladsgroup.json [12:30:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1169.eqiad.wmnet with reason: Maintenance [12:30:13] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [12:30:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1169.eqiad.wmnet with reason: Maintenance [12:30:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T312984)', diff saved to https://phabricator.wikimedia.org/P31338 and previous config saved to /var/cache/conftool/dbconfig/20220718-123029-ladsgroup.json [12:32:07] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P31339 and previous config saved to /var/cache/conftool/dbconfig/20220718-123207-marostegui.json [12:33:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2012.codfw.wmnet with reason: host reimage [12:34:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T312984)', diff saved to https://phabricator.wikimedia.org/P31340 and previous config saved to /var/cache/conftool/dbconfig/20220718-123433-ladsgroup.json [12:35:39] !log update grafana to 8.5.9 [12:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:41] (03PS6) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) [12:38:05] (03PS1) 10Filippo Giunchedi: aptrepo: upgrade Grafana to 8.5 (#3) [puppet] - 10https://gerrit.wikimedia.org/r/814796 [12:38:58] (03PS2) 10Filippo Giunchedi: aptrepo: upgrade Grafana to 8.5 (#3) [puppet] - 10https://gerrit.wikimedia.org/r/814796 [12:40:43] (03PS1) 10Matthias Mullie: Use getOption to detect user preferences [extensions/ImageSuggestions] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/814767 (https://phabricator.wikimedia.org/T313209) [12:40:52] (03CR) 10Matthias Mullie: [C: 03+1] Use getOption to detect user preferences [extensions/ImageSuggestions] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/814767 (https://phabricator.wikimedia.org/T313209) (owner: 10Matthias Mullie) [12:41:08] (03CR) 10CI reject: [V: 04-1] Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [12:41:28] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: upgrade Grafana to 8.5 (#3) [puppet] - 10https://gerrit.wikimedia.org/r/814796 (owner: 
10Filippo Giunchedi) [12:44:40] (03PS7) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) [12:46:57] (03CR) 10Ayounsi: [C: 03+1] smokeping: remove sampled hosts, probed by Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/814792 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:47:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T313070)', diff saved to https://phabricator.wikimedia.org/P31341 and previous config saved to /var/cache/conftool/dbconfig/20220718-124712-marostegui.json [12:47:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1177.eqiad.wmnet with reason: Maintenance [12:47:18] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070 [12:47:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1177.eqiad.wmnet with reason: Maintenance [12:47:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T313070)', diff saved to https://phabricator.wikimedia.org/P31342 and previous config saved to /var/cache/conftool/dbconfig/20220718-124732-marostegui.json [12:47:37] (03PS1) 10David Caro: wmcs.novafullstack: Remove nrpe checks [puppet] - 10https://gerrit.wikimedia.org/r/814798 [12:48:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T313070)', diff saved to https://phabricator.wikimedia.org/P31343 and previous config saved to /var/cache/conftool/dbconfig/20220718-124838-marostegui.json [12:49:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2012.codfw.wmnet with OS bullseye [12:49:13] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2012.codfw.wmnet with OS bullseye completed: - ganeti2012 (**PASS**) - Downtimed on... [12:49:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P31344 and previous config saved to /var/cache/conftool/dbconfig/20220718-124938-ladsgroup.json [12:51:09] (03CR) 10Kosta Harlan: [C: 03+1] Mentorship: enable the Vue version of the dashboard in test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814789 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [12:51:57] (03CR) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [12:56:52] (03CR) 10Filippo Giunchedi: [C: 03+2] smokeping: remove sampled hosts, probed by Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/814792 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:59:17] (03PS2) 10David Caro: wmcs.novafullstack: Remove nrpe checks [puppet] - 10https://gerrit.wikimedia.org/r/814798 [13:00:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2018.codfw.wmnet with OS bullseye [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220718T1300). [13:00:04] Daimona, cormacparle, and matthiasmullie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:08] o/ [13:00:10] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2018.codfw.wmnet with OS bullseye [13:00:13] o/ [13:00:29] jouncebot: I thought the functions had arguments? [13:00:32] o/ [13:00:56] alright, I can deploy [13:01:17] RECOVERY - Check systemd state on thanos-be2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:29] Thanks Lucas [13:01:38] uh, not sure if I know how to create db tables on beta though [13:01:49] Neither do I! Let the fun begin [13:02:01] I was going to ask if it's something I could do myself [13:02:11] It's been so long since I did a deployment I'd forgotten which channel I'm supposed to be in [13:02:23] Lucas_WMDE: I see there's a "Max 6 patches" notice on the window, and we're already over that, but can you deploy 2 more that are actually one (add and use logos for brwikimedia)? if you can't that's fine :) [13:02:35] we’ll see [13:02:45] okay :) [13:02:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Use getOption to detect user preferences [extensions/ImageSuggestions] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/814767 (https://phabricator.wikimedia.org/T313209) (owner: 10Matthias Mullie) [13:03:02] let’s start by +2ing the backport, since that’ll take a bit in gate-and-submit [13:03:13] cormacparle: are you theres [13:03:14] *there? 
[13:03:23] I am here [13:03:29] ok, then let’s start with those config changes [13:03:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P31345 and previous config saved to /var/cache/conftool/dbconfig/20220718-130343-marostegui.json [13:03:47] (03PS2) 10Lucas Werkmeister (WMDE): Update config for commons custommatch search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814108 (owner: 10Cparle) [13:03:53] cool [13:04:43] Lucas_WMDE: that backport doesn't need to go to mwdebug, can be synced right away [13:04:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P31346 and previous config saved to /var/cache/conftool/dbconfig/20220718-130443-ladsgroup.json [13:04:52] ok [13:04:55] Anyone here familiar with how to create DB tables on beta? Without breaking the world, that is. [13:04:57] (03PS1) 10David Caro: wmcs.novafullstack: stop sending stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/814800 [13:05:25] Daimona: think of the sticker you can get! [13:05:59] I ain't doin' that for a sticker, must be a tshirt at least! 
[13:06:21] (03CR) 10CI reject: [V: 04-1] wmcs.novafullstack: stop sending stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/814800 (owner: 10David Caro) [13:06:40] 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10Vgutierrez) p:05Triage→03Medium cc @Ottomata || @odimitrijevic for analytics-privatedata-users approval as of data.yaml [13:06:44] (03PS1) 10Ayounsi: Interface description: handle one more patch panel special case [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/814803 (https://phabricator.wikimedia.org/T304710) [13:06:45] 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10Vgutierrez) a:03Vgutierrez [13:07:01] Daimona: beta runs update.php once an hour for all wikis so if the tables are wired up they'll get created automatically [13:07:18] They're not, because we need them in wikishared and not in the local wiki DB :) [13:07:28] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Update config for commons custommatch search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814108 (owner: 10Cparle) [13:07:29] We don't like easy stuff. [13:08:18] ah, then connect to a deployment-mwmaint host, and manually create them on a mysql shell (`sql wikishared --write`) [13:08:20] (03Merged) 10jenkins-bot: Update config for commons custommatch search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814108 (owner: 10Cparle) [13:09:03] cormacparle: the custommatch change is live on mwdebug1001, can you test it? [13:09:10] sure [13:09:10] What are the beta mwmaint hosts? 
I'm not sure if I have access :D [13:09:38] of course it’s sql and not mwscript mysql.php $uhhIDontKnowWhichWiki [13:10:05] Lol [13:10:18] `sql` is smart and has a special-case for wikishared :-) [13:10:25] * Lucas_WMDE loves special cases [13:10:42] https://openstack-browser.toolforge.org/project/deployment-prep says you have access and the host you want is deployment-mwmaint02.deployment-prep.eqiad1.wikimedia.cloud [13:10:48] (03Merged) 10jenkins-bot: Use getOption to detect user preferences [extensions/ImageSuggestions] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/814767 (https://phabricator.wikimedia.org/T313209) (owner: 10Matthias Mullie) [13:10:52] (03PS2) 10Ayounsi: Interface description: handle one more patch panel special case [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/814803 (https://phabricator.wikimedia.org/T304710) [13:11:06] Lucas_WMDE: that custommatch change looks good [13:11:11] (03PS2) 10Ayounsi: Netbox _get_circuits: add patch panel support [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) [13:11:14] taavi: thanks, let me try [13:11:17] alright, syncing that [13:11:56] Daimona: please don’t make changes while I’m supposed to be responsible for the backport+config window… [13:12:27] Don't worry, just trying to see if I can access that host. I may need it for the future. 
[13:12:30] (03CR) 10Ayounsi: "Example diff before:" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/814803 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [13:12:47] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The following units failed: smokeping.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:14:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:14:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:14:21] alright, added `sql wikishared --write` to some deployment-prep wikitech docs [13:14:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:15:12] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:814108|Update config for commons custommatch search]] (duration: 02m 55s) [13:15:30] alright, matthiasmullie’s backport is next [13:15:46] great [13:16:28] i simplified the docs a bit too [13:16:28] (03PS1) 10Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 [13:16:28] (Confirming that I have access to that, thanks taavi, I wrote it down so I won't have to bother you or someone else next time) [13:16:46] (03CR) 10CI reject: [V: 04-1] Netbox _get_circuits: add patch panel support [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [13:17:25] (03PS2) 10Lucas Werkmeister (WMDE): Make weighted_tags search default for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814111 (owner: 10Cparle) [13:18:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [13:18:31] !log jmm@cumin2002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on ganeti2018.codfw.wmnet with reason: host reimage [13:18:31] (03CR) 10Hashar: "This is merely non sense coming from https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia-event-utilities/+/refs/heads/master "Genera" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [13:18:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P31347 and previous config saved to /var/cache/conftool/dbconfig/20220718-131848-marostegui.json [13:19:30] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/ImageSuggestions/maintenance/SendNotificationsForUnillustratedWatchedTitles.php: Backport: [[gerrit:814767|Use getOption to detect user preferences (T313209)]] (duration: 02m 50s) [13:19:34] T313209: SendNotificationsForUnillustratedWatchedTitles does not consider GlobalPreferences - https://phabricator.wikimedia.org/T313209 [13:19:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T312984)', diff saved to https://phabricator.wikimedia.org/P31348 and previous config saved to /var/cache/conftool/dbconfig/20220718-131949-ladsgroup.json [13:19:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1184.eqiad.wmnet with reason: Maintenance [13:19:52] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [13:20:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:20:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1184.eqiad.wmnet with reason: Maintenance [13:20:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T312984)', diff saved to https://phabricator.wikimedia.org/P31349 and previous config saved to 
/var/cache/conftool/dbconfig/20220718-132009-ladsgroup.json [13:20:45] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "`git show --patience --color-moved=dimmed-zebra` nicely shows that this only moves a block of code around without changing anything inside" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814111 (owner: 10Cparle) [13:21:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2018.codfw.wmnet with reason: host reimage [13:21:48] (03Merged) 10jenkins-bot: Make weighted_tags search default for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814111 (owner: 10Cparle) [13:22:15] Lucas_WMDE: yes one of those patches is just moving a block of code from one place to another ... the first search config in the list is the default, so it's changing the default search mechanism for commons [13:22:24] yup, makes sense [13:22:36] just wanted to confirm, and I always like spreading awareness of the --color-moved option ;) [13:22:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:22:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:22:55] cormacparle: alright, that change is on mwdebug1001, can you test it? [13:22:55] heh cool, I wasn't aware of it before! [13:22:59] sure [13:24:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T312984)', diff saved to https://phabricator.wikimedia.org/P31350 and previous config saved to /var/cache/conftool/dbconfig/20220718-132411-ladsgroup.json [13:25:00] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb postgress: Improve postgress standby server - https://phabricator.wikimedia.org/T313217 (10Volans) Replication slots seems more interesting and tailored on what we need here as far as I can tell from a quick look. Thanks for opening this. 
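The `git show --patience --color-moved=dimmed-zebra` invocation mentioned in the review comment above can be tried in a throwaway repository. The sketch below is only an illustration: the file name `settings.txt`, its contents, and the demo identity are made up for the example and have nothing to do with the actual mediawiki-config change. It commits a file, moves one block without altering it, and then shows the resulting commit with moved-line colouring.

```shell
#!/bin/sh
# Throwaway demo: commit a file, move one block unchanged, then view the
# commit with --color-moved. All names and contents here are hypothetical.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.invalid"   # placeholder identity for the demo commits
git config user.name "Demo User"
git config commit.gpgsign false                # keep the demo independent of signing setup

# First commit: the alpha block comes before the beta block.
cat > settings.txt <<'EOF'
alpha_profile_first_setting = 1
alpha_profile_second_setting = 2
alpha_profile_third_setting = 3

beta_profile_first_setting = 1
beta_profile_second_setting = 2
beta_profile_third_setting = 3
EOF
git add settings.txt
git commit -q -m "initial order: alpha, then beta"

# Second commit: same lines, but the beta block now comes first.
cat > settings.txt <<'EOF'
beta_profile_first_setting = 1
beta_profile_second_setting = 2
beta_profile_third_setting = 3

alpha_profile_first_setting = 1
alpha_profile_second_setting = 2
alpha_profile_third_setting = 3
EOF
git commit -q -a -m "move the beta block to the top"

# Show the move; --color=always forces colour even when output is not a terminal.
git --no-pager show --patience --color-moved=dimmed-zebra --color=always
```

In a terminal, lines that were only moved should appear in different colours from lines that were genuinely added or removed. The exact colours depend on the local Git colour configuration.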
[13:25:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:25:58] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb postgress: Improve postgress standby server - https://phabricator.wikimedia.org/T313217 (10jbond) [13:26:40] taavi: do you know if it’s still the case that Beta SAL messages should be logged in -releng instead of using !log deployment-prep in -cloud? [13:27:01] no clue [13:27:01] that’s what I heard a few years ago but the latest entries on https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL don’t directly look related [13:27:08] alright [13:27:09] Lucas_WMDE: the commons config change looks good [13:27:11] then I’ll just go with that ^^ [13:27:13] thanks cormacparle [13:27:49] syncing [13:28:11] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [13:28:31] (03PS1) 10Jbond: P:postgress::database: add docs and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/814809 (https://phabricator.wikimedia.org/T313217) [13:28:33] (03PS1) 10Jbond: C:postgress::server: add replication slot support [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) [13:30:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:30:39] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:814111|Make weighted_tags search default for commonswiki]] (duration: 02m 54s) [13:30:40] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2028.codfw.wmnet [13:31:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:31:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:31:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [13:31:38] alright, I’ll take a closer 
look at the beta changes now [13:32:07] (03PS2) 10David Caro: wmcs.novafullstack: stop sending stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/814800 [13:32:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:33:05] (03CR) 10CI reject: [V: 04-1] C:postgress::server: add replication slot support [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond) [13:33:49] (03CR) 10Lucas Werkmeister (WMDE): "Isn’t the addition to CommonSettings-labs.php redundant? It looks like most other extensions are only loaded via CommonSettings.php, as fa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813991 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:33:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T313070)', diff saved to https://phabricator.wikimedia.org/P31351 and previous config saved to /var/cache/conftool/dbconfig/20220718-133354-marostegui.json [13:33:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1178.eqiad.wmnet with reason: Maintenance [13:34:00] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070 [13:34:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1178.eqiad.wmnet with reason: Maintenance [13:34:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T313070)', diff saved to https://phabricator.wikimedia.org/P31352 and previous config saved to /var/cache/conftool/dbconfig/20220718-133414-marostegui.json [13:35:23] (03CR) 10Daimona Eaytoy: Load and configure the CampaignEvents extension where enabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813991 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:35:52] (03PS2) 10Daimona Eaytoy: Load and configure the 
CampaignEvents extension where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813991 (https://phabricator.wikimedia.org/T311752) [13:36:43] (03CR) 10Lucas Werkmeister (WMDE): Load and configure the CampaignEvents extension where enabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813991 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:37:01] okay, so the command to create the tables would be (on deployment-deploy03): [13:37:03] `sql wikishared --write < /srv/mediawiki-staging/php-master/extensions/CampaignEvents/db_patches/mysql/tables-generated.sql` [13:37:10] does that look okay Daimona taavi? [13:38:12] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The following units failed: smokeping.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:49] (03CR) 10Daimona Eaytoy: Load and configure the CampaignEvents extension where enabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813991 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:38:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet [13:39:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P31353 and previous config saved to /var/cache/conftool/dbconfig/20220718-133916-ladsgroup.json [13:39:38] (03CR) 10Lucas Werkmeister (WMDE): Load and configure the CampaignEvents extension where enabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813991 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:40:00] anyone wanna ack my SQL command above? 
^^ [13:40:00] Looks OK [13:40:04] ok thanks :) [13:40:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2018.codfw.wmnet with OS bullseye [13:40:23] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2018.codfw.wmnet with OS bullseye completed: - ganeti2018 (**PASS**) - Downtimed on... [13:40:55] seems to have worked [13:41:15] the table appears in DESCRIBE, including in a non-`--write` command (which I hope connects to a replica and indicates that the table creation replicated properly) [13:41:17] Yay! [13:42:17] also, php-1.39.0-wmf.19/extensions/CampaignEvents/ exists on deploy1002 (prod), so that looks fine as well [13:42:25] I think we can go ahead with the config changes [13:42:29] (03PS2) 10Lucas Werkmeister (WMDE): Add CampaignEvents to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813986 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:42:39] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add CampaignEvents to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813986 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:43:31] (03Merged) 10jenkins-bot: Add CampaignEvents to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813986 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:44:28] not sure if extension-list needs to be scapped in prod but let’s just do it [13:44:37] and I think I might also do a sync-world at the end [13:44:41] just in case [13:45:02] (if anything I say sounds like a bad idea, do let me know ^^) [13:45:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2028.codfw.wmnet to cluster codfw and group A [13:45:10] Yeah, I was also trying to find out if that's the case 
[13:45:20] (03PS2) 10Lucas Werkmeister (WMDE): Add config variable for the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813989 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:46:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2028.codfw.wmnet to cluster codfw and group A [13:46:27] (03PS2) 10Eevans: [DRAFT]: Bootstrap new AQS Cassandra nodes (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802) [13:47:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:47:59] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/extension-list: Config: [[gerrit:813986|Add CampaignEvents to extension-list (T311752)]] (duration: 03m 08s) [13:48:00] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:48:04] T311752: Release V0 of the CampaignEvents extension to the Beta Cluster - https://phabricator.wikimedia.org/T311752 [13:48:09] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add config variable for the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813989 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:48:58] (03Merged) 10jenkins-bot: Add config variable for the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813989 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:50:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:50:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:50:06] (03PS2) 10Jbond: P:postgress::database: add docs and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/814809 (https://phabricator.wikimedia.org/T313217) [13:50:39] (03PS2) 10Lucas 
Werkmeister (WMDE): Enable the CampaignEvents extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813990 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:51:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:51:06] PROBLEM - Check systemd state on mw1383 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:17] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:813989|Add config variable for the CampaignEvents extension (T311752)]] (no-op) (duration: 02m 55s) [13:53:23] T311752: Release V0 of the CampaignEvents extension to the Beta Cluster - https://phabricator.wikimedia.org/T311752 [13:54:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P31354 and previous config saved to /var/cache/conftool/dbconfig/20220718-135421-ladsgroup.json [13:54:29] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Note that the extension won’t *actually* be enabled until change I48805455fc wires up CommonSettings(-labs).php to load and configure the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813990 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:55:16] jouncebot: next [13:55:16] In 1 hour(s) and 34 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220718T1530) [13:55:23] (03Merged) 10jenkins-bot: Enable the CampaignEvents extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813990 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:55:29] okay, I think we’ll run over a bit but it should be okay [13:56:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:57:01] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:57:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:57:14] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [13:57:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:58:27] (03PS3) 10Lucas Werkmeister (WMDE): Load and configure the CampaignEvents extension where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813991 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [13:58:51] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:813990|Enable the CampaignEvents extension on beta (T311752)]] (no-op) (duration: 02m 43s) [13:58:54] T311752: Release V0 of the CampaignEvents extension to the Beta Cluster - https://phabricator.wikimedia.org/T311752 [14:00:07] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Load and configure the CampaignEvents extension where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813991 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [14:00:19] (backport+config window continues, for the record) [14:01:56] (03Merged) 10jenkins-bot: Load and configure the CampaignEvents extension where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813991 (https://phabricator.wikimedia.org/T311752) (owner: 10Daimona Eaytoy) [14:02:26] Daimona: I’ve pulled the last change to mwdebug1001, can you quickly check that the extension definitely isn’t enabled in production? 
[14:02:41] perhaps load some special page it would provide, or something [14:02:43] \o/ sure [14:03:03] I can’t see it in Wikidata’s Special:Version which is already a good sign [14:03:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:04:14] Yup, can't see it in prod [14:04:58] ok thanks [14:07:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:07:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:08:30] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:813991|Load and configure the CampaignEvents extension where enabled (T311752)]] (1/2: should be no-op) (duration: 02m 51s) [14:08:34] T311752: Release V0 of the CampaignEvents extension to the Beta Cluster - https://phabricator.wikimedia.org/T311752 [14:09:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T312984)', diff saved to https://phabricator.wikimedia.org/P31355 and previous config saved to /var/cache/conftool/dbconfig/20220718-140926-ladsgroup.json [14:09:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1128.eqiad.wmnet with reason: Maintenance [14:09:32] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [14:09:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1128.eqiad.wmnet with reason: Maintenance [14:09:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T312984)', diff saved to https://phabricator.wikimedia.org/P31356 and previous config saved to /var/cache/conftool/dbconfig/20220718-140947-ladsgroup.json [14:09:54] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:11:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:11:46] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:813991|Load and configure the CampaignEvents extension where enabled (T311752)]] (2/2: should be prod no-op) (duration: 02m 40s) [14:12:53] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/814820 [14:12:56] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:13:24] Daimona: it looks like it should be enabled in Beta by now [14:13:30] (https://integration.wikimedia.org/ci/job/beta-scap-sync-world/60217/console) [14:13:34] Yup, it's there! [14:13:37] yay [14:13:45] I’ll do a final sync-world in production just to make sure everything’s clean [14:13:52] Noice! [14:13:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T312984)', diff saved to https://phabricator.wikimedia.org/P31357 and previous config saved to /var/cache/conftool/dbconfig/20220718-141354-ladsgroup.json [14:13:55] Thanks again. 
[14:13:57] since I’m not sure the extension’s i18n would’ve been built when it was in wmf.19 but not yet in extension-list [14:13:58] np [14:14:32] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:14:47] !log lucaswerkmeister-wmde@deploy1002 Started scap: refresh everything after adding CampaignEvents to extension-list (T311752, only enabled in Beta so far), just in case [14:14:52] T311752: Release V0 of the CampaignEvents extension to the Beta Cluster - https://phabricator.wikimedia.org/T311752 [14:16:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:18:42] Lucas_WMDE: looked like a big backport window today, nice work---thank you <3 [14:18:52] you’re welcome :) [14:18:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:18:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:19:24] Tks4Fish: sorry we had no time for your patches today, please schedule them another time [14:19:39] ah, I see you already did, for the late window [14:20:11] So is it officially over? [14:20:35] not until the sync-world finishes, but I’m not going to do more patches after it [14:20:42] (currently at sync-proxies 87% btw) [14:21:21] Yeah, just asking because I didn't want to open the champagne before 100% [14:21:35] :D [14:22:04] if champagne is for v0 on beta, how are you going to celebrate production deployment? 
;) [14:22:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:22:50] champagne, but with an extra 0 on the price tag [14:22:59] fair enough [14:23:29] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron) [14:23:54] Well, guess I'll be happy with an iced tea for now. [14:24:07] 93% sync-apaches [14:24:13] it should be almost done [14:24:46] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron) [14:25:09] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron) [14:25:15] Oh, > 90%, that's when things usually start breaking [14:25:22] oh wow it tells me the rsync transfer was a total of 900 gigabytes [14:25:33] (average some 2½ gigs per host) [14:25:38] o_0 [14:26:46] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron) p:05Triage→03Medium [14:27:09] I suspect this is the total file size, not the amount of transferred data [14:27:35] earlier on mwdebug the scap pull reported 280k files, 10 GB total file size, but just 1 GB bytes transferred [14:27:44] hm [14:29:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P31358 and previous config saved to /var/cache/conftool/dbconfig/20220718-142859-ladsgroup.json [14:29:00] ^ dancy that seems like...a lot [14:29:27] !log lucaswerkmeister-wmde@deploy1002 Finished scap: refresh everything after adding CampaignEvents to extension-list (T311752, only enabled in Beta so far), just in case (duration: 14m 40s) [14:29:30] I can keep the terminal window open if you 
want the full outputs [14:29:31] T311752: Release V0 of the CampaignEvents extension to the Beta Cluster - https://phabricator.wikimedia.org/T311752 [14:29:39] !log UTC afternoon backport+config window done [14:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:42] of transfer for a sync-world that was a cleanup, (is that right Lucas_WMDE -- no new l10n expected?) [14:29:49] yes [14:29:57] hmm? [14:30:11] I have a suspicion that there's some process that's invalidating l10n when it's unneeded [14:30:13] the l10n rebuild finished in less than a minute too [14:30:50] whether that's in scap or on the deployment server: I'm unsure. But rsync seems to be syncing all l10n on each sync-world (or the last few I've done) [14:30:57] hm [14:31:19] thcipriani: Even for a back-to-back run? [14:32:10] dancy: I only did one, and it didn't do it for back-to-back runs last I checked (a couple weeks ago) but when I try it at the beginning of the backport window it seems to happen every time [14:32:39] which makes me think it could be some automated process on the deployment box somewhere [14:32:55] the good news: everything works correctly; the bad news: takes a long time [14:32:57] There aren't any such processes that I'm aware of. [14:33:12] yeah, same [14:33:40] the new l10n backports are only for REL1_ branches, not wmf. ones, right? 
[14:33:46] (now-ish) [14:33:50] (*new-ish) [14:34:01] (03PS2) 10Jbond: C:postgress::server: add replication slot support [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) [14:34:05] otherwise that might be an explanation [14:34:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T313070)', diff saved to https://phabricator.wikimedia.org/P31359 and previous config saved to /var/cache/conftool/dbconfig/20220718-143428-marostegui.json [14:34:30] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/814803 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [14:34:34] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070 [14:36:21] (03PS1) 10Jbond: O:puppetdb: enable postgress slots for replication [puppet] - 10https://gerrit.wikimedia.org/r/814824 (https://phabricator.wikimedia.org/T313217) [14:41:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36282/console" [puppet] - 10https://gerrit.wikimedia.org/r/814824 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond) [14:41:55] (03CR) 10CI reject: [V: 04-1] C:postgress::server: add replication slot support [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond) [14:42:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet [14:44:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P31360 and previous config saved to /var/cache/conftool/dbconfig/20220718-144404-ladsgroup.json [14:45:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:46:39] (03CR) 
10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/799001 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [14:47:09] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48391 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:49:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P31361 and previous config saved to /var/cache/conftool/dbconfig/20220718-144934-marostegui.json [14:50:32] (03PS1) 10Jbond: test reverting storconfig change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/814826 [14:51:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet [14:51:30] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/813724 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:52:30] (03CR) 10Andrea Denisse: [C: 03+1] profile: make loki data directory configurable [puppet] - 10https://gerrit.wikimedia.org/r/813715 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:52:34] (03CR) 10CI reject: [V: 04-1] test reverting storconfig change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/814826 (owner: 10Jbond) [14:53:55] (03PS3) 10Ayounsi: Netbox _get_circuits: add patch panel support [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) [14:53:57] (03PS1) 10Ayounsi: Add Python 3.10 support [software/homer] - 10https://gerrit.wikimedia.org/r/814827 [14:55:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2012.codfw.wmnet to cluster codfw and group C [14:56:12] (03CR) 10Ayounsi: "This returns error:" [software/homer] - 10https://gerrit.wikimedia.org/r/814827 (owner: 10Ayounsi) [14:56:57] (03PS4) 10Ayounsi: Netbox _get_circuits: add patch panel support 
[software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) [14:58:21] (03CR) 10Dzahn: "I am just trying to avoid paging the entire SRE team. When switching to new types of monitoring there often is a false positive or some fo" [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:58:24] (03PS5) 10Dbrant: Add sampling to android.breadcrumbs event stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811765 (https://phabricator.wikimedia.org/T310847) [14:59:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2012.codfw.wmnet to cluster codfw and group C [14:59:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T312984)', diff saved to https://phabricator.wikimedia.org/P31362 and previous config saved to /var/cache/conftool/dbconfig/20220718-145909-ladsgroup.json [14:59:11] (03CR) 10Ahmon Dancy: safe-service-restart.py: Avoid uninitialized access to 'status' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807624 (https://phabricator.wikimedia.org/T311182) (owner: 10Ahmon Dancy) [14:59:15] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [14:59:47] (03PS1) 10Jbond: create_puppetconf no longer takes the directory parameter [puppet] - 10https://gerrit.wikimedia.org/r/814832 [15:00:04] (03CR) 10Jbond: [V: 03+2 C: 03+2] create_puppetconf no longer takes the directory parameter [puppet] - 10https://gerrit.wikimedia.org/r/814832 (owner: 10Jbond) [15:00:41] (03CR) 10CI reject: [V: 04-1] Add Python 3.10 support [software/homer] - 10https://gerrit.wikimedia.org/r/814827 (owner: 10Ayounsi) [15:02:08] (03PS5) 10Ayounsi: Netbox _get_circuits: add patch panel support [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) [15:03:05] 
(03CR) 10Jbond: [C: 03+2] P:postgress::database: add docs and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/814809 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond) [15:04:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P31363 and previous config saved to /var/cache/conftool/dbconfig/20220718-150439-marostegui.json [15:08:17] (03CR) 10CI reject: [V: 04-1] Netbox _get_circuits: add patch panel support [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [15:08:51] (03PS2) 10Muehlenhoff: tcpircbot: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811231 (https://phabricator.wikimedia.org/T308013) [15:08:58] (03PS3) 10Jbond: C:postgress::server: add replication slot support [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) [15:09:00] (03PS2) 10Jbond: O:puppetdb: enable postgress slots for replication [puppet] - 10https://gerrit.wikimedia.org/r/814824 (https://phabricator.wikimedia.org/T313217) [15:09:30] (03PS4) 10Jbond: C:postgress::server: add replication slot support [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) [15:09:36] (03PS3) 10Jbond: O:puppetdb: enable postgress slots for replication [puppet] - 10https://gerrit.wikimedia.org/r/814824 (https://phabricator.wikimedia.org/T313217) [15:10:29] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [15:10:56] (03CR) 10CI reject: [V: 04-1] O:puppetdb: enable postgress slots for replication [puppet] - 10https://gerrit.wikimedia.org/r/814824 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond) [15:11:22] (03CR) 10Muehlenhoff: [C: 03+2] tcpircbot: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811231 
(https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:13:26] 10SRE-OnFire, 10Discovery-Search, 10Wikidata, 10wdwb-tech, and 4 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10Gehel) [15:14:03] 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 4 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10Gehel) [15:17:03] (03PS1) 10Bartosz Dziewoński: Ensure custom locales for Moment.js overrides, don't change 'en' [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/814769 (https://phabricator.wikimedia.org/T313188) [15:19:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T313070)', diff saved to https://phabricator.wikimedia.org/P31364 and previous config saved to /var/cache/conftool/dbconfig/20220718-151944-marostegui.json [15:19:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1116.eqiad.wmnet with reason: Maintenance [15:19:49] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070 [15:19:54] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [15:19:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1116.eqiad.wmnet with reason: Maintenance [15:20:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1116.eqiad.wmnet with reason: Maintenance [15:20:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1116.eqiad.wmnet with reason: Maintenance [15:20:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1177.eqiad.wmnet with reason: Maintenance [15:20:21] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on db1177.eqiad.wmnet with reason: Maintenance [15:20:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T313070)', diff saved to https://phabricator.wikimedia.org/P31365 and previous config saved to /var/cache/conftool/dbconfig/20220718-152026-marostegui.json [15:21:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T313070)', diff saved to https://phabricator.wikimedia.org/P31366 and previous config saved to /var/cache/conftool/dbconfig/20220718-152132-marostegui.json [15:25:37] (03PS2) 10Ayounsi: Add Python 3.10 support [software/homer] - 10https://gerrit.wikimedia.org/r/814827 [15:25:39] (03PS6) 10Ayounsi: Netbox _get_circuits: add patch panel support [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) [15:25:41] (03PS1) 10Ayounsi: Workaround mypy type error on pyyaml [software/homer] - 10https://gerrit.wikimedia.org/r/814839 [15:29:18] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [15:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220718T1530). [15:32:09] 10SRE, 10ops-codfw, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10MPhamWMF) [15:33:54] 10SRE, 10ops-codfw, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10Gehel) a:05Papaul→03bking We have enough over capacity in that cluster, and this server should be scheduled for refresh next year. Le... 
[15:35:14] (03CR) 10PleaseStand: Add change_templatelinks_pk.py (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/814729 (https://phabricator.wikimedia.org/T312863) (owner: 10Ladsgroup) [15:36:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P31367 and previous config saved to /var/cache/conftool/dbconfig/20220718-153637-marostegui.json [15:36:53] (03Abandoned) 10David Caro: DONOTMERGE: skeleteon for the replicaconfig service [puppet] - 10https://gerrit.wikimedia.org/r/780853 (owner: 10David Caro) [15:37:29] (03Abandoned) 10David Caro: novafullstack: allow running on codfw [puppet] - 10https://gerrit.wikimedia.org/r/811318 (owner: 10David Caro) [15:38:00] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Interface description: handle one more patch panel special case [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/814803 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [15:40:12] (03CR) 10Ladsgroup: [C: 03+2] Add change_templatelinks_pk.py (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/814729 (https://phabricator.wikimedia.org/T312863) (owner: 10Ladsgroup) [15:41:28] (03PS1) 10Ladsgroup: change_templatelinks_pk: Fix check [software/schema-changes] - 10https://gerrit.wikimedia.org/r/814844 [15:42:21] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814846 (https://phabricator.wikimedia.org/T128546) [15:44:18] (03CR) 10Ladsgroup: [C: 03+2] change_templatelinks_pk: Fix check [software/schema-changes] - 10https://gerrit.wikimedia.org/r/814844 (owner: 10Ladsgroup) [15:44:29] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814846 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:45:08] (03Merged) 10jenkins-bot: change_templatelinks_pk: Fix check [software/schema-changes] - 
10https://gerrit.wikimedia.org/r/814844 (owner: 10Ladsgroup) [15:45:28] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814846 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:46:04] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10fgiunchedi) >>! In T307184#8069141, @fgiunchedi wrote: > Overall the idea of sending additional headers is the right one @Jgianne... [15:48:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:49:44] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:814846| Bumping portals to master (T128546)]] (duration: 03m 03s) [15:49:48] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:51:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:51:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:51:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P31368 and previous config saved to /var/cache/conftool/dbconfig/20220718-155143-marostegui.json [15:52:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:52:43] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:814846| Bumping portals to master (T128546)]] (duration: 02m 59s) [15:53:20] (03PS1) 10Andrea Denisse: netmon: Add suppport for multiple backup/passive nodes in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) [15:54:20] (03PS24) 10Ayounsi: Decom cookbook: configure switches using cookbook [cookbooks] - 
10https://gerrit.wikimedia.org/r/803262 [15:54:22] (03PS2) 10Ayounsi: provision cookbook: configure switches using cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 [15:54:36] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:45] (03CR) 10Ayounsi: "thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/803262 (owner: 10Ayounsi) [15:55:21] (03PS1) 10Filippo Giunchedi: smokeping: fix targets configuration for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/814849 (https://phabricator.wikimedia.org/T169860) [15:55:47] (03CR) 10CI reject: [V: 04-1] smokeping: fix targets configuration for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/814849 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [15:56:13] (03PS2) 10Filippo Giunchedi: smokeping: fix targets configuration for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/814849 (https://phabricator.wikimedia.org/T169860) [15:57:43] (03CR) 10Filippo Giunchedi: [C: 03+2] smokeping: fix targets configuration for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/814849 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [16:02:31] (03PS1) 10Ahmon Dancy: Avoid additional errors if connection to poolcounter server fails [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814851 (https://phabricator.wikimedia.org/T310835) [16:05:29] (03CR) 10CI reject: [V: 04-1] Avoid additional errors if connection to poolcounter server fails [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814851 (https://phabricator.wikimedia.org/T310835) (owner: 10Ahmon Dancy) [16:06:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T313070)', diff saved to https://phabricator.wikimedia.org/P31369 and previous config saved to /var/cache/conftool/dbconfig/20220718-160648-marostegui.json [16:06:50] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1126.eqiad.wmnet with reason: Maintenance [16:06:53] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070 [16:07:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1126.eqiad.wmnet with reason: Maintenance [16:07:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T313070)', diff saved to https://phabricator.wikimedia.org/P31370 and previous config saved to /var/cache/conftool/dbconfig/20220718-160708-marostegui.json [16:07:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:08:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T313070)', diff saved to https://phabricator.wikimedia.org/P31371 and previous config saved to /var/cache/conftool/dbconfig/20220718-160813-marostegui.json [16:09:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:09:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:12:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:17:03] (03PS1) 10Ebernhardson: reindex: Detect index type from live mappings [extensions/CirrusSearch] (wmf/1.39.0-wmf.20) - 10https://gerrit.wikimedia.org/r/814770 [16:17:14] (03PS1) 10Ebernhardson: reindex: Detect index type from live mappings [extensions/CirrusSearch] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/814771 [16:23:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P31372 and previous config saved to /var/cache/conftool/dbconfig/20220718-162319-marostegui.json [16:24:22] (03PS2) 10Ahmon Dancy: Avoid additional errors if connection to poolcounter server fails 
[software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814851 (https://phabricator.wikimedia.org/T310835) [16:28:01] (03CR) 10CI reject: [V: 04-1] Avoid additional errors if connection to poolcounter server fails [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814851 (https://phabricator.wikimedia.org/T310835) (owner: 10Ahmon Dancy) [16:28:14] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10Jgiannelos) Technically we can do this (although it wasn't very trivial from a quick look at the s3 go sdk). Maybe its worth revi... [16:29:21] 10SRE, 10Security-Team, 10WMF-Legal, 10SecTeam-Processed, and 2 others: T166179 has attachments that perhaps shouldn't have been made public - https://phabricator.wikimedia.org/T313125 (10sbassett) No need to directly engage #wmf-legal on this. The issue appears to be resolved by @RobH, so making this tas... [16:29:27] 10SRE, 10Security-Team, 10WMF-Legal, 10SecTeam-Processed, and 2 others: T166179 has attachments that perhaps shouldn't have been made public - https://phabricator.wikimedia.org/T313125 (10sbassett) [16:29:35] 10SRE, 10Security-Team, 10WMF-Legal, 10SecTeam-Processed, and 2 others: T166179 has attachments that perhaps shouldn't have been made public - https://phabricator.wikimedia.org/T313125 (10sbassett) p:05Triage→03Low a:03RobH [16:30:02] 10SRE, 10Security-Team, 10WMF-Legal, 10SecTeam-Processed, and 2 others: T166179 has attachments that perhaps shouldn't have been made public - https://phabricator.wikimedia.org/T313125 (10sbassett) 05Open→03Resolved [16:31:11] (03CR) 10Ahmon Dancy: "Not sure what to do about the various CI errors. I don't think they're related to the changes I made." 
[software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814851 (https://phabricator.wikimedia.org/T310835) (owner: 10Ahmon Dancy) [16:37:38] (03PS1) 10Majavah: openstack: wmcs-image-create: adapt for systemd based puppet runs [puppet] - 10https://gerrit.wikimedia.org/r/814857 [16:37:50] (03PS2) 10Majavah: openstack: wmcs-image-create: adapt for systemd based puppet runs [puppet] - 10https://gerrit.wikimedia.org/r/814857 [16:38:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P31373 and previous config saved to /var/cache/conftool/dbconfig/20220718-163824-marostegui.json [16:52:20] 10SRE, 10DC-Ops, 10Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [16:53:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T313070)', diff saved to https://phabricator.wikimedia.org/P31374 and previous config saved to /var/cache/conftool/dbconfig/20220718-165329-marostegui.json [16:53:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1101.eqiad.wmnet with reason: Maintenance [16:53:36] T313070: Adjust the field type of wb_changes.change_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T313070 [16:53:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1101.eqiad.wmnet with reason: Maintenance [16:53:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T313070)', diff saved to https://phabricator.wikimedia.org/P31375 and previous config saved to /var/cache/conftool/dbconfig/20220718-165349-marostegui.json [16:54:02] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [16:54:55] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T313070)', diff saved to https://phabricator.wikimedia.org/P31376 and previous config saved to /var/cache/conftool/dbconfig/20220718-165455-marostegui.json [16:56:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31377 and previous config saved to /var/cache/conftool/dbconfig/20220718-165617-root.json [16:56:30] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [17:00:05] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220718T1700). [17:11:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31378 and previous config saved to /var/cache/conftool/dbconfig/20220718-171122-root.json [17:11:45] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: add a no-op userid hash generator [puppet] - 10https://gerrit.wikimedia.org/r/812403 (owner: 10Andrew Bogott) [17:21:31] (03CR) 10Andrew Bogott: [C: 03+2] Remove cloudstore100[89] IPs from the dmz_cidr [puppet] - 10https://gerrit.wikimedia.org/r/810351 (https://phabricator.wikimedia.org/T311844) (owner: 10Andrew Bogott) [17:23:10] (03PS2) 10David Caro: wmcs.labstore: add some alerts for labstore [alerts] - 10https://gerrit.wikimedia.org/r/813926 [17:25:44] (03CR) 10CI reject: [V: 04-1] wmcs.labstore: add some alerts for labstore [alerts] - 10https://gerrit.wikimedia.org/r/813926 (owner: 10David Caro) [17:26:17] (03PS3) 10David Caro: wmcs.labstore: add some alerts for labstore [alerts] - 10https://gerrit.wikimedia.org/r/813926 [17:26:27] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31379 and previous config saved to /var/cache/conftool/dbconfig/20220718-172626-root.json [17:28:39] (03CR) 10CI reject: [V: 04-1] wmcs.labstore: add some alerts for labstore [alerts] - 10https://gerrit.wikimedia.org/r/813926 (owner: 10David Caro) [17:30:20] 10SRE-swift-storage, 10User-fgiunchedi: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (10Aklapper) [17:33:11] (03PS3) 10Sohom Datta: Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) [17:41:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31380 and previous config saved to /var/cache/conftool/dbconfig/20220718-174130-root.json [17:43:08] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2065.codfw.wmnet with OS bullseye [17:47:47] (03CR) 10Raymond Ndibe: [C: 03+1] "Not enough context to +2, so I'll just +1" [puppet] - 10https://gerrit.wikimedia.org/r/813275 (owner: 10David Caro) [17:51:38] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [17:54:08] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [17:56:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31381 and previous config saved to /var/cache/conftool/dbconfig/20220718-175634-root.json [17:56:38] (03PS1) 10Jdlrobson: Collapse sidebar by default for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814865 (https://phabricator.wikimedia.org/T287609) [17:57:00] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2065.codfw.wmnet with reason: host reimage [17:59:19] (03PS29) 10Jbond: beaker: add initial beaker files [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) [17:59:21] (03PS1) 10Jbond: beaker: add a method to hack fixes specific to beaker [puppet] - 10https://gerrit.wikimedia.org/r/814866 [18:02:10] (03PS1) 10Jdlrobson: Enable language switching button for logged-out users on non-pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814867 (https://phabricator.wikimedia.org/T312861) [18:02:12] (03PS1) 10Jdlrobson: Turn off fixed width in main namespace on Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814868 (https://phabricator.wikimedia.org/T311607) [18:02:14] (03PS1) 10Jdlrobson: Deploy the new grid layout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) [18:02:49] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2065.codfw.wmnet with reason: host reimage [18:04:48] 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (The Decommission Mission 💀): replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10thcipriani) [18:05:38] (03CR) 10Herron: logstash: enable pipeline-managed index patterns 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799001 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [18:07:37] (03CR) 10Herron: [C: 03+1] profile: make loki data directory configurable [puppet] - 10https://gerrit.wikimedia.org/r/813715 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [18:07:39] (03PS3) 10Dduvall: gitlab_runner: Handle changes to runner config [puppet] - 10https://gerrit.wikimedia.org/r/812402 (https://phabricator.wikimedia.org/T311746) [18:08:20] (03CR) 10Dduvall: "Thanks for the review, Jelto. I believe I've addressed your comments." [puppet] - 10https://gerrit.wikimedia.org/r/812402 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall) [18:08:43] (03PS1) 10Ebernhardson: Turn off ApiFeatureUsage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814870 (https://phabricator.wikimedia.org/T313248) [18:08:45] (03PS1) 10Ebernhardson: Remove references to ApiFeatureUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814871 (https://phabricator.wikimedia.org/T313248) [18:08:56] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: rearrange how service domains are configured. [puppet] - 10https://gerrit.wikimedia.org/r/812406 (owner: 10Andrew Bogott) [18:09:07] (03PS4) 10Andrew Bogott: Keystone: rearrange how service domains are configured. 
[puppet] - 10https://gerrit.wikimedia.org/r/812406 [18:09:11] (03CR) 10Herron: [C: 03+1] hiera: deploy and enable loki on grafana hosts [puppet] - 10https://gerrit.wikimedia.org/r/813724 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [18:11:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31382 and previous config saved to /var/cache/conftool/dbconfig/20220718-181138-root.json [18:15:12] (03PS1) 10Ebernhardson: Remove unused wmgUseApiFeatureUsage config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814873 (https://phabricator.wikimedia.org/T313248) [18:16:27] (03CR) 10Raymond Ndibe: "noop question here. Saw the word "Icinga" in atleast two of the currently open patches and googled about it. I'd say that I get the idea o" [alerts] - 10https://gerrit.wikimedia.org/r/813926 (owner: 10David Caro) [18:16:56] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2065.codfw.wmnet with OS bullseye [18:17:16] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2066.codfw.wmnet with OS bullseye [18:17:18] (03Abandoned) 10Ebernhardson: Turn off ApiFeatureUsage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814870 (https://phabricator.wikimedia.org/T313248) (owner: 10Ebernhardson) [18:17:56] (03PS2) 10Ebernhardson: Remove references to ApiFeatureUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814871 (https://phabricator.wikimedia.org/T313248) [18:19:00] (03PS10) 10Dduvall: docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) [18:19:23] (03CR) 10Dduvall: docker_registry_ha: Authorize GitLab trusted runners using JWT (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) 
[18:24:07] (03CR) 10Jforrester: "You have to do these kinds of changes as two or three different patches for deploy safety; first disable in IS (nothing to do here), then " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814871 (https://phabricator.wikimedia.org/T313248) (owner: 10Ebernhardson) [18:24:18] (03PS3) 10Ebernhardson: Remove references to ApiFeatureUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814871 (https://phabricator.wikimedia.org/T313248) [18:24:20] (03PS2) 10Ebernhardson: Remove unused wmgUseApiFeatureUsage config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814873 (https://phabricator.wikimedia.org/T313248) [18:26:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31384 and previous config saved to /var/cache/conftool/dbconfig/20220718-182642-root.json [18:27:26] (03CR) 10Ebernhardson: Remove references to ApiFeatureUsage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814871 (https://phabricator.wikimedia.org/T313248) (owner: 10Ebernhardson) [18:28:08] (03CR) 10Clare Ming: Deploy the new grid layout (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson) [18:29:46] (03CR) 10Clare Ming: [C: 03+1] Turn off fixed width in main namespace on Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814868 (https://phabricator.wikimedia.org/T311607) (owner: 10Jdlrobson) [18:32:18] (03CR) 10Clare Ming: Enable language switching button for logged-out users on non-pilot wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814867 (https://phabricator.wikimedia.org/T312861) (owner: 10Jdlrobson) [18:32:42] (03CR) 10Herron: logstash: duplicate alert logs for loki target (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) 
[18:35:24] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2066.codfw.wmnet with OS bullseye [18:36:07] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2066.codfw.wmnet with OS bullseye [18:39:53] (03PS1) 10Zabe: cassandra: Add SPDX headers to cassandra profile [puppet] - 10https://gerrit.wikimedia.org/r/814876 (https://phabricator.wikimedia.org/T308013) [18:39:55] (03PS1) 10Zabe: certspotter: Add SPDX headers to certspotter profile [puppet] - 10https://gerrit.wikimedia.org/r/814877 (https://phabricator.wikimedia.org/T308013) [18:39:57] (03PS1) 10Zabe: chartmuseum: Add SPDX headers to chartmuseum profile [puppet] - 10https://gerrit.wikimedia.org/r/814878 (https://phabricator.wikimedia.org/T308013) [18:39:59] (03PS1) 10Zabe: codesearch: Add SPDX headers to codesearch profile [puppet] - 10https://gerrit.wikimedia.org/r/814879 (https://phabricator.wikimedia.org/T308013) [18:40:01] (03PS1) 10Zabe: conftool: Add SPDX headers to conftool profile [puppet] - 10https://gerrit.wikimedia.org/r/814880 (https://phabricator.wikimedia.org/T308013) [18:40:03] (03PS1) 10Zabe: cumin: Add SPDX headers to cumin profile [puppet] - 10https://gerrit.wikimedia.org/r/814881 (https://phabricator.wikimedia.org/T308013) [18:40:07] (03PS1) 10Zabe: dbbackups: Add SPDX headers to dbbackups profile [puppet] - 10https://gerrit.wikimedia.org/r/814882 (https://phabricator.wikimedia.org/T308013) [18:40:09] (03PS1) 10Zabe: debdeploy: Add SPDX headers to debdeploy profile [puppet] - 10https://gerrit.wikimedia.org/r/814883 (https://phabricator.wikimedia.org/T308013) [18:40:11] (03PS1) 10Zabe: diffscan: Add SPDX headers to diffscan profile [puppet] - 10https://gerrit.wikimedia.org/r/814884 (https://phabricator.wikimedia.org/T308013) [18:41:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31385 and previous config saved to 
/var/cache/conftool/dbconfig/20220718-184146-root.json [18:42:50] (03PS4) 10Ebernhardson: Remove references to ApiFeatureUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814871 (https://phabricator.wikimedia.org/T313248) [18:42:52] (03PS3) 10Ebernhardson: Remove i18n and IS references to ApiFeatureUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814873 (https://phabricator.wikimedia.org/T313248) [18:43:48] (03PS2) 10Zabe: cassandra: Add SPDX headers to cassandra profile [puppet] - 10https://gerrit.wikimedia.org/r/814876 (https://phabricator.wikimedia.org/T308013) [18:48:39] (03PS2) 10Zabe: conftool: Add SPDX headers to conftool profile [puppet] - 10https://gerrit.wikimedia.org/r/814880 (https://phabricator.wikimedia.org/T308013) [18:53:52] (03PS2) 10Zabe: cumin: Add SPDX headers to cumin profile [puppet] - 10https://gerrit.wikimedia.org/r/814881 (https://phabricator.wikimedia.org/T308013) [18:57:55] (03PS2) 10Zabe: dbbackups: Add SPDX headers to dbbackups profile [puppet] - 10https://gerrit.wikimedia.org/r/814882 (https://phabricator.wikimedia.org/T308013) [19:00:28] (03PS2) 10Zabe: debdeploy: Add SPDX headers to debdeploy profile [puppet] - 10https://gerrit.wikimedia.org/r/814883 (https://phabricator.wikimedia.org/T308013) [19:02:56] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2066.codfw.wmnet with OS bullseye [19:03:12] (03CR) 10Clare Ming: [C: 03+1] Collapse sidebar by default for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814865 (https://phabricator.wikimedia.org/T287609) (owner: 10Jdlrobson) [19:04:12] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2066.codfw.wmnet with OS bullseye [19:09:52] (03CR) 10Cwhite: logstash: enable pipeline-managed index patterns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799001 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [19:11:42] (03PS1) 10Andrew Bogott: magnum.conf: 
use trustee_domain_admin_domain_name instead of _id [puppet] - 10https://gerrit.wikimedia.org/r/814886 [19:12:52] (03CR) 10Andrew Bogott: [C: 03+2] magnum.conf: use trustee_domain_admin_domain_name instead of _id [puppet] - 10https://gerrit.wikimedia.org/r/814886 (owner: 10Andrew Bogott) [19:13:45] (03CR) 10Urbanecm: [C: 03+1] Mentorship: enable the Vue version of the dashboard in test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814789 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [19:19:53] (03CR) 10Cwhite: logstash: duplicate alert logs for loki target (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [19:20:27] 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (The Decommission Mission 💀): replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10dancy) [19:31:05] (03PS2) 10Urbanecm: [beta] GrowthExperiments: Remove variables that are primarily set on-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811663 [19:31:19] (03CR) 10Urbanecm: [C: 03+2] "beta-only, should be no-op" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811663 (owner: 10Urbanecm) [19:31:59] (03CR) 10Andrew Bogott: [C: 03+2] Change formatting of a few openstack calls [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810107 (owner: 10Andrew Bogott) [19:32:12] (03PS5) 10Andrew Bogott: Change formatting of a few openstack calls [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810107 [19:32:16] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: Remove variables that are primarily set on-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811663 (owner: 10Urbanecm) [19:35:07] (03CR) 10Urbanecm: [C: 03+1] "Code is good and happy to deploy this. 
I have few questions (see the CU patch), but no issues with pinning this variable to SCHEMA_COMPAT_" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814305 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [19:38:06] (03CR) 10Urbanecm: [C: 04-1] "Please optimize the SVG files (see https://www.mediawiki.org/wiki/Manual:Assets#SVG_files for details on how). For any SVG resource in the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814372 (https://phabricator.wikimedia.org/T313194) (owner: 10Tks4Fish) [19:40:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:41:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:41:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:42:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:42:42] (03CR) 10Urbanecm: [C: 04-1] "one additional request: would it be possible to document the SVG file you used? a link to commons on the task would be ideal." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814372 (https://phabricator.wikimedia.org/T313194) (owner: 10Tks4Fish) [19:43:46] (03CR) 10Urbanecm: [C: 03+2] "Backport, starting CI slightly ahead B&C to save a bit of time." 
[core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/814769 (https://phabricator.wikimedia.org/T313188) (owner: 10Bartosz Dziewoński) [19:45:25] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2066.codfw.wmnet with OS bullseye [19:52:11] (03PS1) 10Andrew Bogott: Make cloudcontrol100[67] into live cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/814890 (https://phabricator.wikimedia.org/T306853) [19:54:59] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudcontrol100[67] into live cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/814890 (https://phabricator.wikimedia.org/T306853) (owner: 10Andrew Bogott) [20:00:02] (03CR) 10Cwhite: [C: 03+2] profile: make loki data directory configurable [puppet] - 10https://gerrit.wikimedia.org/r/813715 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [20:00:04] RoanKattouw, Urbanecm, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220718T2000). [20:00:04] zabe, Tks4Fish, sergi0, MatmaRex, ebernhardson, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] hi! i can deploy today [20:00:14] \o [20:00:16] hello [20:00:19] we're quite full today [20:00:32] hi [20:01:23] it's likely we won't have time for some of the patches: are there any urgent patches that need to go out today? similarly, are there any non-urgent patches that can be skipped if needed? 
(thanks Jdlrobson for providing this info in the calendar) [20:01:33] hey [20:01:58] (03PS2) 10Urbanecm: Mentorship: enable the Vue version of the dashboard in test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814789 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [20:02:03] (03CR) 10Urbanecm: [C: 03+2] Mentorship: enable the Vue version of the dashboard in test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814789 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [20:02:14] o/ present [20:02:31] (03PS1) 10Ahmon Dancy: Handle socket.timeout the same way as TimeoutError [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814893 [20:02:32] urbanecm: we can re-schedule 814789 for tomorrow if that helps [20:02:36] urbanecm: i was worried it would be busy this morning :) [20:03:00] (03Merged) 10jenkins-bot: Ensure custom locales for Moment.js overrides, don't change 'en' [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/814769 (https://phabricator.wikimedia.org/T313188) (owner: 10Bartosz Dziewoński) [20:03:11] (03Merged) 10jenkins-bot: Mentorship: enable the Vue version of the dashboard in test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814789 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [20:03:52] urbanecm: mine isn't particularly time critical, but i would like to get moving on the things it blocks. [20:03:52] sergi0: MatmaRex: your patches are at mwdebug1001, can you check? 
[20:04:10] checking [20:04:21] (03PS2) 10Urbanecm: Collapse sidebar by default for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814865 (https://phabricator.wikimedia.org/T287609) (owner: 10Jdlrobson) [20:04:27] looking [20:04:28] (03CR) 10Urbanecm: [C: 03+2] Collapse sidebar by default for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814865 (https://phabricator.wikimedia.org/T287609) (owner: 10Jdlrobson) [20:05:20] urbanecm: looks good [20:05:21] (03Merged) 10jenkins-bot: Collapse sidebar by default for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814865 (https://phabricator.wikimedia.org/T287609) (owner: 10Jdlrobson) [20:05:23] (03CR) 10CI reject: [V: 04-1] Handle socket.timeout the same way as TimeoutError [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814893 (owner: 10Ahmon Dancy) [20:05:24] urbanecm: all good from my end [20:05:52] zabe: hi, i commented on the CU patch. can you confirm beta should be pinned to old too? [20:05:55] thanks MatmaRex and sergi0, syncing [20:06:30] urbanecm, there is not checkuser on beta [20:06:37] s/not/no [20:06:46] somewhat, i forgot about that. 
[20:06:53] all good then :) [20:07:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:57] (03PS2) 10Urbanecm: Pin cu_log actor migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814305 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:08:07] (03CR) 10Urbanecm: [C: 03+2] Pin cu_log actor migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814305 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:08:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:08:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:09:03] (03Merged) 10jenkins-bot: Pin cu_log actor migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814305 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:09:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:10:35] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 76b7cd6379c25175570eeeb2a305de0fd0bc61e5: Mentorship: enable the Vue version of the dashboard in test (T300532) (duration: 03m 00s) [20:10:39] T300532: Migration of mentee overview to Vue - https://phabricator.wikimedia.org/T300532 [20:11:18] (03PS1) 10BCornwall: Icinga: Remove traffic alerts [puppet] - 10https://gerrit.wikimedia.org/r/814894 (https://phabricator.wikimedia.org/T300723) [20:11:41] Jdlrobson: your patch is at mwdebug1001, can you check? 
[20:11:49] (syncing the wmf.19 backport in the meanwhile) [20:13:20] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.19/resources/src/moment/moment-locale-overrides.js: c4d8a217b4ce0a9f7aefaacc032136e7eb058d4d: Ensure custom locales for Moment.js overrides, dont change en (T313188) (duration: 02m 44s) [20:13:24] T313188: Most of the reply links don't work on ckbwiki - https://phabricator.wikimedia.org/T313188 [20:13:48] urbanecm: on it [20:14:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:14:29] (03Abandoned) 10Urbanecm: reindex: Detect index type from live mappings [extensions/CirrusSearch] (wmf/1.39.0-wmf.20) - 10https://gerrit.wikimedia.org/r/814770 (owner: 10Ebernhardson) [20:14:34] thanks [20:14:40] urbanecm: LGTM! feel free to sync! [20:14:42] ebernhardson: abandoned the wmf.20 version, as wmf.20 will not be deployed. [20:14:45] Jdlrobson: syncing, thanks! [20:15:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:15:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:15:55] (03CR) 10Urbanecm: [C: 03+2] "backport" [extensions/CirrusSearch] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/814771 (owner: 10Ebernhardson) [20:16:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:18:16] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 415c4ef44d9bf1abab6942fbbc552990a8e992c8: Collapse sidebar by default for anonymous users (T287609) (duration: 02m 41s) [20:18:22] T287609: Collapse sidebar by default for logged-out people - https://phabricator.wikimedia.org/T287609 [20:18:23] Jdlrobson: fyi, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/814867 looks to have an unanswered comment by cj.ming. can you check it please? 
:) [20:18:26] (patch deployed) [20:18:27] urbanecm: oh, i didn't remember that (but now that you mention it, i remember the email notice). thanks [20:18:39] urbanecm: looking [20:18:56] ah yeh that makes sense. I'll amend it now [20:19:05] zabe: just syncing yours, as there is nothing to test anyway :) [20:19:16] ok [20:19:32] thanks Jdlrobson :) [20:20:30] (03PS2) 10Jdlrobson: Enable language switching button for logged-out users on non-pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814867 (https://phabricator.wikimedia.org/T312861) [20:21:25] (03PS3) 10Urbanecm: Enable language switching button for logged-out users on non-pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814867 (https://phabricator.wikimedia.org/T312861) (owner: 10Jdlrobson) [20:21:29] (03CR) 10Urbanecm: [C: 03+2] Enable language switching button for logged-out users on non-pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814867 (https://phabricator.wikimedia.org/T312861) (owner: 10Jdlrobson) [20:21:34] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f99c5331380a8c03f4c447e2f73cb76afca337a2: Pin cu_log actor migration to old schema (T233004) (duration: 02m 41s) [20:21:38] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [20:22:34] (03CR) 10Urbanecm: [C: 03+2] Enable language switching button for logged-out users on non-pilot wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814867 (https://phabricator.wikimedia.org/T312861) (owner: 10Jdlrobson) [20:23:24] (03Merged) 10jenkins-bot: Enable language switching button for logged-out users on non-pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814867 (https://phabricator.wikimedia.org/T312861) (owner: 10Jdlrobson) [20:23:50] Jdlrobson: pulled r814867 to mwdebug1001, can you check please? 
[20:24:23] checking [20:25:29] thanks urbanecm [20:25:43] (03CR) 10Urbanecm: "I must be missing something here. This appears to be already the case at wikisources?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814868 (https://phabricator.wikimedia.org/T311607) (owner: 10Jdlrobson) [20:25:55] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2009 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [20:26:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:26:19] (03PS1) 10Andrew Bogott: acme_chief: allow access for cloudcontrol100[67] [puppet] - 10https://gerrit.wikimedia.org/r/814895 (https://phabricator.wikimedia.org/T306853) [20:27:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:27:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:27:41] urbanecm: looking into the wikisource issue.. the max width appears to be enabled on https://en.wikisource.org/wiki/Popular_Science_Monthly/Volume_31/May_1887/Megalithic_Monuments_in_Spain_and_Portugal and we want to remove it.. trying to work out why [20:28:03] (03CR) 10Andrew Bogott: [C: 03+2] acme_chief: allow access for cloudcontrol100[67] [puppet] - 10https://gerrit.wikimedia.org/r/814895 (https://phabricator.wikimedia.org/T306853) (owner: 10Andrew Bogott) [20:28:13] Jdlrobson: does that mean r814867 can be synced? or are you still testing? 
:) [20:28:17] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1009 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [20:28:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:28:41] 814867 can be synced [20:28:53] syncing [20:30:16] (03CR) 10Jdlrobson: [C: 04-1] "Fixing. Config name was unintuitive :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814868 (https://phabricator.wikimedia.org/T311607) (owner: 10Jdlrobson) [20:30:49] looks like you figured the wikisources out, lmk if i can help with that. [20:31:27] PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:31:37] (03PS2) 10Jdlrobson: Turn off fixed width in main namespace on Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814868 (https://phabricator.wikimedia.org/T311607) [20:31:57] ok urbanecm thanks for the eagle eyes on that patch :) saved us a few minutes of "why is this not working?" 
:) [20:32:04] (03CR) 10CI reject: [V: 04-1] Turn off fixed width in main namespace on Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814868 (https://phabricator.wikimedia.org/T311607) (owner: 10Jdlrobson) [20:32:13] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1c258b25e8a47caf9d531f01798d32cd3f9b1605: Enable language switching button for logged-out users on non-pilot wikis (T312861) (duration: 02m 43s) [20:32:15] (03PS3) 10Jdlrobson: Turn off fixed width in main namespace on Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814868 (https://phabricator.wikimedia.org/T311607) [20:32:18] T312861: Enable language switching button for logged-out users on non-pilot wikis - https://phabricator.wikimedia.org/T312861 [20:32:22] Jdlrobson: always happy to help! [20:33:24] (03CR) 10Urbanecm: [C: 03+2] Turn off fixed width in main namespace on Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814868 (https://phabricator.wikimedia.org/T311607) (owner: 10Jdlrobson) [20:33:29] let's try it out [20:34:29] (03Merged) 10jenkins-bot: Turn off fixed width in main namespace on Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814868 (https://phabricator.wikimedia.org/T311607) (owner: 10Jdlrobson) [20:34:54] Jdlrobson: pulled to mwdebug1001, can you test? [20:35:05] urbanecm: testing [20:35:27] urbanecm: yep that looks like it worked! [20:35:49] (03Merged) 10jenkins-bot: reindex: Detect index type from live mappings [extensions/CirrusSearch] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/814771 (owner: 10Ebernhardson) [20:36:08] Please sync [20:36:42] great! 
syncing [20:37:19] ebernhardson: pulled your patch to mwdebug1001, if it's testable there [20:37:25] RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:37:48] urbanecm: yup, i can reindex testwiki. will take a couple minutes to run [20:38:04] ebernhardson: does that work with debug servers? [20:38:14] urbanecm: yea, it just sends some requests to elasticsearch and then waits [20:38:19] urbanecm: thanks for all the help today! [20:38:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:39:11] ebernhardson: i see. I'll leave it up to you: if you think the test's useful, feel free to do it, otherwise, i can sync it directly too. [20:39:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:39:28] urbanecm: it's running now, which means it already got past the point that it used to fail and the patch should work [20:39:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:39:58] sounds great! i'll sync it :) [20:40:24] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8d1663c93d2ddeb107d5f9b8982a7f4a7b880aba: Turn off fixed width in main namespace on Wikisource (T311607) (duration: 02m 41s) [20:40:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:40:27] T311607: Turn off fixed width in main namespace on Wikisource - https://phabricator.wikimedia.org/T311607 [20:40:45] Jdlrobson: should be live! 
[20:45:13] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/CirrusSearch/: 930ecb76a5a9266d498f40b49ab5ff82c01dbcf5: reindex: Detect index type from live mappings (duration: 02m 55s) [20:45:23] ebernhardson: and, should be live [20:45:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:45:36] urbanecm: thanks! [20:45:41] np [20:45:53] !log UTC late B&C window finished [20:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:46:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:47:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:49:54] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (Andrew) [20:50:59] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (Andrew) These hosts are now in service and seem to be working. [20:58:05] !log start reindex of all wikis except commonswiki and wikidatawiki in eqiad and codfw cirrus clusters [20:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:05] Reedy, sbassett, Maryum, and manfredi: Dear deployers, time to do the Weekly Security deployment window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220718T2100). 
[21:01:49] (CR) Jdlrobson: Deploy the new grid layout (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) (owner: Jdlrobson) [21:28:30] Hey all - would like to quickly deploy a sec patch for T309894. Let me know if I should wait. [21:31:28] jouncebot: now [21:31:28] For the next 1 hour(s) and 28 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220718T2100) [21:31:54] sbassett: ^ looks like you already have the conch :) [21:32:16] bd808: yep, I just always like to double-check in case someone is fighting a fire :) [21:34:00] no fires yet, have at it :D [21:36:02] !log Deployed security fix for T309894 [21:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:45:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:45:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:46:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:01:19] (PS2) Jdlrobson: Deploy the new grid layout to group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) [22:01:21] (PS1) Jdlrobson: Deploy the new grid layout to group 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241) [22:01:23] (PS1) Jdlrobson: Deploy grid to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/814907 (https://phabricator.wikimedia.org/T312241) [22:01:42] (CR) CI reject: [V: -1] Deploy the new grid layout to group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) (owner: Jdlrobson) [22:01:57] (CR) CI reject: [V: -1] 
Deploy the new grid layout to group 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241) (owner: Jdlrobson) [22:02:05] (PS1) Ebernhardson: cirrus: Dont recycle completion suggester indices [mediawiki-config] - https://gerrit.wikimedia.org/r/814908 [22:02:07] (PS1) Ebernhardson: Revert "cirrus: Dont recycle completion suggester indices" [mediawiki-config] - https://gerrit.wikimedia.org/r/814909 [22:02:15] (CR) CI reject: [V: -1] Deploy grid to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/814907 (https://phabricator.wikimedia.org/T312241) (owner: Jdlrobson) [22:05:34] (PS3) Jdlrobson: Deploy the new grid layout to group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) [22:05:54] (CR) CI reject: [V: -1] Deploy the new grid layout to group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) (owner: Jdlrobson) [22:06:01] (PS4) Jdlrobson: Deploy the new grid layout to group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) [22:06:09] (PS5) Jdlrobson: Deploy the new grid layout to group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) [22:06:21] (PS2) Jdlrobson: Deploy the new grid layout to group 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241) [22:06:25] (PS2) Jdlrobson: Deploy grid to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/814907 (https://phabricator.wikimedia.org/T312241) [22:06:43] (PS3) Jdlrobson: Deploy grid to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/814907 (https://phabricator.wikimedia.org/T312241) [22:06:58] (PS4) Jdlrobson: Deploy grid to all wikis [mediawiki-config] - 
https://gerrit.wikimedia.org/r/814907 (https://phabricator.wikimedia.org/T312241) [22:07:33] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:09:52] (CR) Jdlrobson: Deploy the new grid layout to group 0 wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) (owner: Jdlrobson) [22:13:27] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:26:10] (PS1) Andrew Bogott: Install nova on new cloudvirt hosts [puppet] - https://gerrit.wikimedia.org/r/814911 (https://phabricator.wikimedia.org/T305194) [22:26:17] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:32:19] (CR) Cwhite: [C: +2] hiera: deploy and enable loki on grafana hosts [puppet] - https://gerrit.wikimedia.org/r/813724 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite) [22:32:30] (CR) Andrew Bogott: [C: +2] Install nova on new cloudvirt hosts [puppet] - https://gerrit.wikimedia.org/r/814911 (https://phabricator.wikimedia.org/T305194) (owner: Andrew Bogott) [22:33:19] andrewbogott: merged yours as well [22:33:28] thank you! [22:41:49] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [22:42:23] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [22:56:57] (PS1) Andrew Bogott: openstack::nova::compute::service: don't add 'nova' user to libvirt group [puppet] - https://gerrit.wikimedia.org/r/814913 (https://phabricator.wikimedia.org/T309342) [23:07:01] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1049.eqiad.wmnet [23:07:45] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:13:51] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:17:55] PROBLEM - ensure kvm processes are running on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:18:52] (PS1) Cwhite: logstash: enable loki public output on production [puppet] - https://gerrit.wikimedia.org/r/814915 (https://phabricator.wikimedia.org/T222826) [23:19:25] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudvirt1049.eqiad.wmnet [23:19:55] PROBLEM - ensure kvm processes are running on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:22:49] (CR) Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/pcc-worker1003/36284/" [puppet] - https://gerrit.wikimedia.org/r/814915 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite) [23:23:00] PROBLEM - ensure kvm processes are running on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:27:25] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:28:05] PROBLEM - ensure kvm processes are running on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:28:26] (PS2) Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814807 [23:29:20] (CR) Hashar: "And on commenting on a change I get in the error log:" [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814807 (owner: Hashar) [23:29:41] PROBLEM - ensure kvm processes are running on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:35:22] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott new hosts, in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:35:22] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott new hosts, in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:35:23] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott new hosts, in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:35:24] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott new 
hosts, in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:35:25] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott new hosts, in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:46:41] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1050.eqiad.wmnet [23:50:23] PROBLEM - ensure kvm processes are running on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:53:29] RECOVERY - ensure kvm processes are running on cloudvirt1048 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:56:33] RECOVERY - ensure kvm processes are running on cloudvirt1049 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:56:53] PROBLEM - ensure kvm processes are running on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:57:27] RECOVERY - ensure kvm processes are running on cloudvirt1050 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:58:05] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1050.eqiad.wmnet