[00:01:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T322618)', diff saved to https://phabricator.wikimedia.org/P39144 and previous config saved to /var/cache/conftool/dbconfig/20221111-000118-ladsgroup.json [00:01:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance [00:01:23] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [00:01:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance [00:01:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [00:02:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [00:02:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T322618)', diff saved to https://phabricator.wikimedia.org/P39145 and previous config saved to /var/cache/conftool/dbconfig/20221111-000206-ladsgroup.json [00:04:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T322618)', diff saved to https://phabricator.wikimedia.org/P39146 and previous config saved to /var/cache/conftool/dbconfig/20221111-000425-ladsgroup.json [00:10:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321123)', diff saved to https://phabricator.wikimedia.org/P39147 and previous config saved to /var/cache/conftool/dbconfig/20221111-001056-marostegui.json [00:10:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance [00:11:01] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [00:11:22] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance [00:11:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance [00:11:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance [00:11:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T321123)', diff saved to https://phabricator.wikimedia.org/P39148 and previous config saved to /var/cache/conftool/dbconfig/20221111-001156-marostegui.json [00:14:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T321123)', diff saved to https://phabricator.wikimedia.org/P39149 and previous config saved to /var/cache/conftool/dbconfig/20221111-001406-marostegui.json [00:15:43] SRE, ops-eqiad, decommission-hardware, serviceops-radar: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Jclark-ctr) [00:16:03] SRE, ops-eqiad, decommission-hardware, serviceops-radar: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Jclark-ctr) In progress→Resolved [00:19:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P39150 and previous config saved to /var/cache/conftool/dbconfig/20221111-001932-ladsgroup.json [00:29:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P39151 and previous config saved to /var/cache/conftool/dbconfig/20221111-002913-marostegui.json [00:31:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [00:31:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with 
reason: Maintenance [00:31:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P39152 and previous config saved to /var/cache/conftool/dbconfig/20221111-003141-ladsgroup.json [00:31:46] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [00:34:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P39153 and previous config saved to /var/cache/conftool/dbconfig/20221111-003438-ladsgroup.json [00:38:36] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [00:38:37] (PS8) Andrew Bogott: wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: David Caro) [00:38:39] (PS8) Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 [00:41:47] (CR) Andrew Bogott: Add cookbook to restart openstack services (2 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 (owner: Andrew Bogott) [00:41:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:42:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [00:43:19] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [00:43:51] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for 
host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [00:44:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P39154 and previous config saved to /var/cache/conftool/dbconfig/20221111-004419-marostegui.json [00:44:30] (PS9) Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 [00:45:09] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [00:46:45] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:47:01] (CR) CI reject: [V: -1] Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 (owner: Andrew Bogott) [00:47:07] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:47:25] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:48] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (Jclark-ctr) [00:48:03] (CR) CI reject: [V: -1] Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 (owner: Andrew Bogott) [00:49:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T322618)', diff 
saved to https://phabricator.wikimedia.org/P39155 and previous config saved to /var/cache/conftool/dbconfig/20221111-004945-ladsgroup.json [00:49:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [00:49:49] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [00:50:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [00:50:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P39156 and previous config saved to /var/cache/conftool/dbconfig/20221111-005017-ladsgroup.json [00:50:44] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [00:52:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P39157 and previous config saved to /var/cache/conftool/dbconfig/20221111-005237-ladsgroup.json [00:59:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T321123)', diff saved to https://phabricator.wikimedia.org/P39158 and previous config saved to /var/cache/conftool/dbconfig/20221111-005925-marostegui.json [00:59:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance [00:59:31] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [00:59:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance [00:59:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T321123)', diff saved to https://phabricator.wikimedia.org/P39159 and previous 
config saved to /var/cache/conftool/dbconfig/20221111-005947-marostegui.json [01:01:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321123)', diff saved to https://phabricator.wikimedia.org/P39160 and previous config saved to /var/cache/conftool/dbconfig/20221111-010156-marostegui.json [01:07:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P39161 and previous config saved to /var/cache/conftool/dbconfig/20221111-010743-ladsgroup.json [01:17:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P39162 and previous config saved to /var/cache/conftool/dbconfig/20221111-011703-marostegui.json [01:22:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P39163 and previous config saved to /var/cache/conftool/dbconfig/20221111-012250-ladsgroup.json [01:31:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P39164 and previous config saved to /var/cache/conftool/dbconfig/20221111-013157-ladsgroup.json [01:32:03] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [01:32:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P39165 and previous config saved to /var/cache/conftool/dbconfig/20221111-013209-marostegui.json [01:37:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P39166 and previous config saved to /var/cache/conftool/dbconfig/20221111-013756-ladsgroup.json [01:37:58] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [01:38:01] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [01:38:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [01:38:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P39167 and previous config saved to /var/cache/conftool/dbconfig/20221111-013818-ladsgroup.json [01:38:52] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:40:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P39168 and previous config saved to /var/cache/conftool/dbconfig/20221111-014037-ladsgroup.json [01:47:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P39169 and previous config saved to /var/cache/conftool/dbconfig/20221111-014704-ladsgroup.json [01:47:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T318605)', diff saved to https://phabricator.wikimedia.org/P39170 and previous config saved to /var/cache/conftool/dbconfig/20221111-014712-ladsgroup.json [01:47:16] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [01:47:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321123)', diff saved to https://phabricator.wikimedia.org/P39171 and previous config saved to /var/cache/conftool/dbconfig/20221111-014722-marostegui.json 
[01:47:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance [01:47:27] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [01:47:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance [01:47:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T321123)', diff saved to https://phabricator.wikimedia.org/P39172 and previous config saved to /var/cache/conftool/dbconfig/20221111-014744-marostegui.json [01:48:52] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:49:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321123)', diff saved to https://phabricator.wikimedia.org/P39173 and previous config saved to /var/cache/conftool/dbconfig/20221111-014953-marostegui.json [01:53:52] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P39174 and previous config saved to /var/cache/conftool/dbconfig/20221111-015544-ladsgroup.json [02:02:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P39175 and previous config saved to /var/cache/conftool/dbconfig/20221111-020211-ladsgroup.json [02:02:19] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P39176 and previous config saved to /var/cache/conftool/dbconfig/20221111-020218-ladsgroup.json [02:05:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P39177 and previous config saved to /var/cache/conftool/dbconfig/20221111-020500-marostegui.json [02:08:52] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P39178 and previous config saved to /var/cache/conftool/dbconfig/20221111-021051-ladsgroup.json [02:14:20] (PS10) Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 [02:17:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P39179 and previous config saved to /var/cache/conftool/dbconfig/20221111-021717-ladsgroup.json [02:17:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [02:17:22] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [02:17:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P39180 and previous config saved to /var/cache/conftool/dbconfig/20221111-021725-ladsgroup.json [02:17:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 
on db1147.eqiad.wmnet with reason: Maintenance [02:17:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T318605)', diff saved to https://phabricator.wikimedia.org/P39181 and previous config saved to /var/cache/conftool/dbconfig/20221111-021738-ladsgroup.json [02:18:52] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P39182 and previous config saved to /var/cache/conftool/dbconfig/20221111-022006-marostegui.json [02:20:58] (CR) CI reject: [V: -1] Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 (owner: Andrew Bogott) [02:25:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P39183 and previous config saved to /var/cache/conftool/dbconfig/20221111-022557-ladsgroup.json [02:25:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [02:26:02] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [02:26:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [02:26:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T322618)', diff saved to https://phabricator.wikimedia.org/P39184 and previous config saved to /var/cache/conftool/dbconfig/20221111-022619-ladsgroup.json [02:28:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T322618)', diff 
saved to https://phabricator.wikimedia.org/P39185 and previous config saved to /var/cache/conftool/dbconfig/20221111-022838-ladsgroup.json [02:32:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T318605)', diff saved to https://phabricator.wikimedia.org/P39186 and previous config saved to /var/cache/conftool/dbconfig/20221111-023231-ladsgroup.json [02:32:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [02:32:36] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [02:32:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [02:32:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P39187 and previous config saved to /var/cache/conftool/dbconfig/20221111-023252-ladsgroup.json [02:35:08] (PS11) Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 [02:35:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321123)', diff saved to https://phabricator.wikimedia.org/P39188 and previous config saved to /var/cache/conftool/dbconfig/20221111-023513-marostegui.json [02:35:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance [02:35:18] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [02:35:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance [02:35:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T321123)', diff saved to https://phabricator.wikimedia.org/P39189 and previous 
config saved to /var/cache/conftool/dbconfig/20221111-023534-marostegui.json [02:36:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321123)', diff saved to https://phabricator.wikimedia.org/P39190 and previous config saved to /var/cache/conftool/dbconfig/20221111-023643-marostegui.json [02:43:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P39191 and previous config saved to /var/cache/conftool/dbconfig/20221111-024345-ladsgroup.json [02:51:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P39192 and previous config saved to /var/cache/conftool/dbconfig/20221111-025150-marostegui.json [02:58:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P39193 and previous config saved to /var/cache/conftool/dbconfig/20221111-025851-ladsgroup.json [03:06:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P39194 and previous config saved to /var/cache/conftool/dbconfig/20221111-030656-marostegui.json [03:13:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T322618)', diff saved to https://phabricator.wikimedia.org/P39195 and previous config saved to /var/cache/conftool/dbconfig/20221111-031358-ladsgroup.json [03:14:03] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [03:14:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:15:29] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:19:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:22:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321123)', diff saved to https://phabricator.wikimedia.org/P39196 and previous config saved to /var/cache/conftool/dbconfig/20221111-032203-marostegui.json [03:22:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [03:22:09] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [03:22:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [03:22:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T321123)', diff saved to https://phabricator.wikimedia.org/P39197 and previous config saved to /var/cache/conftool/dbconfig/20221111-032224-marostegui.json [03:22:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:24:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321123)', diff saved to https://phabricator.wikimedia.org/P39198 and previous config saved to /var/cache/conftool/dbconfig/20221111-032434-marostegui.json [03:24:44] Puppet, Beta-Cluster-Infrastructure, Infrastructure-Foundations: Beta mwmaint puppet runs fail with "Resource type not found: Profile::Lvs::Classes" - https://phabricator.wikimedia.org/T322901 (Tgr) [03:39:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P39199 and previous config saved to 
/var/cache/conftool/dbconfig/20221111-033940-marostegui.json [03:44:55] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 112 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:46:53] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:51:12] Puppet, Infrastructure-Foundations, Beta-Cluster-reproducible: Beta mwmaint puppet runs fail with "Resource type not found: Profile::Lvs::Classes" - https://phabricator.wikimedia.org/T322901 (Tgr) `modules/profile/types/lvs/classes.pp` is physically not present on deployment-puppetmaster04. Which wou... [03:52:47] ^ seems to be a production puppet bug, could someone more familiar with the codebase look into it? 
[03:54:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P39200 and previous config saved to /var/cache/conftool/dbconfig/20221111-035447-marostegui.json [04:03:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:08:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:09:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321123)', diff saved to https://phabricator.wikimedia.org/P39201 and previous config saved to /var/cache/conftool/dbconfig/20221111-040953-marostegui.json [04:09:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance [04:09:59] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [04:10:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance [04:10:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [04:10:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [04:10:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T321123)', diff saved to 
https://phabricator.wikimedia.org/P39202 and previous config saved to /var/cache/conftool/dbconfig/20221111-041030-marostegui.json [04:11:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T321123)', diff saved to https://phabricator.wikimedia.org/P39203 and previous config saved to /var/cache/conftool/dbconfig/20221111-041139-marostegui.json [04:26:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P39204 and previous config saved to /var/cache/conftool/dbconfig/20221111-042646-marostegui.json [04:31:05] (PS4) Andrea Denisse: netmon: Open LibreNMS port for netmon2002. [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) [04:31:45] (CR) CI reject: [V: -1] netmon: Open LibreNMS port for netmon2002. [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [04:41:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P39205 and previous config saved to /var/cache/conftool/dbconfig/20221111-044152-marostegui.json [04:44:26] (PS5) Andrea Denisse: netmon: Open LibreNMS port for netmon2002. [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) [04:46:54] (CR) CI reject: [V: -1] netmon: Open LibreNMS port for netmon2002. 
[puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [04:56:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T321123)', diff saved to https://phabricator.wikimedia.org/P39206 and previous config saved to /var/cache/conftool/dbconfig/20221111-045659-marostegui.json [04:57:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance [04:57:04] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [04:57:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance [04:57:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T321123)', diff saved to https://phabricator.wikimedia.org/P39207 and previous config saved to /var/cache/conftool/dbconfig/20221111-045720-marostegui.json [04:59:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T321123)', diff saved to https://phabricator.wikimedia.org/P39208 and previous config saved to /var/cache/conftool/dbconfig/20221111-045930-marostegui.json [05:14:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P39209 and previous config saved to /var/cache/conftool/dbconfig/20221111-051436-marostegui.json [05:29:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P39210 and previous config saved to /var/cache/conftool/dbconfig/20221111-052943-marostegui.json [05:44:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T321123)', diff saved to https://phabricator.wikimedia.org/P39211 and previous config saved to 
/var/cache/conftool/dbconfig/20221111-054449-marostegui.json [05:44:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance [05:44:55] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [05:45:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance [05:45:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T321123)', diff saved to https://phabricator.wikimedia.org/P39212 and previous config saved to /var/cache/conftool/dbconfig/20221111-054511-marostegui.json [05:47:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T321123)', diff saved to https://phabricator.wikimedia.org/P39213 and previous config saved to /var/cache/conftool/dbconfig/20221111-054720-marostegui.json [05:56:08] (03CR) 10Vgutierrez: Varnish analytics: support differential privacy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [06:02:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P39214 and previous config saved to /var/cache/conftool/dbconfig/20221111-060227-marostegui.json [06:02:39] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:02:57] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:17:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P39215 and previous config saved to /var/cache/conftool/dbconfig/20221111-061733-marostegui.json [06:22:02] !log 
restart varnish on cp4047 to clear VarnishChildRestarted alert - T322903 [06:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:07] T322903: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 [06:32:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T321123)', diff saved to https://phabricator.wikimedia.org/P39216 and previous config saved to /var/cache/conftool/dbconfig/20221111-063240-marostegui.json [06:32:45] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [06:57:44] <_joe_> uhm why is it still saying XioNoX even if it's me? [06:58:13] <_joe_> because sirenbot crashed I guess [06:59:44] <_joe_> ah no, it reconnected and wasn't automatically made operator like it should, sigh [07:11:33] 10SRE, 10Traffic: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (10Vgutierrez) Free memory on NUMA Node 0 got below the min threshold (1028416 < 1041448): `Node 0 Normal free:1028416kB min:1041448kB low:1303560kB high:1565672kB reserved_highatomic:2048KB active_anon:1800292kB inactiv... 
[07:12:25] 10SRE, 10Traffic: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (10Vgutierrez) [07:21:49] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:22:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:50:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T318605)', diff saved to https://phabricator.wikimedia.org/P39217 and previous config saved to /var/cache/conftool/dbconfig/20221111-075028-ladsgroup.json [07:50:34] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221111T0800) [08:05:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P39218 and previous config saved to /var/cache/conftool/dbconfig/20221111-080536-ladsgroup.json [08:09:27] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1020.eqiad.wmnet with reason: Remove from cluster for eventual reimage [08:09:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1020.eqiad.wmnet with reason: Remove from cluster for eventual reimage [08:14:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1020.eqiad.wmnet with OS bullseye [08:14:19] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1020.eqiad.wmnet with OS bullseye [08:20:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P39219 and previous config saved to /var/cache/conftool/dbconfig/20221111-082042-ladsgroup.json [08:28:25] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1020.eqiad.wmnet with reason: host reimage [08:32:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1020.eqiad.wmnet with reason: host reimage [08:35:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T318605)', diff saved to https://phabricator.wikimedia.org/P39220 and previous config saved to /var/cache/conftool/dbconfig/20221111-083549-ladsgroup.json [08:35:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [08:35:54] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [08:36:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [08:36:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T318605)', diff saved to https://phabricator.wikimedia.org/P39221 and previous config saved to /var/cache/conftool/dbconfig/20221111-083611-ladsgroup.json [08:39:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P39222 and previous config saved to /var/cache/conftool/dbconfig/20221111-083922-ladsgroup.json [08:49:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1020.eqiad.wmnet with OS bullseye [08:49:19] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1020.eqiad.wmnet with OS 
bullseye completed: - ganeti1020 (**PASS**) - Downtimed on... [08:54:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P39223 and previous config saved to /var/cache/conftool/dbconfig/20221111-085428-ladsgroup.json [08:55:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1020.eqiad.wmnet [09:01:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance [09:02:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance [09:02:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [09:02:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [09:03:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1020.eqiad.wmnet [09:03:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2113.codfw.wmnet with reason: Maintenance [09:04:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2113.codfw.wmnet with reason: Maintenance [09:06:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1020.eqiad.wmnet to cluster eqiad and group D [09:06:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1130.eqiad.wmnet with reason: Maintenance [09:06:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1130.eqiad.wmnet with reason: Maintenance [09:07:04] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1020.eqiad.wmnet to cluster eqiad and group D [09:08:26] !log marostegui@cumin1001 START - 
Cookbook sre.hosts.downtime for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:08:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:08:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T321130)', diff saved to https://phabricator.wikimedia.org/P39224 and previous config saved to /var/cache/conftool/dbconfig/20221111-090846-marostegui.json [09:08:50] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [09:09:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P39225 and previous config saved to /var/cache/conftool/dbconfig/20221111-090935-ladsgroup.json [09:10:17] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Fix rule violation in the lvs balancer role - https://phabricator.wikimedia.org/T264132 (10jbond) it would be useful to understand why these changes were reverted to avoid issues in the future. 
[09:15:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T321130)', diff saved to https://phabricator.wikimedia.org/P39226 and previous config saved to /var/cache/conftool/dbconfig/20221111-091514-marostegui.json [09:15:19] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [09:16:50] (03PS1) 10Marostegui: add_cul_actor_T321126.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/855959 (https://phabricator.wikimedia.org/T321126) [09:24:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P39227 and previous config saved to /var/cache/conftool/dbconfig/20221111-092441-ladsgroup.json [09:24:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [09:24:46] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [09:24:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [09:25:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P39228 and previous config saved to /var/cache/conftool/dbconfig/20221111-092503-ladsgroup.json [09:30:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P39229 and previous config saved to /var/cache/conftool/dbconfig/20221111-093020-marostegui.json [09:32:04] (03PS1) 10JMeybohm: Update to v1.23.14 [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/855961 (https://phabricator.wikimedia.org/T307943) [09:34:01] (03CR) 10JMeybohm: [C: 03+2] Update to v1.23.14 [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/855961 
(https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:35:15] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons. [09:39:47] (03PS1) 10Vgutierrez: hieradata: unify ulsfo definitions [puppet] - 10https://gerrit.wikimedia.org/r/855962 (https://phabricator.wikimedia.org/T317244) [09:40:10] (03PS2) 10Vgutierrez: hieradata: unify cp@ulsfo definitions [puppet] - 10https://gerrit.wikimedia.org/r/855962 (https://phabricator.wikimedia.org/T317244) [09:40:53] (03CR) 10CI reject: [V: 04-1] hieradata: unify cp@ulsfo definitions [puppet] - 10https://gerrit.wikimedia.org/r/855962 (https://phabricator.wikimedia.org/T317244) (owner: 10Vgutierrez) [09:41:40] (03Abandoned) 10Phuedx: wgWMESchemaEditAttemptStepSamplingRate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854006 (https://phabricator.wikimedia.org/T312016) (owner: 10Phuedx) [09:43:06] (03PS3) 10Vgutierrez: hieradata: unify cp@ulsfo definitions [puppet] - 10https://gerrit.wikimedia.org/r/855962 (https://phabricator.wikimedia.org/T317244) [09:45:25] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10Gehel) >>! In T320482#8385142, @RKemper wrote: > @Papaul Yup per jbond's comment above we're still seeing the RAID issue. Could we try either rebuilding raid with the current dis... 
[09:45:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P39230 and previous config saved to /var/cache/conftool/dbconfig/20221111-094526-marostegui.json [09:45:56] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2002-dev.codfw.wmnet with OS bullseye [09:46:07] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudvirt2002-dev.codfw.wmnet wi... [09:47:24] (03PS4) 10Vgutierrez: hieradata: unify cp@ulsfo definitions [puppet] - 10https://gerrit.wikimedia.org/r/855962 (https://phabricator.wikimedia.org/T317244) [09:53:31] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 15): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38103/console" [puppet] - 10https://gerrit.wikimedia.org/r/855962 (https://phabricator.wikimedia.org/T317244) (owner: 10Vgutierrez) [09:54:22] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hieradata: unify cp@ulsfo definitions [puppet] - 10https://gerrit.wikimedia.org/r/855962 (https://phabricator.wikimedia.org/T317244) (owner: 10Vgutierrez) [09:54:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons. [09:55:13] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons. 
[09:57:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: cleanup SAL log messages (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855650 (owner: 10Arturo Borrero Gonzalez) [10:00:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T321130)', diff saved to https://phabricator.wikimedia.org/P39231 and previous config saved to /var/cache/conftool/dbconfig/20221111-100033-marostegui.json [10:00:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance [10:00:38] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:00:44] (03Merged) 10jenkins-bot: wmcs: cleanup SAL log messages [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855650 (owner: 10Arturo Borrero Gonzalez) [10:00:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance [10:00:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T321130)', diff saved to https://phabricator.wikimedia.org/P39232 and previous config saved to /var/cache/conftool/dbconfig/20221111-100054-marostegui.json [10:00:59] (03PS1) 10Vgutierrez: hieradata: clean up unused esams role cache::(text|upload) definitions [puppet] - 10https://gerrit.wikimedia.org/r/855964 [10:01:23] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [10:01:52] (03CR) 10Vgutierrez: [C: 03+2] hieradata: clean up unused esams role cache::(text|upload) definitions [puppet] - 10https://gerrit.wikimedia.org/r/855964 (owner: 10Vgutierrez) [10:07:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T321130)', diff saved to 
https://phabricator.wikimedia.org/P39233 and previous config saved to /var/cache/conftool/dbconfig/20221111-100725-marostegui.json [10:07:29] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:09:00] (03PS1) 10Vgutierrez: varnish: Increase reserved memory to 120G in upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/855965 (https://phabricator.wikimedia.org/T322903) [10:12:01] (03CR) 10Vgutierrez: "Please see https://phabricator.wikimedia.org/T322903" [puppet] - 10https://gerrit.wikimedia.org/r/849633 (owner: 10BBlack) [10:13:25] (03CR) 10David Caro: [C: 03+1] wmcs: cleanup SAL log messages (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855650 (owner: 10Arturo Borrero Gonzalez) [10:14:50] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38104/console" [puppet] - 10https://gerrit.wikimedia.org/r/855965 (https://phabricator.wikimedia.org/T322903) (owner: 10Vgutierrez) [10:15:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons. 
[10:18:57] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [10:22:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P39234 and previous config saved to /var/cache/conftool/dbconfig/20221111-102231-marostegui.json [10:22:47] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [10:26:18] (03PS2) 10Giuseppe Lavagetto: Add rake task to perform basic conversions [deployment-charts] - 10https://gerrit.wikimedia.org/r/855668 [10:29:22] RECOVERY - Host lvs1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.87 ms [10:35:16] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt200[123]-dev: use standard partman recipes for raid1 on 2 devices [puppet] - 10https://gerrit.wikimedia.org/r/855966 (https://phabricator.wikimedia.org/T322911) [10:36:31] (03CR) 10Arturo Borrero Gonzalez: "@andrew this is for you to consider. It is 100% untested." 
[puppet] - 10https://gerrit.wikimedia.org/r/855966 (https://phabricator.wikimedia.org/T322911) (owner: 10Arturo Borrero Gonzalez) [10:37:01] (03PS1) 10Elukey: istio: change configs to adapt for 1.15.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/855967 (https://phabricator.wikimedia.org/T322193) [10:37:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P39235 and previous config saved to /var/cache/conftool/dbconfig/20221111-103738-marostegui.json [10:37:54] PROBLEM - IPMI Sensor Status on restbase1018 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [10:40:16] (03PS2) 10Arturo Borrero Gonzalez: cloudvirt200[123]-dev: use standard partman recipes for raid1 on 2 devices [puppet] - 10https://gerrit.wikimedia.org/r/855966 (https://phabricator.wikimedia.org/T322911) [10:40:43] 10SRE, 10Traffic, 10Patch-For-Review: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (10Vgutierrez) After further inspection I don't think that ATS memory increase is enough to explain what we are seeing here, text nodes in ulsfo are using around 326G of RAM but upload ones are usin... 
[10:42:03] (03CR) 10CI reject: [V: 04-1] istio: change configs to adapt for 1.15.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/855967 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [10:42:54] (03PS3) 10Arturo Borrero Gonzalez: cloudvirt200[123]-dev: use standard partman recipes for raid1 on 2 devices [puppet] - 10https://gerrit.wikimedia.org/r/855966 (https://phabricator.wikimedia.org/T322911) [10:45:13] (03CR) 10JMeybohm: "I would suggest to add a comment to "deleting" spec.strategy (maybe even linking to https://istio.io/latest/docs/reference/config/istio.op" [deployment-charts] - 10https://gerrit.wikimedia.org/r/855967 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [10:48:54] (03PS1) 10JMeybohm: k8s: Stop docker/runc spam from being written to syslog [puppet] - 10https://gerrit.wikimedia.org/r/855969 (https://phabricator.wikimedia.org/T307943) [10:49:20] (03PS2) 10JMeybohm: k8s: Stop docker/runc spam from being written to syslog [puppet] - 10https://gerrit.wikimedia.org/r/855969 (https://phabricator.wikimedia.org/T307943) [10:52:31] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2002-dev.codfw.wmnet with OS bullseye [10:52:40] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudvirt2002-dev.codfw.wmnet with O... 
[10:52:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T321130)', diff saved to https://phabricator.wikimedia.org/P39236 and previous config saved to /var/cache/conftool/dbconfig/20221111-105244-marostegui.json [10:52:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1127.eqiad.wmnet with reason: Maintenance [10:52:50] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:52:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1127.eqiad.wmnet with reason: Maintenance [10:53:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T321130)', diff saved to https://phabricator.wikimedia.org/P39237 and previous config saved to /var/cache/conftool/dbconfig/20221111-105305-marostegui.json [10:53:45] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38105/console" [puppet] - 10https://gerrit.wikimedia.org/r/855969 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:54:38] (03PS1) 10Arturo Borrero Gonzalez: prometheus: drop cloudvirt ceph metrics generator [puppet] - 10https://gerrit.wikimedia.org/r/855970 (https://phabricator.wikimedia.org/T271096) [10:56:47] (03CR) 10CI reject: [V: 04-1] prometheus: drop cloudvirt ceph metrics generator [puppet] - 10https://gerrit.wikimedia.org/r/855970 (https://phabricator.wikimedia.org/T271096) (owner: 10Arturo Borrero Gonzalez) [10:56:49] 10SRE, 10Traffic, 10Patch-For-Review: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (10Vgutierrez) In fact it seems like varnish is the one eating the extra memory... in cp4045 (upload) with the following malloc specific config: `-s malloc,283G -s Transient=malloc,10G` varnish is c... 
[10:59:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T321130)', diff saved to https://phabricator.wikimedia.org/P39238 and previous config saved to /var/cache/conftool/dbconfig/20221111-105918-marostegui.json [10:59:23] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:59:39] (03PS2) 10Arturo Borrero Gonzalez: prometheus: drop cloudvirt ceph metrics generator [puppet] - 10https://gerrit.wikimedia.org/r/855970 (https://phabricator.wikimedia.org/T271096) [11:02:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:03:07] !log installing wireshark security updates [11:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:08:18] (03PS2) 10Elukey: istio: change configs to adapt for 1.15.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/855967 (https://phabricator.wikimedia.org/T322193) [11:14:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P39239 and previous config saved to /var/cache/conftool/dbconfig/20221111-111424-marostegui.json [11:18:24] (03PS1) 10Muehlenhoff: Add ganeti1033 [puppet] - 10https://gerrit.wikimedia.org/r/855973 (https://phabricator.wikimedia.org/T314303) [11:20:37] (03CR) 10Muehlenhoff: [C: 03+2] 
Add ganeti1033 [puppet] - 10https://gerrit.wikimedia.org/r/855973 (https://phabricator.wikimedia.org/T314303) (owner: 10Muehlenhoff) [11:25:58] (03CR) 10Hnowlan: Decode poolcounter messages, fix 429 error (033 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [11:28:43] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/pcc-worker1003/38108/" [puppet] - 10https://gerrit.wikimedia.org/r/855970 (https://phabricator.wikimedia.org/T271096) (owner: 10Arturo Borrero Gonzalez) [11:29:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P39240 and previous config saved to /var/cache/conftool/dbconfig/20221111-112931-marostegui.json [11:40:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/855966 (https://phabricator.wikimedia.org/T322911) (owner: 10Arturo Borrero Gonzalez) [11:40:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt200[123]-dev: use standard partman recipes for raid1 on 2 devices [puppet] - 10https://gerrit.wikimedia.org/r/855966 (https://phabricator.wikimedia.org/T322911) (owner: 10Arturo Borrero Gonzalez) [11:41:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt2002-dev: move to a single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/855042 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:42:36] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2002-dev.codfw.wmnet with OS bullseye [11:42:45] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudvirt2002-dev.codfw.wmnet wi... 
[11:44:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T321130)', diff saved to https://phabricator.wikimedia.org/P39241 and previous config saved to /var/cache/conftool/dbconfig/20221111-114437-marostegui.json [11:44:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1136.eqiad.wmnet with reason: Maintenance [11:44:43] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [11:44:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1136.eqiad.wmnet with reason: Maintenance [11:44:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T321130)', diff saved to https://phabricator.wikimedia.org/P39242 and previous config saved to /var/cache/conftool/dbconfig/20221111-114458-marostegui.json [11:45:01] (03PS3) 10Hnowlan: Decode poolcounter messages, fix 429 error [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) [11:45:14] (03PS1) 10Muehlenhoff: Retire raid1-lvm-xfs-nova.cfg [puppet] - 10https://gerrit.wikimedia.org/r/855975 (https://phabricator.wikimedia.org/T156955) [11:45:40] (03CR) 10Btullis: istio: change configs to adapt for 1.15.3 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855967 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [11:47:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T321130)', diff saved to https://phabricator.wikimedia.org/P39243 and previous config saved to /var/cache/conftool/dbconfig/20221111-114712-marostegui.json [11:51:27] !log aborrero@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt2002-dev.codfw.wmnet with OS bullseye [11:51:30] (03CR) 10Gmodena: Varnish analytics: support differential privacy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824769 
(https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [11:51:36] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudvirt2002-dev.codfw.wmnet with O... [11:52:53] (03CR) 10Vlad.shapik: [C: 03+1] "Looks good to me." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [11:53:58] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2002-dev.codfw.wmnet with OS bullseye [11:54:08] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudvirt2002-dev.codfw.wmnet wi... [11:58:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Andrew might know or have opinions on this." 
[puppet] - 10https://gerrit.wikimedia.org/r/855975 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [11:59:20] (03CR) 10Hnowlan: [C: 03+2] Decode poolcounter messages, fix 429 error [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [12:00:35] (03CR) 10Vgutierrez: Varnish analytics: support differential privacy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [12:02:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P39244 and previous config saved to /var/cache/conftool/dbconfig/20221111-120219-marostegui.json [12:04:15] (03Merged) 10jenkins-bot: Decode poolcounter messages, fix 429 error [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [12:10:26] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [12:13:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet [12:14:06] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [12:17:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P39245 and previous config saved to /var/cache/conftool/dbconfig/20221111-121725-marostegui.json [12:19:02] (03PS1) 10Hnowlan: thumbor: bump version number [deployment-charts] - 10https://gerrit.wikimedia.org/r/855977 (https://phabricator.wikimedia.org/T233196) [12:27:47] (03CR) 10Hnowlan: [C: 03+2] thumbor: bump version number [deployment-charts] - 10https://gerrit.wikimedia.org/r/855977 (https://phabricator.wikimedia.org/T233196) (owner: 
10Hnowlan) [12:30:52] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [12:32:20] (03Merged) 10jenkins-bot: thumbor: bump version number [deployment-charts] - 10https://gerrit.wikimedia.org/r/855977 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:32:21] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [12:32:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T321130)', diff saved to https://phabricator.wikimedia.org/P39246 and previous config saved to /var/cache/conftool/dbconfig/20221111-123232-marostegui.json [12:32:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:32:39] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [12:32:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:32:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:33:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:33:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T321130)', diff saved to https://phabricator.wikimedia.org/P39247 and previous config saved to /var/cache/conftool/dbconfig/20221111-123310-marostegui.json [12:34:02] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ganeti1033.eqiad.wmnet [12:35:06] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:35:25] !log marostegui@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1158 (T321130)', diff saved to https://phabricator.wikimedia.org/P39248 and previous config saved to /var/cache/conftool/dbconfig/20221111-123524-marostegui.json [12:35:55] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:37:40] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2002-dev.codfw.wmnet with OS bullseye [12:37:51] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudvirt2002-dev.codfw.wmnet with O... [12:42:23] !log installing debootstrap bugfix updates from buster point release [12:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P39249 and previous config saved to /var/cache/conftool/dbconfig/20221111-125030-marostegui.json [12:55:19] !log jnuche@deploy1002 Started scap: (no justification provided) [12:58:24] (03PS1) 10QChris: Add .gitreview [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855979 [12:58:26] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855979 (owner: 10QChris) [13:01:30] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:01:30] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:03:18] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [13:05:22] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:05:38] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P39251 and previous config saved to /var/cache/conftool/dbconfig/20221111-130537-marostegui.json [13:05:41] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:06:00] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [13:06:00] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [13:06:00] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [13:06:00] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [13:06:00] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [13:06:00] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [13:06:00] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [13:06:01] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [13:07:56] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [13:08:01] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [13:08:01] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [13:08:04] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [13:10:07] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [13:10:35] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [13:12:58] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [13:12:58] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [13:12:58] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [13:12:58] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [13:12:58] !log 
jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [13:12:58] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [13:12:58] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [13:12:59] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [13:13:08] !log jnuche@deploy1002 sync-world aborted: (no justification provided) (duration: 17m 49s) [13:17:02] ^ please disregard, that was some testing related to scap-based K8s deployments [13:18:11] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [13:20:20] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [13:20:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T321130)', diff saved to https://phabricator.wikimedia.org/P39252 and previous config saved to /var/cache/conftool/dbconfig/20221111-132043-marostegui.json [13:20:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:20:48] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:20:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:21:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T321130)', diff saved to https://phabricator.wikimedia.org/P39253 and previous config saved to /var/cache/conftool/dbconfig/20221111-132105-marostegui.json [13:21:48] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [13:27:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 
(T321130)', diff saved to https://phabricator.wikimedia.org/P39254 and previous config saved to /var/cache/conftool/dbconfig/20221111-132714-marostegui.json [13:27:19] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:27:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt2003-dev: move to a single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/855043 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [13:30:29] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2003-dev.codfw.wmnet with OS bullseye [13:30:39] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudvirt2003-dev.codfw.wmnet with OS bullseye [13:30:54] !log installing procmail security updates [13:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:57] (03CR) 10Hokwelum: [C: 03+1] "Thanks for the update, Dan. 
WANSecurity is not currently an active mirror, which is why the ipv4 entry still has "wikimedia.wansec.com."" [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn) [13:37:14] (03PS1) 10Marostegui: site.pp: Fix db1206's owner [puppet] - 10https://gerrit.wikimedia.org/r/855982 [13:37:37] (03CR) 10Muehlenhoff: [C: 03+1] site.pp: Fix db1206's owner [puppet] - 10https://gerrit.wikimedia.org/r/855982 (owner: 10Marostegui) [13:37:53] (03CR) 10Marostegui: [C: 03+2] site.pp: Fix db1206's owner [puppet] - 10https://gerrit.wikimedia.org/r/855982 (owner: 10Marostegui) [13:42:00] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [13:42:09] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [13:42:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P39255 and previous config saved to /var/cache/conftool/dbconfig/20221111-134221-marostegui.json [13:45:33] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [13:47:03] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: host reimage [13:49:52] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: host reimage [13:50:02] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [13:51:33] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [13:53:54] PROBLEM - DPKG on netmon1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:55:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T318605)', diff saved to https://phabricator.wikimedia.org/P39256 and previous config saved to /var/cache/conftool/dbconfig/20221111-135506-ladsgroup.json 
[13:55:13] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:57:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P39257 and previous config saved to /var/cache/conftool/dbconfig/20221111-135727-marostegui.json [14:01:38] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:10:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P39258 and previous config saved to /var/cache/conftool/dbconfig/20221111-141012-ladsgroup.json [14:10:40] (03PS1) 10Marostegui: control-mariadb-client-10.4-bullseye: Version change [software] - 10https://gerrit.wikimedia.org/r/855985 (https://phabricator.wikimedia.org/T322620) [14:11:19] (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-10.4-bullseye: Version change [software] - 10https://gerrit.wikimedia.org/r/855985 (https://phabricator.wikimedia.org/T322620) (owner: 10Marostegui) [14:11:56] (03Merged) 10jenkins-bot: control-mariadb-client-10.4-bullseye: Version change [software] - 10https://gerrit.wikimedia.org/r/855985 (https://phabricator.wikimedia.org/T322620) (owner: 10Marostegui) [14:12:30] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2003-dev.codfw.wmnet with OS bullseye [14:12:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T321130)', diff saved to https://phabricator.wikimedia.org/P39259 and previous config saved to /var/cache/conftool/dbconfig/20221111-141233-marostegui.json [14:12:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1171.eqiad.wmnet with 
reason: Maintenance [14:12:37] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudvirt2003-dev.codfw.wmnet with OS bullseye completed:... [14:12:38] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [14:13:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:17:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:17:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:17:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T321130)', diff saved to https://phabricator.wikimedia.org/P39260 and previous config saved to /var/cache/conftool/dbconfig/20221111-141721-marostegui.json [14:19:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T321130)', diff saved to https://phabricator.wikimedia.org/P39261 and previous config saved to /var/cache/conftool/dbconfig/20221111-141935-marostegui.json [14:19:40] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [14:24:31] (03CR) 10Elukey: istio: change configs to adapt for 1.15.3 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855967 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [14:25:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P39262 and previous config saved to /var/cache/conftool/dbconfig/20221111-142519-ladsgroup.json [14:31:08] (03CR) 10Elukey: [C: 03+1] "I am very ignorant 
about the new _helpers templates but afaics it looks good :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/855667 (owner: 10Giuseppe Lavagetto) [14:32:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/855720 (https://phabricator.wikimedia.org/T135991) (owner: 10Dzahn) [14:34:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P39263 and previous config saved to /var/cache/conftool/dbconfig/20221111-143441-marostegui.json [14:35:48] (03PS1) 10Ssingh: Depool ulsfo for resolving varnish issues [dns] - 10https://gerrit.wikimedia.org/r/855987 (https://phabricator.wikimedia.org/T322903) [14:39:09] (03CR) 10Ssingh: "Emergency patch, do not merge." [dns] - 10https://gerrit.wikimedia.org/r/855987 (https://phabricator.wikimedia.org/T322903) (owner: 10Ssingh) [14:40:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T318605)', diff saved to https://phabricator.wikimedia.org/P39264 and previous config saved to /var/cache/conftool/dbconfig/20221111-144025-ladsgroup.json [14:40:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [14:40:31] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:40:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [14:40:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T318605)', diff saved to https://phabricator.wikimedia.org/P39265 and previous config saved to /var/cache/conftool/dbconfig/20221111-144047-ladsgroup.json [14:47:04] 10SRE-tools, 10DBA, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10User-Kormat: Revert workaround for cumin output verbosity on RemoteExecution (CuminExecution) abstraction - 
https://phabricator.wikimedia.org/T282775 (10Marostegui) What's the status of this task? [14:49:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P39266 and previous config saved to /var/cache/conftool/dbconfig/20221111-144948-marostegui.json [14:55:17] 10SRE-tools, 10DBA, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10User-Kormat: Revert workaround for cumin output verbosity on RemoteExecution (CuminExecution) abstraction - https://phabricator.wikimedia.org/T282775 (10jcrespo) I coded RemoteExecution initially for the backup library. But I... [14:57:10] 10SRE-tools, 10DBA, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10User-Kormat: Revert workaround for cumin output verbosity on RemoteExecution (CuminExecution) abstraction - https://phabricator.wikimedia.org/T282775 (10Marostegui) Thanks for the update, so it is still a valid task :) [15:04:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T321130)', diff saved to https://phabricator.wikimedia.org/P39267 and previous config saved to /var/cache/conftool/dbconfig/20221111-150454-marostegui.json [15:04:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1191.eqiad.wmnet with reason: Maintenance [15:05:00] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [15:05:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1191.eqiad.wmnet with reason: Maintenance [15:05:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T321130)', diff saved to https://phabricator.wikimedia.org/P39268 and previous config saved to /var/cache/conftool/dbconfig/20221111-150516-marostegui.json [15:08:27] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - 
https://phabricator.wikimedia.org/T319184 (10aborrero) [15:13:00] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-36), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10TheresNoTime) [15:18:00] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-36), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10TheresNoTime) 05Open→03Stalled [[ https://gerrit.wikimedia.org/r/c/operations/puppet... [15:20:31] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [15:21:09] !log installing node-end-of-stream security updates [15:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P39269 and previous config saved to /var/cache/conftool/dbconfig/20221111-153009-ladsgroup.json [15:30:15] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:33:47] (03PS1) 10Vgutierrez: varnish: Disable THP for varnish on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/855992 (https://phabricator.wikimedia.org/T322903) [15:39:57] (03PS1) 10Ssingh: site: update role for cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/855993 [15:41:04] (03PS2) 10Vgutierrez: varnish: Disable THP for varnish on cp404[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/855992 (https://phabricator.wikimedia.org/T322903) [15:42:22] (03CR) 10Ssingh: [C: 03+2] site: update role for cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/855993 (owner: 10Ssingh) [15:43:39] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 4): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38109/console" [puppet] - 10https://gerrit.wikimedia.org/r/855992 (https://phabricator.wikimedia.org/T322903) (owner: 10Vgutierrez) [15:43:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster [15:44:41] (03PS1) 10AikoChou: ml-services: update outlink's model binary and docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/855995 (https://phabricator.wikimedia.org/T322881) [15:45:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P39270 and previous config saved to /var/cache/conftool/dbconfig/20221111-154515-ladsgroup.json [15:49:21] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:49:52] (03CR) 10Elukey: [C: 03+2] ml-services: update outlink's model binary and docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/855995 (https://phabricator.wikimedia.org/T322881) (owner: 10AikoChou) [15:50:21] 10SRE, 10Traffic, 10Patch-For-Review: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (10Vgutierrez) [15:51:21] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:51:52] (03PS1) 10Ssingh: Release 0.15.0-2 [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) [15:53:35] (03CR) 10Ssingh: "No debian-glue yet but patch submitted for that: Idca43d2bc23c38bd664cdab298dda6541b674c45" [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:54:16] (03PS1) 10JMeybohm: k8s: make profile::kubernetes::cluster_cidr mandatory [puppet] - 10https://gerrit.wikimedia.org/r/855997 (https://phabricator.wikimedia.org/T307943) [15:55:46] (03CR) 10JMeybohm: "Please double check your clusters CIDRs!" [puppet] - 10https://gerrit.wikimedia.org/r/855997 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [15:56:40] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] varnish: Disable THP for varnish on cp404[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/855992 (https://phabricator.wikimedia.org/T322903) (owner: 10Vgutierrez) [15:56:44] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster [15:57:13] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[15:58:24] !log rolling restart of varnish in cp4045 - cp4050 - T322903 [15:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:29] T322903: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 [16:00:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P39271 and previous config saved to /var/cache/conftool/dbconfig/20221111-160022-ladsgroup.json [16:00:28] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38110/console" [puppet] - 10https://gerrit.wikimedia.org/r/855997 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:03:07] (03CR) 10Btullis: [C: 03+1] "Thanks. Have double-checked the dse-k8s CIDRs and the two manifest files look good to me." [puppet] - 10https://gerrit.wikimedia.org/r/855997 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:05:19] !log restart varnish in cp2042 [16:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321130)', diff saved to https://phabricator.wikimedia.org/P39272 and previous config saved to /var/cache/conftool/dbconfig/20221111-160532-marostegui.json [16:05:37] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [16:13:29] (03PS1) 10Muehlenhoff: buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/855998 [16:15:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P39273 and previous config saved to /var/cache/conftool/dbconfig/20221111-161528-ladsgroup.json [16:15:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: 
Maintenance [16:15:32] (03PS2) 10Arturo Borrero Gonzalez: codfw1dev: hiera: cleanup per-host network overrides [puppet] - 10https://gerrit.wikimedia.org/r/855044 [16:15:35] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:15:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [16:16:40] 10SRE, 10cloud-services-team (Kanban): rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) [16:17:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [16:18:42] (03CR) 10Muehlenhoff: [C: 03+2] buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/855998 (owner: 10Muehlenhoff) [16:20:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P39274 and previous config saved to /var/cache/conftool/dbconfig/20221111-162038-marostegui.json [16:21:19] (03CR) 10Ssingh: "Since reviewing this might be a bit hard given there is no history in this repository:" [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:22:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [16:26:38] (03PS2) 10Ssingh: Release 0.15.0-2 [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) [16:28:59] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10ILooremeta-WMF) @fgiunchedi what would the email read like, please? I think I might have lost it in the many updates [16:35:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P39275 and previous config saved to /var/cache/conftool/dbconfig/20221111-163545-marostegui.json [16:39:39] PROBLEM - Host lvs1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:39:53] er [16:39:58] will file a task [16:42:25] 10ops-eqiad, 10Traffic: Host lvs1014.mgmt is down - https://phabricator.wikimedia.org/T322933 (10ssingh) [16:42:35] 10SRE, 10Traffic, 10Patch-For-Review: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (10Vgutierrez) p:05High→03Medium Lowering the priority after deploying several experiments in upload@ulsfo that could mitigate the issue, see the task description for more details [16:42:39] 10ops-eqiad, 10Traffic: Host lvs1014.mgmt is down - https://phabricator.wikimedia.org/T322933 (10ssingh) p:05Triage→03Medium [16:44:47] (03PS3) 10Arturo Borrero Gonzalez: codfw1dev: hiera: cleanup per-host network overrides [puppet] - 10https://gerrit.wikimedia.org/r/855044 [16:49:19] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC NOOP https://puppet-compiler.wmflabs.org/pcc-worker1001/38112/" [puppet] - 10https://gerrit.wikimedia.org/r/855044 (owner: 10Arturo Borrero Gonzalez) [16:50:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321130)', diff saved to https://phabricator.wikimedia.org/P39277 and previous config saved 
to /var/cache/conftool/dbconfig/20221111-165051-marostegui.json [16:50:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1194.eqiad.wmnet with reason: Maintenance [16:50:57] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [16:51:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1194.eqiad.wmnet with reason: Maintenance [16:51:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T321130)', diff saved to https://phabricator.wikimedia.org/P39278 and previous config saved to /var/cache/conftool/dbconfig/20221111-165113-marostegui.json [16:53:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T321130)', diff saved to https://phabricator.wikimedia.org/P39279 and previous config saved to /var/cache/conftool/dbconfig/20221111-165326-marostegui.json [16:53:46] (03PS1) 10JMeybohm: k8s: Refactor profile::kubernetes::master::service_cluster_ip_range [puppet] - 10https://gerrit.wikimedia.org/r/855999 (https://phabricator.wikimedia.org/T307943) [16:54:49] (03PS1) 10Vgutierrez: varnish: Remove deprecated jemalloc options [puppet] - 10https://gerrit.wikimedia.org/r/856000 [16:55:08] (03CR) 10JMeybohm: "Please double check your service clusters CIDRs!" [puppet] - 10https://gerrit.wikimedia.org/r/855999 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:58:23] (03CR) 10FNegri: [C: 03+1] "PCC looks good, nice cleanup!" [puppet] - 10https://gerrit.wikimedia.org/r/855044 (owner: 10Arturo Borrero Gonzalez) [16:58:59] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] codfw1dev: hiera: cleanup per-host network overrides [puppet] - 10https://gerrit.wikimedia.org/r/855044 (owner: 10Arturo Borrero Gonzalez) [17:03:01] (03CR) 10Btullis: [C: 03+1] "Double checked our cluster's CIDRs. Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/855999 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [17:03:13] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38113/console" [puppet] - 10https://gerrit.wikimedia.org/r/855999 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [17:07:00] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38114/console" [puppet] - 10https://gerrit.wikimedia.org/r/856000 (owner: 10Vgutierrez) [17:08:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P39280 and previous config saved to /var/cache/conftool/dbconfig/20221111-170833-marostegui.json [17:23:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P39281 and previous config saved to /var/cache/conftool/dbconfig/20221111-172339-marostegui.json [17:24:47] 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (10jhathaway) > Let me elaborate a little more on my experience in deployment-prep: > * I created a cloud server with cloud-init and my cloud public key, but was permanently... 
[17:24:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:29:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:34:38] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet,service=ats-tls [17:34:38] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet,service=ats-be [17:34:39] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet,service=varnish-fe [17:38:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:38:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Jclark-ctr bit of a heads up I'm hoping to get the migration kicked off for those Juniper Spine devices now that we've got the lic... 
[17:38:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T321130)', diff saved to https://phabricator.wikimedia.org/P39282 and previous config saved to /var/cache/conftool/dbconfig/20221111-173846-marostegui.json [17:38:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1202.eqiad.wmnet with reason: Maintenance [17:38:51] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [17:39:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1202.eqiad.wmnet with reason: Maintenance [17:39:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T321130)', diff saved to https://phabricator.wikimedia.org/P39283 and previous config saved to /var/cache/conftool/dbconfig/20221111-173907-marostegui.json [17:39:21] (03CR) 10JHathaway: [C: 03+1] "looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/855967 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [17:41:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T321130)', diff saved to https://phabricator.wikimedia.org/P39284 and previous config saved to /var/cache/conftool/dbconfig/20221111-174121-marostegui.json [17:42:35] (03CR) 10JHathaway: [C: 03+1] "ranges look correct for aux, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/855999 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [17:43:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:43:34] (03CR) 10JHathaway: [C: 03+1] "ranges look correct for aux, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/855997 
(https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [17:56:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P39285 and previous config saved to /var/cache/conftool/dbconfig/20221111-175627-marostegui.json [18:01:38] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [18:11:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P39286 and previous config saved to /var/cache/conftool/dbconfig/20221111-181134-marostegui.json [18:26:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T321130)', diff saved to https://phabricator.wikimedia.org/P39287 and previous config saved to /var/cache/conftool/dbconfig/20221111-182640-marostegui.json [18:26:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:26:47] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:26:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:30:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2098.codfw.wmnet with reason: Maintenance [18:31:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2098.codfw.wmnet with reason: Maintenance [18:35:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2100.codfw.wmnet with reason: Maintenance [18:35:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
5:00:00 on db2100.codfw.wmnet with reason: Maintenance [18:39:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2108.codfw.wmnet with reason: Maintenance [18:40:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2108.codfw.wmnet with reason: Maintenance [18:40:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T321130)', diff saved to https://phabricator.wikimedia.org/P39288 and previous config saved to /var/cache/conftool/dbconfig/20221111-184017-marostegui.json [18:40:22] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:46:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T321130)', diff saved to https://phabricator.wikimedia.org/P39289 and previous config saved to /var/cache/conftool/dbconfig/20221111-184633-marostegui.json [18:46:38] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:55:16] (03PS2) 10Eevans: Add component/gocql to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/855102 (https://phabricator.wikimedia.org/T283838) [18:56:16] (03CR) 10Eevans: [C: 03+2] Add component/gocql to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/855102 (https://phabricator.wikimedia.org/T283838) (owner: 10Eevans) [19:01:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P39290 and previous config saved to /var/cache/conftool/dbconfig/20221111-190139-marostegui.json [19:07:03] (03PS1) 10Sergio Gimeno: [Growth] Make Vue mentor dashboard default by removing GEMentorDashboardUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 [19:08:51] (03PS2) 10Sergio Gimeno: GrowthExperiments: Make Vue mentor dashboard default by removing GEMentorDashboardUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 
[19:16:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P39291 and previous config saved to /var/cache/conftool/dbconfig/20221111-191646-marostegui.json [19:31:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T321130)', diff saved to https://phabricator.wikimedia.org/P39292 and previous config saved to /var/cache/conftool/dbconfig/20221111-193152-marostegui.json [19:31:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2120.codfw.wmnet with reason: Maintenance [19:31:57] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [19:32:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2120.codfw.wmnet with reason: Maintenance [19:32:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T321130)', diff saved to https://phabricator.wikimedia.org/P39293 and previous config saved to /var/cache/conftool/dbconfig/20221111-193214-marostegui.json [19:35:24] (03CR) 10Htriedman: Varnish analytics: support differential privacy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [19:38:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T321130)', diff saved to https://phabricator.wikimedia.org/P39294 and previous config saved to /var/cache/conftool/dbconfig/20221111-193832-marostegui.json [19:38:38] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [19:41:00] (03CR) 10Dzahn: [C: 03+2] phabricator/aphlict: pass through ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/855720 (https://phabricator.wikimedia.org/T135991) (owner: 10Dzahn) [19:46:19] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on 
alert1001 is CRITICAL: 104 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:48:19] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:51:10] (03CR) 10Dzahn: [C: 03+2] "noop confirmed. thanks. yea, so this is an improvement and a noop everywhere but that still doesn't remove the restart code and alert from" [puppet] - 10https://gerrit.wikimedia.org/r/855720 (https://phabricator.wikimedia.org/T135991) (owner: 10Dzahn) [19:53:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P39295 and previous config saved to /var/cache/conftool/dbconfig/20221111-195338-marostegui.json [20:08:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P39296 and previous config saved to /var/cache/conftool/dbconfig/20221111-200845-marostegui.json [20:14:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T318605)', diff saved to https://phabricator.wikimedia.org/P39297 and previous config saved to /var/cache/conftool/dbconfig/20221111-201400-ladsgroup.json [20:14:05] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:20:39] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:57] RECOVERY - Check systemd state on phab1004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:03] RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:05] !log phab1001,phab1004,phab2002 - systemctl reset-failed [20:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:32] (03CR) 10Dzahn: [C: 03+2] "it just needed a 'systemctl reset-failed' on the 3 phab hosts. icinga recovered. units don't exist anymore and puppet is not adding them b" [puppet] - 10https://gerrit.wikimedia.org/r/855720 (https://phabricator.wikimedia.org/T135991) (owner: 10Dzahn) [20:23:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T321130)', diff saved to https://phabricator.wikimedia.org/P39298 and previous config saved to /var/cache/conftool/dbconfig/20221111-202351-marostegui.json [20:23:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2121.codfw.wmnet with reason: Maintenance [20:23:56] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [20:24:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2121.codfw.wmnet with reason: Maintenance [20:24:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T321130)', diff saved to https://phabricator.wikimedia.org/P39299 and previous config saved to /var/cache/conftool/dbconfig/20221111-202413-marostegui.json [20:29:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P39300 and previous config saved to /var/cache/conftool/dbconfig/20221111-202906-ladsgroup.json [20:30:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T321130)', diff saved to https://phabricator.wikimedia.org/P39301 and previous config saved 
to /var/cache/conftool/dbconfig/20221111-203030-marostegui.json [20:30:35] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [20:44:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P39302 and previous config saved to /var/cache/conftool/dbconfig/20221111-204413-ladsgroup.json [20:45:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P39303 and previous config saved to /var/cache/conftool/dbconfig/20221111-204536-marostegui.json [20:59:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T318605)', diff saved to https://phabricator.wikimedia.org/P39304 and previous config saved to /var/cache/conftool/dbconfig/20221111-205919-ladsgroup.json [20:59:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [20:59:26] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:59:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [21:00:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P39305 and previous config saved to /var/cache/conftool/dbconfig/20221111-210043-marostegui.json [21:10:15] (03PS1) 10Dzahn: phabricator: add parameter for mysql port, set it to 3323 if using slave [puppet] - 10https://gerrit.wikimedia.org/r/856013 (https://phabricator.wikimedia.org/T280597) [21:15:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T321130)', diff saved to https://phabricator.wikimedia.org/P39306 and previous config saved to 
/var/cache/conftool/dbconfig/20221111-211550-marostegui.json [21:15:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2122.codfw.wmnet with reason: Maintenance [21:15:56] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [21:16:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2122.codfw.wmnet with reason: Maintenance [21:16:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T321130)', diff saved to https://phabricator.wikimedia.org/P39307 and previous config saved to /var/cache/conftool/dbconfig/20221111-211611-marostegui.json [21:22:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T321130)', diff saved to https://phabricator.wikimedia.org/P39308 and previous config saved to /var/cache/conftool/dbconfig/20221111-212239-marostegui.json [21:22:44] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [21:37:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P39309 and previous config saved to /var/cache/conftool/dbconfig/20221111-213745-marostegui.json [21:52:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P39310 and previous config saved to /var/cache/conftool/dbconfig/20221111-215252-marostegui.json [22:01:38] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [22:07:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T321130)', diff saved to https://phabricator.wikimedia.org/P39311 and previous 
config saved to /var/cache/conftool/dbconfig/20221111-220758-marostegui.json [22:08:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2150.codfw.wmnet with reason: Maintenance [22:08:04] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [22:08:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2150.codfw.wmnet with reason: Maintenance [22:08:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T321130)', diff saved to https://phabricator.wikimedia.org/P39312 and previous config saved to /var/cache/conftool/dbconfig/20221111-220820-marostegui.json [22:09:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [22:09:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [22:09:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T318605)', diff saved to https://phabricator.wikimedia.org/P39313 and previous config saved to /var/cache/conftool/dbconfig/20221111-220939-ladsgroup.json [22:09:44] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [22:14:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T321130)', diff saved to https://phabricator.wikimedia.org/P39314 and previous config saved to /var/cache/conftool/dbconfig/20221111-221441-marostegui.json [22:14:47] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [22:29:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P39315 and previous config saved to /var/cache/conftool/dbconfig/20221111-222948-marostegui.json [22:44:55] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P39316 and previous config saved to /var/cache/conftool/dbconfig/20221111-224454-marostegui.json [23:00:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T321130)', diff saved to https://phabricator.wikimedia.org/P39317 and previous config saved to /var/cache/conftool/dbconfig/20221111-230000-marostegui.json [23:00:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2159.codfw.wmnet with reason: Maintenance [23:00:06] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [23:00:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2159.codfw.wmnet with reason: Maintenance [23:00:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [23:00:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [23:00:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T321130)', diff saved to https://phabricator.wikimedia.org/P39318 and previous config saved to /var/cache/conftool/dbconfig/20221111-230037-marostegui.json [23:06:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T321130)', diff saved to https://phabricator.wikimedia.org/P39319 and previous config saved to /var/cache/conftool/dbconfig/20221111-230654-marostegui.json [23:07:00] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [23:16:23] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:22:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P39320 and previous config saved to /var/cache/conftool/dbconfig/20221111-232201-marostegui.json [23:36:23] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:37:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P39321 and previous config saved to /var/cache/conftool/dbconfig/20221111-233707-marostegui.json [23:52:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T321130)', diff saved to https://phabricator.wikimedia.org/P39322 and previous config saved to /var/cache/conftool/dbconfig/20221111-235214-marostegui.json [23:52:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2168.codfw.wmnet with reason: Maintenance [23:52:19] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [23:52:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2168.codfw.wmnet with reason: Maintenance [23:52:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T321130)', diff saved to https://phabricator.wikimedia.org/P39323 and previous config saved to /var/cache/conftool/dbconfig/20221111-235235-marostegui.json [23:59:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T321130)', diff saved to https://phabricator.wikimedia.org/P39324 and previous config saved to 
/var/cache/conftool/dbconfig/20221111-235902-marostegui.json [23:59:07] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130