[00:14:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P39325 and previous config saved to /var/cache/conftool/dbconfig/20221112-001408-marostegui.json [00:29:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P39326 and previous config saved to /var/cache/conftool/dbconfig/20221112-002915-marostegui.json [00:43:49] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T321130)', diff saved to https://phabricator.wikimedia.org/P39327 and previous config saved to /var/cache/conftool/dbconfig/20221112-004422-marostegui.json [00:44:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2169.codfw.wmnet with reason: Maintenance [00:44:28] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [00:44:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2169.codfw.wmnet with reason: Maintenance [00:44:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T321130)', diff saved to https://phabricator.wikimedia.org/P39328 and previous config saved to /var/cache/conftool/dbconfig/20221112-004443-marostegui.json [00:51:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T321130)', diff saved to https://phabricator.wikimedia.org/P39329 and previous config saved to /var/cache/conftool/dbconfig/20221112-005107-marostegui.json [00:51:15] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [00:54:55] (LogstashKafkaConsumerLag) firing: Too many 
messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:58:07] PROBLEM - SSH on mw1329.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:04:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [01:06:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P39330 and previous config saved to /var/cache/conftool/dbconfig/20221112-010615-marostegui.json [01:21:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P39331 and previous config saved to /var/cache/conftool/dbconfig/20221112-012122-marostegui.json [01:36:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T321130)', diff saved to https://phabricator.wikimedia.org/P39332 and previous config saved to /var/cache/conftool/dbconfig/20221112-013628-marostegui.json [01:36:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2182.codfw.wmnet with reason: Maintenance [01:36:34] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [01:36:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2182.codfw.wmnet with reason: Maintenance [01:36:51] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T321130)', diff saved to https://phabricator.wikimedia.org/P39333 and previous config saved to /var/cache/conftool/dbconfig/20221112-013650-marostegui.json [01:38:52] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T321130)', diff saved to https://phabricator.wikimedia.org/P39334 and previous config saved to /var/cache/conftool/dbconfig/20221112-014308-marostegui.json [01:43:14] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [01:48:52] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:58:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P39335 and previous config saved to /var/cache/conftool/dbconfig/20221112-015814-marostegui.json [02:08:52] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:13:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P39336 and previous config saved to /var/cache/conftool/dbconfig/20221112-021321-marostegui.json [02:18:52] (JobUnavailable) resolved: (10) Reduced 
availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:25:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [02:25:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [02:25:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T318605)', diff saved to https://phabricator.wikimedia.org/P39337 and previous config saved to /var/cache/conftool/dbconfig/20221112-022535-ladsgroup.json [02:25:39] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [02:28:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T321130)', diff saved to https://phabricator.wikimedia.org/P39338 and previous config saved to /var/cache/conftool/dbconfig/20221112-022827-marostegui.json [02:28:32] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [03:05:29] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 202 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:07:29] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:46:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T318605)', diff saved to 
https://phabricator.wikimedia.org/P39339 and previous config saved to /var/cache/conftool/dbconfig/20221112-034618-ladsgroup.json [03:46:23] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [04:01:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P39340 and previous config saved to /var/cache/conftool/dbconfig/20221112-040124-ladsgroup.json [04:16:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P39341 and previous config saved to /var/cache/conftool/dbconfig/20221112-041631-ladsgroup.json [04:31:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T318605)', diff saved to https://phabricator.wikimedia.org/P39342 and previous config saved to /var/cache/conftool/dbconfig/20221112-043137-ladsgroup.json [04:31:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [04:31:42] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [04:31:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [04:31:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [04:31:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [04:32:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T318605)', diff saved to https://phabricator.wikimedia.org/P39343 and previous config saved to /var/cache/conftool/dbconfig/20221112-043203-ladsgroup.json [07:40:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1160 (T318605)', diff saved to https://phabricator.wikimedia.org/P39344 and previous config saved to /var/cache/conftool/dbconfig/20221112-074042-ladsgroup.json [07:40:47] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [07:55:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P39345 and previous config saved to /var/cache/conftool/dbconfig/20221112-075548-ladsgroup.json [08:10:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P39346 and previous config saved to /var/cache/conftool/dbconfig/20221112-081055-ladsgroup.json [08:26:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T318605)', diff saved to https://phabricator.wikimedia.org/P39347 and previous config saved to /var/cache/conftool/dbconfig/20221112-082601-ladsgroup.json [08:26:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [08:26:07] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [08:26:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [08:26:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T318605)', diff saved to https://phabricator.wikimedia.org/P39348 and previous config saved to /var/cache/conftool/dbconfig/20221112-082623-ladsgroup.json [09:11:41] RECOVERY - Host lvs1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [09:26:01] PROBLEM - Host lvs1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:13:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T318605)', diff saved to https://phabricator.wikimedia.org/P39349 and previous config 
saved to /var/cache/conftool/dbconfig/20221112-101306-ladsgroup.json [10:13:12] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [10:28:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P39350 and previous config saved to /var/cache/conftool/dbconfig/20221112-102812-ladsgroup.json [10:43:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P39351 and previous config saved to /var/cache/conftool/dbconfig/20221112-104319-ladsgroup.json [10:58:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T318605)', diff saved to https://phabricator.wikimedia.org/P39352 and previous config saved to /var/cache/conftool/dbconfig/20221112-105825-ladsgroup.json [10:58:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [10:58:31] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [10:58:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [10:58:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T318605)', diff saved to https://phabricator.wikimedia.org/P39353 and previous config saved to /var/cache/conftool/dbconfig/20221112-105847-ladsgroup.json [12:09:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:14:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:44:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:57:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T318605)', diff saved to https://phabricator.wikimedia.org/P39354 and previous config saved to /var/cache/conftool/dbconfig/20221112-135721-ladsgroup.json [13:57:26] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:57:43] (PS1) Ladsgroup: Add w/api/index.php [mediawiki-config] - https://gerrit.wikimedia.org/r/856030 (https://phabricator.wikimedia.org/T273179) [14:12:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P39355 and previous config saved to /var/cache/conftool/dbconfig/20221112-141227-ladsgroup.json [14:27:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P39356 and previous config saved to /var/cache/conftool/dbconfig/20221112-142734-ladsgroup.json [14:30:49] PROBLEM - MediaWiki exceptions and fatals per 
minute for api_appserver on alert1001 is CRITICAL: 134 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:32:49] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:40:53] RECOVERY - cassandra-a CQL 10.64.0.199:9042 on aqs1016 is OK: TCP OK - 0.000 second response time on 10.64.0.199 port 9042 https://phabricator.wikimedia.org/T93886 [14:42:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T318605)', diff saved to https://phabricator.wikimedia.org/P39357 and previous config saved to /var/cache/conftool/dbconfig/20221112-144240-ladsgroup.json [14:42:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [14:42:46] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:42:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [14:43:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T318605)', diff saved to https://phabricator.wikimedia.org/P39358 and previous config saved to /var/cache/conftool/dbconfig/20221112-144302-ladsgroup.json [16:31:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T318605)', diff saved to https://phabricator.wikimedia.org/P39359 and previous config saved to /var/cache/conftool/dbconfig/20221112-163124-ladsgroup.json [16:31:29] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:46:31] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P39360 and previous config saved to /var/cache/conftool/dbconfig/20221112-164630-ladsgroup.json [17:01:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P39361 and previous config saved to /var/cache/conftool/dbconfig/20221112-170137-ladsgroup.json [17:10:56] (CR) Tacsipacsi: Add w/api/index.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/856030 (https://phabricator.wikimedia.org/T273179) (owner: Ladsgroup) [17:16:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T318605)', diff saved to https://phabricator.wikimedia.org/P39362 and previous config saved to /var/cache/conftool/dbconfig/20221112-171643-ladsgroup.json [17:16:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [17:16:49] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:16:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [17:17:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T318605)', diff saved to https://phabricator.wikimedia.org/P39363 and previous config saved to /var/cache/conftool/dbconfig/20221112-171705-ladsgroup.json [17:32:09] !log uploaded python3-gjson_0.4.0 to apt.wikimedia.org bullseye-wikimedia [17:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:35] PROBLEM - Host db2173 is DOWN: PING CRITICAL - Packet loss = 100% [18:45:31] PROBLEM - MariaDB Replica IO: s1 on db2094 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2173.codfw.wmnet:3306 - 
retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2173.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:52:53] SRE: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T322968 (Ganeshnanjaraj) [18:59:31] PROBLEM - MariaDB Replica Lag: s1 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1162.66 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:49:15] (PS6) Andrea Denisse: netmon: Open LibreNMS port for netmon2002. [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) [19:52:18] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38115/console" [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [20:05:29] (PS7) Andrea Denisse: netmon: Open LibreNMS port for netmon2002. [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) [20:06:28] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38116/console" [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [20:13:01] (CR) Andrea Denisse: netmon: Open LibreNMS port for netmon2002. 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [20:18:11] (PS2) Andrea Denisse: netmon: Add netmon2002 to the alertmanager rw api [puppet] - https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) [20:20:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T318605)', diff saved to https://phabricator.wikimedia.org/P39364 and previous config saved to /var/cache/conftool/dbconfig/20221112-202007-ladsgroup.json [20:20:13] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:21:14] (CR) Andrea Denisse: netmon: Add netmon2002 to the alertmanager rw api (1 comment) [puppet] - https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [20:22:31] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38117/console" [puppet] - https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [20:24:38] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38118/console" [puppet] - https://gerrit.wikimedia.org/r/854625 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [20:35:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P39365 and previous config saved to /var/cache/conftool/dbconfig/20221112-203514-ladsgroup.json [20:50:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P39366 and previous config saved to /var/cache/conftool/dbconfig/20221112-205020-ladsgroup.json [21:05:27] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1199 (T318605)', diff saved to https://phabricator.wikimedia.org/P39367 and previous config saved to /var/cache/conftool/dbconfig/20221112-210527-ladsgroup.json [21:05:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:05:32] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [21:05:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:46:58] !log initiating bootstrap, aqs1016-b -- T307802 [22:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:03] T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 [22:48:49] RECOVERY - cassandra-b service on aqs1016 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:49:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T318605)', diff saved to https://phabricator.wikimedia.org/P39368 and previous config saved to /var/cache/conftool/dbconfig/20221112-224900-ladsgroup.json [22:49:09] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [22:50:01] RECOVERY - cassandra-b SSL 10.64.0.213:7001 on aqs1016 is OK: SSL OK - Certificate aqs1016-b valid until 2024-11-08 15:06:18 +0000 (expires in 726 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [23:04:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P39369 and previous config saved to /var/cache/conftool/dbconfig/20221112-230407-ladsgroup.json [23:19:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to 
https://phabricator.wikimedia.org/P39370 and previous config saved to /var/cache/conftool/dbconfig/20221112-231914-ladsgroup.json [23:34:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T318605)', diff saved to https://phabricator.wikimedia.org/P39371 and previous config saved to /var/cache/conftool/dbconfig/20221112-233420-ladsgroup.json [23:34:26] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [23:36:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [23:39:15] PROBLEM - Freshness of OCSP Stapling files -HAProxy- on cp1086 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2022-rsa-unified.ocsp is more than 259500 secs old! https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [23:41:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [23:45:41] PROBLEM - Freshness of OCSP Stapling files -HAProxy- on cp1083 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2021-rsa-unified.ocsp is more than 259500 secs old! https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [23:59:42] (PS1) Andrea Denisse: Lower the TTL for netbox for the migration. [dns] - https://gerrit.wikimedia.org/r/856065 (https://phabricator.wikimedia.org/T315523)