[00:03:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:08:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:09:12] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[00:12:42] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[00:17:24] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db2099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1029.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:17:44] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db1140 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1055.92 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:18:28] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s3 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1099.86 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:41:26] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[00:43:10] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[02:02:56] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db1140 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:10:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:48] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s3 on db1102 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:20:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:26] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[02:38:36] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db2099 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:44:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:27:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:32:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:29:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance
[05:30:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance
[05:30:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[05:30:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[05:30:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T318605)', diff saved to https://phabricator.wikimedia.org/P43439 and previous config saved to /var/cache/conftool/dbconfig/20230130-053033-ladsgroup.json
[05:30:37] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[06:07:13] <wikibugs>	 (CR) Winston Sung: [C: +1] Update cxserver to 2023-01-23-123356-production [deployment-charts] - https://gerrit.wikimedia.org/r/882791 (https://phabricator.wikimedia.org/T129470) (owner: KartikMistry)
[06:11:43] <wikibugs>	 (CR) Kevin Bazira: [C: +1] wmf-config: add new revision-score streams for EventGate main [mediawiki-config] - https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: Elukey)
[06:13:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[06:13:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[06:14:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P43440 and previous config saved to /var/cache/conftool/dbconfig/20230130-061401-ladsgroup.json
[06:14:06] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[06:15:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance
[06:15:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance
[06:15:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2140 (T318605)', diff saved to https://phabricator.wikimedia.org/P43441 and previous config saved to /var/cache/conftool/dbconfig/20230130-061534-ladsgroup.json
[06:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:32:26] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[06:34:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:34:24] <marostegui>	 !log dbmaint Schema change on s6 eqiad T328086
[06:34:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:28] <stashbot>	 T328086: Drop cul_user and cul_user_text from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328086
[06:35:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T328022
[06:35:31] <stashbot>	 T328022: Switchover s4 master (db2110 -> db2140) - https://phabricator.wikimedia.org/T328022
[06:36:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T328022
[06:36:27] <marostegui>	 Amir1: Any chances you can stop your maintenance on s4?
[06:36:36] <marostegui>	 I need to switchover s4 codfw master for the switches upgrade
[06:36:37] <Amir1>	 marostegui: only with bribe
[06:36:49] <marostegui>	 Amir1: If you stop it I promise you I won't bring s4 down
[06:36:55] <marostegui>	 is that good enough??
[06:36:56] <Amir1>	 sold
[06:37:18] <marostegui>	 \o/
[06:37:29] <marostegui>	 Amir1: it shouldn't take long :)
[06:37:45] <Amir1>	 let me know once done
[06:37:48] <marostegui>	 will do
[06:37:54] <marostegui>	 can I restart replication on db2140?
[06:37:58] <Amir1>	 actually I'm running alter table on one of them, would that impact it?
[06:38:08] <marostegui>	 will it take long?
[06:38:16] <Amir1>	 let me take a look
[06:38:33] <marostegui>	 is it running on db2140?
[06:38:43] <Amir1>	 yeah, it's externallinks
[06:38:54] <marostegui>	 yeah, so I need it to get finished before I can proceed
[06:39:51] <Amir1>	 marostegui: it's an alter table, if you kill it it's fine, I'll restart it, just ping me once done. Would that be okay?
[06:39:55] <marostegui>	 no no
[06:39:57] <marostegui>	 it is fine
[06:39:58] <marostegui>	 I can wait
[06:40:11] <Amir1>	 let me see how long it'll take
[06:41:13] <marostegui>	 !log dbmaint Schema change on s8 eqiad T328086
[06:41:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:17] <Amir1>	 I have some bad news, it's going to take 6 hours
[06:41:17] <stashbot>	 T328086: Drop cul_user and cul_user_text from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328086
[06:41:25] <marostegui>	 Amir1: that is ok
[06:43:05] <marostegui>	 !log dbmaint Schema change on s7 eqiad T328086
[06:43:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:37] <marostegui>	 !log dbmaint Schema change on s2 eqiad T328086
[06:45:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:22] <wikibugs>	 (PS1) Marostegui: site.pp: Move db1195 to m2 [puppet] - https://gerrit.wikimedia.org/r/884722 (https://phabricator.wikimedia.org/T327995)
[06:51:03] <marostegui>	 !log dbmaint Schema change on s5 eqiad T328086
[06:51:06] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Move db1195 to m2 [puppet] - https://gerrit.wikimedia.org/r/884722 (https://phabricator.wikimedia.org/T327995) (owner: Marostegui)
[06:51:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:07] <stashbot>	 T328086: Drop cul_user and cul_user_text from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328086
[06:52:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P43443 and previous config saved to /var/cache/conftool/dbconfig/20230130-065247-ladsgroup.json
[06:52:51] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[06:55:44] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:55:56] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:56:07] <marostegui>	 ^ me
[06:58:00] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:58:30] <wikibugs>	 SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Marostegui)
[06:58:34] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T328135 (Marostegui)
[06:58:50] <wikibugs>	 SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Marostegui) The task gets generated fine, but still a bit unreadable as shown on T328135 Leaving this task open until @MoritzMuehlenhoff takes a look...
[06:59:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T318605)', diff saved to https://phabricator.wikimedia.org/P43444 and previous config saved to /var/cache/conftool/dbconfig/20230130-065943-ladsgroup.json
[06:59:48] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:59:48] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[07:01:16] <marostegui>	 !log dbmaint Schema change on s4 eqiad T328086
[07:01:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:20] <stashbot>	 T328086: Drop cul_user and cul_user_text from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328086
[07:02:27] <marostegui>	 !log dbmaint Schema change on s1 eqiad T328086
[07:02:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:48] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:05:14] <marostegui>	 !log dbmaint Schema change on s3 eqiad T328086
[07:05:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:22] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:05:34] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:05:36] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:05:38] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:07:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P43445 and previous config saved to /var/cache/conftool/dbconfig/20230130-070753-ladsgroup.json
[07:10:08] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[07:10:31] <marostegui>	 !log dbmaint Schema change on s8 eqiad T328236
[07:10:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:34] <stashbot>	 T328236: Add default value to cul_reason on WMF wikis - https://phabricator.wikimedia.org/T328236
[07:11:00] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[07:11:28] <marostegui>	 !log dbmaint Schema change on s5 eqiad T328236
[07:11:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:30] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[07:14:32] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[07:14:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P43446 and previous config saved to /var/cache/conftool/dbconfig/20230130-071450-ladsgroup.json
[07:16:55] <marostegui>	 !log dbmaint Schema change on s6 eqiad T328236
[07:16:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:59] <stashbot>	 T328236: Add default value to cul_reason on WMF wikis - https://phabricator.wikimedia.org/T328236
[07:17:52] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[07:17:57] <marostegui>	 !log dbmaint Schema change on s4 eqiad T328236
[07:17:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:47] <marostegui>	 !log dbmaint Schema change on s1 eqiad T328236
[07:21:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P43447 and previous config saved to /var/cache/conftool/dbconfig/20230130-072300-ladsgroup.json
[07:25:26] <marostegui>	 !log dbmaint Schema change on s1 eqiad T328236
[07:25:28] <marostegui>	 !log dbmaint Schema change on s2 eqiad T328236
[07:25:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:32] <stashbot>	 T328236: Add default value to cul_reason on WMF wikis - https://phabricator.wikimedia.org/T328236
[07:25:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:29] <marostegui>	 !log dbmaint Schema change on s7 eqiad T328236
[07:26:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P43448 and previous config saved to /var/cache/conftool/dbconfig/20230130-072956-ladsgroup.json
[07:32:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:38:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P43449 and previous config saved to /var/cache/conftool/dbconfig/20230130-073806-ladsgroup.json
[07:38:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance
[07:38:11] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[07:38:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance
[07:38:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T318605)', diff saved to https://phabricator.wikimedia.org/P43450 and previous config saved to /var/cache/conftool/dbconfig/20230130-073827-ladsgroup.json
[07:45:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T318605)', diff saved to https://phabricator.wikimedia.org/P43451 and previous config saved to /var/cache/conftool/dbconfig/20230130-074502-ladsgroup.json
[07:45:08] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[07:46:14] <wikibugs>	 (CR) Kosta Harlan: [C: +1] Enable WelcomeSurvey at viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/883301 (https://phabricator.wikimedia.org/T325376) (owner: Gergő Tisza)
[07:48:39] <moritzm>	 T327867!log installing install2004 
[07:48:40] <stashbot>	 T327867: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867
[07:50:15] <moritzm>	 !log installing install2004 T327867
[07:50:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:22] <logmsgbot>	 !log phedenskog@deploy1002 Started deploy [performance/navtiming@bfbd6d7]: (no justification provided)
[07:54:28] <logmsgbot>	 !log phedenskog@deploy1002 Finished deploy [performance/navtiming@bfbd6d7]: (no justification provided) (duration: 00m 05s)
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:00:15] <Amir1>	 no gerrit patches :)
[08:00:46] * zabe is going to deploy a sec patch
[08:01:04] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T318605)', diff saved to https://phabricator.wikimedia.org/P43452 and previous config saved to /var/cache/conftool/dbconfig/20230130-081011-ladsgroup.json
[08:10:21] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[08:19:21] <logmsgbot>	 !log zabe: Deployed security patch for T278365
[08:23:02] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:25:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P43454 and previous config saved to /var/cache/conftool/dbconfig/20230130-082517-ladsgroup.json
[08:28:30] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[08:28:44] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[08:29:23] <wikibugs>	 (PS1) Jcrespo: Add the "very_stale" HTML style as a red label [software/pampinus] - https://gerrit.wikimedia.org/r/884820
[08:30:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[08:30:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[08:30:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P43455 and previous config saved to /var/cache/conftool/dbconfig/20230130-083034-ladsgroup.json
[08:30:39] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[08:39:13] <wikibugs>	 (CR) Ladsgroup: [C: -1] Enable Linter write namespace, tag and template from core, group0 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey)
[08:40:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P43456 and previous config saved to /var/cache/conftool/dbconfig/20230130-084024-ladsgroup.json
[08:40:43] <wikibugs>	 (CR) Marostegui: "I am trying to think a good way to deploy this safely. The change looks good, but maybe we should disable puppet on all databases, get thi" [puppet] - https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup)
[08:42:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P43457 and previous config saved to /var/cache/conftool/dbconfig/20230130-084213-ladsgroup.json
[08:42:17] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[08:48:46] <moritzm>	 !log installing install1004 T327867
[08:48:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:50] <stashbot>	 T327867: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867
[08:51:06] <wikibugs>	 (CR) Ladsgroup: mariadb: Centralize and change wikiadmin user grants (1 comment) [puppet] - https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup)
[08:53:00] <wikibugs>	 (CR) Marostegui: mariadb: Centralize and change wikiadmin user grants (1 comment) [puppet] - https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup)
[08:55:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T318605)', diff saved to https://phabricator.wikimedia.org/P43458 and previous config saved to /var/cache/conftool/dbconfig/20230130-085530-ladsgroup.json
[08:55:35] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[08:56:59] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "LGTM, minus the fact you also need to add the services to the allowed_listeners list below (see comment inline)" [puppet] - https://gerrit.wikimedia.org/r/838182 (https://phabricator.wikimedia.org/T143553) (owner: Ebernhardson)
[08:57:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P43459 and previous config saved to /var/cache/conftool/dbconfig/20230130-085719-ladsgroup.json
[08:57:52] <wikibugs>	 (PS3) Giuseppe Lavagetto: docker_registry_ha: remove unused cache::nodes ref [puppet] - https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) (owner: BBlack)
[09:12:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P43460 and previous config saved to /var/cache/conftool/dbconfig/20230130-091225-ladsgroup.json
[09:17:35] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39297/console" [puppet] - https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) (owner: BBlack)
[09:18:18] <wikibugs>	 (CR) Clément Goubert: "LGTM, question inline," [deployment-charts] - https://gerrit.wikimedia.org/r/884360 (owner: Giuseppe Lavagetto)
[09:19:14] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: remove rate of ingestion percent change compared to yesterday alert [alerts] - https://gerrit.wikimedia.org/r/884349 (https://phabricator.wikimedia.org/T202307) (owner: Herron)
[09:19:32] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] docker_registry_ha: remove unused cache::nodes ref [puppet] - https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) (owner: BBlack)
[09:19:49] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: remove recording rule for CPU benchmark. [puppet] - https://gerrit.wikimedia.org/r/881632 (https://phabricator.wikimedia.org/T321398) (owner: Phedenskog)
[09:22:14] <wikibugs>	 (PS2) Awight: Enable kartographer external data parse time fetch for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879559 (https://phabricator.wikimedia.org/T326317) (owner: Svantje Lilienthal)
[09:23:18] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (Clement_Goubert)
[09:23:31] <wikibugs>	 (CR) Clément Goubert: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/883933 (https://phabricator.wikimedia.org/T328015) (owner: Clément Goubert)
[09:25:01] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "Hi, and thanks for taking this on. In fact, we have a task dedicated to this problem, https://phabricator.wikimedia.org/T292818, and I'm w" [deployment-charts] - https://gerrit.wikimedia.org/r/865654 (owner: Awight)
[09:25:28] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (Clement_Goubert) a:Clement_Goubert→herron Handing off to this week's Clinic Duty SRE. @herron you should just have to merge the CR and create the...
[09:27:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P43461 and previous config saved to /var/cache/conftool/dbconfig/20230130-092732-ladsgroup.json
[09:27:33] <wikibugs>	 (CR) WMDE-Fisch: [C: +1] Enable kartographer external data parse time fetch for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879559 (https://phabricator.wikimedia.org/T326317) (owner: Svantje Lilienthal)
[09:27:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[09:27:37] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[09:27:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[09:28:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43462 and previous config saved to /var/cache/conftool/dbconfig/20230130-092804-ladsgroup.json
[09:28:39] <wikibugs>	 (CR) Filippo Giunchedi: "A recommendation inline re: readability, otherwise LGTM" [alerts] - https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) (owner: Giuseppe Lavagetto)
[09:29:01] <jynus>	 !log disabling puppet on dbprov2004 to reorganize partitions T327155
[09:29:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:04] <stashbot>	 T327155: Setup dbprov1004 an dbprov2004 as an expansion of the dbprov (database provisioning) cluster, in preparation of binlog backups backup implementation - https://phabricator.wikimedia.org/T327155
[09:32:26] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] sre-mediawiki: port the other prometheus-based alerts [alerts] - https://gerrit.wikimedia.org/r/883950 (owner: Giuseppe Lavagetto)
[09:38:11] <wikibugs>	 (CR) Klausman: [C: +1] Add new fake pems for the mlserve's pki intermediates [labs/private] - https://gerrit.wikimedia.org/r/883632 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:38:17] <wikibugs>	 (PS3) Awight: Enable kartographer external data parse time fetch for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879559 (https://phabricator.wikimedia.org/T326317) (owner: Svantje Lilienthal)
[09:39:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43463 and previous config saved to /var/cache/conftool/dbconfig/20230130-093941-ladsgroup.json
[09:39:46] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[09:39:51] <wikibugs>	 (PS3) Btullis: Update the spark images to remove upstream support for the webhook [docker-images/production-images] - https://gerrit.wikimedia.org/r/864770 (https://phabricator.wikimedia.org/T318926)
[09:40:04] <wikibugs>	 (CR) Btullis: Update the spark images to remove upstream support for the webhook (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/864770 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[09:40:48] <wikibugs>	 (CR) Btullis: [V: +2 C: +2] Update the spark images to remove upstream support for the webhook [docker-images/production-images] - https://gerrit.wikimedia.org/r/864770 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[09:44:51] <wikibugs>	 (CR) Klausman: [C: +1] admin_ng: update ml-serve-codfw's settings for k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:47:01] <wikibugs>	 (CR) Klausman: role::ml_k8s::staging: upgrade cluster settings for k8s 1.23 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/884034 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:47:08] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "The change LGTM; the check_all_memcached.php nagios check is now unused and I'll remove it. I'll rebase this patch on top of that change a" [puppet] - https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) (owner: Reedy)
[09:47:25] <awight>	 Is there anything usual happening with the SSH bastions?  I'm having no luck logging in through bast1003 or bast3006.
[09:47:42] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Upgrade lists.wikimedia.org to next Mailman/hyperkitty/postorius versions - https://phabricator.wikimedia.org/T286217 (Ladsgroup) Mailman really doesn't have an owner yet. Kunal and I did just the upgrade from 2 to 3 due its severe limitations and security issues. I have way...
[09:48:11] <awight>	 *unusual
[09:49:03] <godog>	 awight: not afaict, I'm using bast3006 and it works
[09:49:32] <taavi>	 awight: are you getting some error messages? what does your ssh config look like?
[09:49:36] <awight>	 Thanks for the confirmation!  Something's happening now but *very* slowly, it must be my network.
[09:49:44] <awight>	 (and I'm finally in)
[09:49:57] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "If we let puppet pick systemd as the agent, then we also need to probably change the restart command to be a systemd-driven reload." [puppet] - https://gerrit.wikimedia.org/r/869199 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[09:51:03] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] wmfdebug 0.0.6: Include the wmf-certificates package [docker-images/production-images] - https://gerrit.wikimedia.org/r/875439 (owner: Ahmon Dancy)
[09:51:22] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by awight@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/879559 (https://phabricator.wikimedia.org/T326317) (owner: Svantje Lilienthal)
[09:52:05] <wikibugs>	 (Merged) jenkins-bot: Enable kartographer external data parse time fetch for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879559 (https://phabricator.wikimedia.org/T326317) (owner: Svantje Lilienthal)
[09:52:13] <XioNoX>	 !log push pfw policies - T328085
[09:52:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:23] <logmsgbot>	 !log awight@deploy1002 Started scap: Backport for [[gerrit:879559|Enable kartographer external data parse time fetch for all wikis (T326317)]]
[09:52:26] <stashbot>	 T326317: Deploy geoshape expansion to wikis - https://phabricator.wikimedia.org/T326317
[09:54:05] <logmsgbot>	 !log awight@deploy1002 lilients and awight: Backport for [[gerrit:879559|Enable kartographer external data parse time fetch for all wikis (T326317)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[09:54:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P43464 and previous config saved to /var/cache/conftool/dbconfig/20230130-095447-ladsgroup.json
[09:59:24] <wikibugs>	 (CR) Klausman: [C: +1] wmf-config: add new revision-score streams for EventGate main [mediawiki-config] - https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: Elukey)
[09:59:29] <wikibugs>	 SRE, ops-codfw, cloud-services-team (Kanban): Rack new cloud-dev servers in same rack - https://phabricator.wikimedia.org/T267662 (ayounsi)
[10:00:17] <logmsgbot>	 !log awight@deploy1002 Finished scap: Backport for [[gerrit:879559|Enable kartographer external data parse time fetch for all wikis (T326317)]] (duration: 07m 53s)
[10:00:21] <stashbot>	 T326317: Deploy geoshape expansion to wikis - https://phabricator.wikimedia.org/T326317
[10:02:58] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (ayounsi) I don't have any issue with that. Cabling is at your discretion.
[10:04:06] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 49544
[10:08:29] <icinga-wm>	 ACKNOWLEDGEMENT - snapshot of s2 in codfw on backupmon1001 is CRITICAL: snapshot for s2 at codfw (db2097) taken more than 3 days ago: Most recent backup 2023-01-26 00:09:59 Jcrespo rerunning after refactoring issues - The acknowledgement expires at: 2023-01-31 07:05:37. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[10:08:29] <icinga-wm>	 ACKNOWLEDGEMENT - snapshot of s3 in codfw on backupmon1001 is CRITICAL: snapshot for s3 at codfw (db2139) taken more than 3 days ago: Most recent backup 2023-01-25 11:41:40 Jcrespo rerunning after refactoring issues - The acknowledgement expires at: 2023-01-31 07:05:37. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[10:09:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P43465 and previous config saved to /var/cache/conftool/dbconfig/20230130-100954-ladsgroup.json
[10:11:16] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm, I'll import the actual secrets to private puppet in a moment" [labs/private] - https://gerrit.wikimedia.org/r/884325 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche)
[10:11:44] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 49544
[10:15:30] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 14593
[10:16:31] <wikibugs>	 (PS1) Jcrespo: bacula: Increase the maximum number of volumes on es-rw backups to 250 [puppet] - https://gerrit.wikimedia.org/r/884831 (https://phabricator.wikimedia.org/T313582)
[10:16:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast4003.wikimedia.org
[10:17:18] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts bast4003.wikimedia.org
[10:17:21] <wikibugs>	 (PS2) Jcrespo: bacula: Increase the maximum number of volumes on es-rw backups to 250 [puppet] - https://gerrit.wikimedia.org/r/884831 (https://phabricator.wikimedia.org/T313582)
[10:17:32] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 14593
[10:19:38] <wikibugs>	 (CR) Marostegui: [C: +1] bacula: Increase the maximum number of volumes on es-rw backups to 250 [puppet] - https://gerrit.wikimedia.org/r/884831 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[10:19:47] <wikibugs>	 (PS1) Muehlenhoff: Remove previos bastions from bastion_host list [puppet] - https://gerrit.wikimedia.org/r/884832 (https://phabricator.wikimedia.org/T324974)
[10:20:13] <wikibugs>	 (CR) Jcrespo: [C: +2] bacula: Increase the maximum number of volumes on es-rw backups to 250 [puppet] - https://gerrit.wikimedia.org/r/884831 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[10:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:21:01] <wikibugs>	 (PS1) FNegri: P:wmcs::services: simplify toolsdb pinning [puppet] - https://gerrit.wikimedia.org/r/884833 (https://phabricator.wikimedia.org/T328273)
[10:21:24] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:24] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove previos bastions from bastion_host list [puppet] - https://gerrit.wikimedia.org/r/884832 (https://phabricator.wikimedia.org/T324974) (owner: Muehlenhoff)
[10:25:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43466 and previous config saved to /var/cache/conftool/dbconfig/20230130-102500-ladsgroup.json
[10:25:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[10:25:05] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[10:25:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[10:25:20] <wikibugs>	 (CR) Jelto: [C: +2] "I added the files to private puppet" [labs/private] - https://gerrit.wikimedia.org/r/884325 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche)
[10:26:35] <wikibugs>	 (PS1) Muehlenhoff: Update Cumin alias for bastion canary [puppet] - https://gerrit.wikimedia.org/r/884834
[10:27:18] <wikibugs>	 (CR) Jcrespo: [C: +2] "update command needs to be run after deploy so it gets sent from the director to the storage daemons." [puppet] - https://gerrit.wikimedia.org/r/884831 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[10:28:50] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Update Cumin alias for bastion canary [puppet] - https://gerrit.wikimedia.org/r/884834 (owner: Muehlenhoff)
[10:29:02] <wikibugs>	 SRE, Platform Engineering, serviceops, Patch-For-Review, Performance-Team (Radar): Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 (jijiki)
[10:29:46] <wikibugs>	 SRE, serviceops, Patch-For-Review, Performance-Team (Radar): Phase out nutcracker from mediawiki servers - https://phabricator.wikimedia.org/T277183 (jijiki) Open→Resolved This work is done
[10:30:50] <wikibugs>	 (PS1) Ladsgroup: Enable write both for externallinks except s4, s7, s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/884837 (https://phabricator.wikimedia.org/T321662)
[10:30:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast4003.wikimedia.org
[10:31:39] <wikibugs>	 (CR) Krinkle: Remove nutcracker from cloudweb hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) (owner: Majavah)
[10:32:26] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[10:33:17] <wikibugs>	 (CR) Krinkle: Remove nutcracker from cloudweb hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) (owner: Majavah)
[10:33:54] <wikibugs>	 (CR) Krinkle: Remove nutcracker from cloudweb hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) (owner: Majavah)
[10:34:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:34:26] <Amir1>	 jouncebot: nowandnext
[10:34:26] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 25 minute(s)
[10:34:26] <jouncebot>	 In 0 hour(s) and 25 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1100)
[10:34:33] <Amir1>	 good
[10:34:38] <wikibugs>	 (CR) Ladsgroup: [C: +2] Enable write both for externallinks except s4, s7, s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/884837 (https://phabricator.wikimedia.org/T321662) (owner: Ladsgroup)
[10:35:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[10:35:22] <wikibugs>	 (Merged) jenkins-bot: Enable write both for externallinks except s4, s7, s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/884837 (https://phabricator.wikimedia.org/T321662) (owner: Ladsgroup)
[10:35:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[10:35:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43467 and previous config saved to /var/cache/conftool/dbconfig/20230130-103540-ladsgroup.json
[10:35:44] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[10:36:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:36:18] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:884837|Enable write both for externallinks except s4, s7, s8 (T321662)]]
[10:36:20] <wikibugs>	 (CR) Jelto: [V: +2 C: +2] jenkins: add secrets for releasing instance [labs/private] - https://gerrit.wikimedia.org/r/884325 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche)
[10:36:22] <stashbot>	 T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662
[10:37:56] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:884837|Enable write both for externallinks except s4, s7, s8 (T321662)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[10:38:20] <wikibugs>	 (PS4) Thiemo Kreuz (WMDE): Remove some unused LAMP config [deployment-charts] - https://gerrit.wikimedia.org/r/865654 (https://phabricator.wikimedia.org/T292818) (owner: Awight)
[10:38:57] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] P:wmcs::services: simplify toolsdb pinning [puppet] - https://gerrit.wikimedia.org/r/884833 (https://phabricator.wikimedia.org/T328273) (owner: FNegri)
[10:40:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast4003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:43:12] <Amir1>	 okay, tested in a wiki in s5, s6 and s1, the replication didn't break
[10:43:16] <Amir1>	 moving forward
[10:46:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast4003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:46:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:46:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast4003.wikimedia.org
[10:46:59] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast4003.wikimedia.org` - bast4003.wikimedia.org (**PASS**)   - Downtimed host on Icinga/Alertmanager...
[10:47:06] <wikibugs>	 SRE, serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Lucas_Werkmeister_WMDE) >>! In T306995#8128358, @Michael wrote: > Glancing at the repository, I'm not sure if there is anything that you need from us to migrate `wikibase/termbox` on Wikidata...
[10:47:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43468 and previous config saved to /var/cache/conftool/dbconfig/20230130-104735-ladsgroup.json
[10:47:40] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[10:48:42] <wikibugs>	 SRE, Data-Engineering, GrowthExperiments-ImpactModule, Growth-Team (Current Sprint), MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data ponit - https://phabricator.wikimedia.org/T324675 (kostajh) I'm writing th...
[10:49:29] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:884837|Enable write both for externallinks except s4, s7, s8 (T321662)]] (duration: 13m 10s)
[10:49:33] <stashbot>	 T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662
[10:51:48] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Machine-Learning-Team: httpbb with HTTP POSTs and json payload - https://phabricator.wikimedia.org/T328280 (elukey)
[10:54:37] <wikibugs>	 (CR) FNegri: [C: +2] P:wmcs::services: simplify toolsdb pinning [puppet] - https://gerrit.wikimedia.org/r/884833 (https://phabricator.wikimedia.org/T328273) (owner: FNegri)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1100)
[11:01:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install4002.wikimedia.org
[11:01:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:02:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P43470 and previous config saved to /var/cache/conftool/dbconfig/20230130-110241-ladsgroup.json
[11:03:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install4002.wikimedia.org - jmm@cumin2002"
[11:04:35] <wikibugs>	 (PS1) Muehlenhoff: Remove bast4003 [puppet] - https://gerrit.wikimedia.org/r/884845 (https://phabricator.wikimedia.org/T324974)
[11:04:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install4002.wikimedia.org - jmm@cumin2002"
[11:04:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:04:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install4002.wikimedia.org on all recursors
[11:04:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install4002.wikimedia.org on all recursors
[11:05:21] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host htmldumper1001.eqiad.wmnet
[11:06:05] <wikibugs>	 (PS3) Ladsgroup: mariadb: Centralize and change wikiadmin user grants [puppet] - https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802)
[11:06:10] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] mariadb: Centralize and change wikiadmin user grants (1 comment) [puppet] - https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup)
[11:09:07] <logmsgbot>	 !log phedenskog@deploy1002 Started deploy [performance/navtiming@4e5ff3f]: (no justification provided)
[11:09:12] <logmsgbot>	 !log phedenskog@deploy1002 Finished deploy [performance/navtiming@4e5ff3f]: (no justification provided) (duration: 00m 05s)
[11:11:59] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host htmldumper1001.eqiad.wmnet
[11:12:19] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove bast4003 [puppet] - https://gerrit.wikimedia.org/r/884845 (https://phabricator.wikimedia.org/T324974) (owner: Muehlenhoff)
[11:17:21] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:24] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1003.eqiad.wmnet
[11:17:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P43471 and previous config saved to /var/cache/conftool/dbconfig/20230130-111748-ladsgroup.json
[11:19:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install4002.wikimedia.org
[11:22:29] <wikibugs>	 (PS1) Muehlenhoff: Add install4002 [puppet] - https://gerrit.wikimedia.org/r/884854
[11:24:33] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1003.eqiad.wmnet
[11:24:55] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add install4002 [puppet] - https://gerrit.wikimedia.org/r/884854 (owner: Muehlenhoff)
[11:27:33] <wikibugs>	 (CR) Jbond: [C: +1] httpd: Let Puppet pick the init provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/869199 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[11:27:41] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 132, down: 43, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:28:10] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1004.eqiad.wmnet
[11:31:13] <wikibugs>	 (PS1) Ladsgroup: Drop unused wikiuser2 password [labs/private] - https://gerrit.wikimedia.org/r/884856
[11:32:06] <wikibugs>	 (CR) Muehlenhoff: httpd: Let Puppet pick the init provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/869199 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[11:32:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43472 and previous config saved to /var/cache/conftool/dbconfig/20230130-113254-ladsgroup.json
[11:32:55] <wikibugs>	 (PS5) Superpes15: Create additional namespaces on shn.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850)
[11:32:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:32:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[11:32:59] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[11:33:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[11:33:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[11:33:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[11:33:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43473 and previous config saved to /var/cache/conftool/dbconfig/20230130-113319-ladsgroup.json
[11:33:57] <wikibugs>	 (CR) Marostegui: [C: +1] Drop unused wikiuser2 password [labs/private] - https://gerrit.wikimedia.org/r/884856 (owner: Ladsgroup)
[11:35:03] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:35:12] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1004.eqiad.wmnet
[11:35:28] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1005.eqiad.wmnet
[11:35:38] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[11:40:13] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:31] <Amir1>	 !log dropping old wikiadmin user (T326802)
[11:41:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:35] <stashbot>	 T326802: Rotate wikiuser and wikiadmin passwords - https://phabricator.wikimedia.org/T326802
[11:42:22] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1005.eqiad.wmnet
[11:42:57] <moritzm>	 !log installing install4002 T327867
[11:43:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:01] <stashbot>	 T327867: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867
[11:43:23] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] Drop unused wikiuser2 password [labs/private] - https://gerrit.wikimedia.org/r/884856 (owner: Ladsgroup)
[11:44:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43474 and previous config saved to /var/cache/conftool/dbconfig/20230130-114424-ladsgroup.json
[11:44:29] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[11:48:05] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 42473
[11:49:42] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 42473
[11:49:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast6001.wikimedia.org
[11:51:42] <wikibugs>	 (CR) Jbond: phabricator: change phd home dir to /var/lib/phd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[11:54:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:56:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast6001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[11:57:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast6001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[11:57:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:57:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast6001.wikimedia.org
[11:57:29] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast6001.wikimedia.org` - bast6001.wikimedia.org (**PASS**)   - Downtimed host on Icinga/Alertmanager...
[11:58:21] <wikibugs>	 (PS1) EoghanGaffney: Send vrts httpd logs to kafka for ingestion to logstash [puppet] - https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759)
[11:59:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P43475 and previous config saved to /var/cache/conftool/dbconfig/20230130-115930-ladsgroup.json
[12:04:24] <wikibugs>	 (PS2) EoghanGaffney: Send vrts httpd logs to kafka for ingestion to logstash [puppet] - https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759)
[12:04:59] <wikibugs>	 SRE, Infrastructure-Foundations: Repurpose bast3004 as ganeti node - https://phabricator.wikimedia.org/T325361 (MoritzMuehlenhoff) p:Triage→Medium
[12:06:48] <nemo-yiannis>	 Hi, there is a sec patch on wikifeeds waiting for deployment. Is it OK to deploy now?
[12:07:06] <wikibugs>	 (Abandoned) Muehlenhoff: Move ssh-key-ldap-lookup to profile::base::labs [puppet] - https://gerrit.wikimedia.org/r/880883 (owner: Muehlenhoff)
[12:07:13] <wikibugs>	 (Abandoned) Muehlenhoff: Remove ldap::client::utils [puppet] - https://gerrit.wikimedia.org/r/880884 (owner: Muehlenhoff)
[12:07:35] <wikibugs>	 (CR) Muehlenhoff: "recheck" [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff)
[12:11:34] <Lucas_WMDE>	 jouncebot: now
[12:11:34] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 48 minute(s)
[12:11:52] <Lucas_WMDE>	 nemo-yiannis: I think you can probably deploy
[12:12:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast3004.wikimedia.org
[12:12:41] <nemo-yiannis>	 thanks Lucas_WMDE
[12:12:59] <Lucas_WMDE>	 (since I’m not seeing any objections ^^)
[12:13:02] <wikibugs>	 (CR) Jgiannelos: [C: +2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/884110 (owner: PipelineBot)
[12:14:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P43476 and previous config saved to /var/cache/conftool/dbconfig/20230130-121437-ladsgroup.json
[12:16:45] <wikibugs>	 (CR) JMeybohm: [C: +2] kubernetes: Increase inotify limits [puppet] - https://gerrit.wikimedia.org/r/884305 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[12:18:17] <wikibugs>	 (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/884110 (owner: PipelineBot)
[12:18:43] <wikibugs>	 (CR) Jbond: P:installserver::proxy: Add global whitelist and list mappings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond)
[12:22:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:23:05] <logmsgbot>	 !log awight@deploy1002 Started deploy [kartotherian/deploy@42a07d3]: Disable traffic mirroring from codfw to eqiad
[12:24:34] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete template [puppet] - https://gerrit.wikimedia.org/r/884876
[12:25:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[12:25:49] <logmsgbot>	 !log awight@deploy1002 Finished deploy [kartotherian/deploy@42a07d3]: Disable traffic mirroring from codfw to eqiad (duration: 02m 44s)
[12:26:48] <wikibugs>	 SRE, CommRel-Specialists-Support, serviceops, Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Clement_Goubert)
[12:27:19] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps2006 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting
[12:27:19] <icinga-wm>	 /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso
[12:27:19] <icinga-wm>	 {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:27:49] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1007 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting
[12:27:49] <icinga-wm>	 /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso
[12:27:49] <icinga-wm>	 {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:27:57] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps2009 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting
[12:27:57] <icinga-wm>	 /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso
[12:27:57] <icinga-wm>	 {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:28:15] <icinga-wm>	 PROBLEM - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 4
[12:28:15] <icinga-wm>	 cting: 200): /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geol
[12:28:15] <icinga-wm>	 eojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geopoint?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian
[12:28:33] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1010 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting
[12:28:33] <icinga-wm>	 /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso
[12:28:33] <icinga-wm>	 {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:28:33] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1008 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting
[12:28:34] <icinga-wm>	 /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso
[12:28:34] <icinga-wm>	 {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:28:39] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps2010 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting
[12:28:39] <icinga-wm>	 /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso
[12:28:39] <icinga-wm>	 {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:28:45] <wikibugs>	 (CR) Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib (1 comment) [puppet] - https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[12:29:26] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1006 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting
[12:29:26] <icinga-wm>	 /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso
[12:29:26] <icinga-wm>	 {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:29:33] <claime>	 awight: ^ expected?
[12:29:40] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: Add auto_prepend_file to PHP config_cli [puppet] - https://gerrit.wikimedia.org/r/880561 (https://phabricator.wikimedia.org/T253547) (owner: Krinkle)
[12:29:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43477 and previous config saved to /var/cache/conftool/dbconfig/20230130-122943-ladsgroup.json
[12:29:44] <wikibugs>	 (CR) Jbond: "will abandon this change as it no longer seems necessary" [puppet] - https://gerrit.wikimedia.org/r/866594 (owner: Jbond)
[12:29:44] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps2008 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting
[12:29:44] <icinga-wm>	 /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso
[12:29:44] <icinga-wm>	 {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:29:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[12:29:48] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[12:29:48] <wikibugs>	 (Abandoned) Jbond: blackbox::check::http: change expiry check value from days to seconds [puppet] - https://gerrit.wikimedia.org/r/866594 (owner: Jbond)
[12:29:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[12:30:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43478 and previous config saved to /var/cache/conftool/dbconfig/20230130-123004-ladsgroup.json
[12:30:18] <wikibugs>	 (CR) Jbond: convrt-ssds: update cookbook to reimage ms-be with new partition schema (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[12:33:57] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] P:toolforge::grid: install python3-mwparserfromhell [puppet] - https://gerrit.wikimedia.org/r/882220 (https://phabricator.wikimedia.org/T327600) (owner: Majavah)
[12:35:14] <icinga-wm>	 PROBLEM - Kartotherian LVS eqiad on kartotherian.svc.eqiad.wmnet is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 4
[12:35:14] <icinga-wm>	 cting: 200): /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geol
[12:35:14] <icinga-wm>	 eojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geopoint?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian
[12:35:14] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps2005 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting
[12:35:15] <icinga-wm>	 /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso
[12:35:15] <icinga-wm>	 {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:41:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43479 and previous config saved to /var/cache/conftool/dbconfig/20230130-124142-ladsgroup.json
[12:41:47] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[12:44:17] <claime>	 awight: Ping? Are the above Kartotherian errors related to Finished deploy [kartotherian/deploy@42a07d3]: Disable traffic mirroring from codfw to eqiad (duration: 02m 44s)?
[12:44:20] <wikibugs>	 (PS4) Jbond: Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff)
[12:44:41] <awight>	 claime: Yes definitely the fault of this deployment.  I'll roll back now.
[12:44:50] <wikibugs>	 (CR) CI reject: [V: -1] Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff)
[12:44:52] <wikibugs>	 (PS4) Winston Sung: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035)
[12:44:55] <wikibugs>	 (PS5) Winston Sung: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035)
[12:45:08] <logmsgbot>	 !log awight@deploy1002 Started deploy [kartotherian/deploy@5c58f8f]: Roll back kartotherian
[12:46:35] <logmsgbot>	 !log awight@deploy1002 Finished deploy [kartotherian/deploy@5c58f8f]: Roll back kartotherian (duration: 01m 27s)
[12:48:04] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1006 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:48:04] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps1005 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:48:04] <icinga-wm>	 PROBLEM - kartotherian endpoints health on maps2007 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[12:48:52] <awight>	 claime: Rolled back now--are the alerts any healthier?
[12:49:27] <claime>	 No more 301s but I'm still getting 400 for osm-intl/info.json 
[12:49:49] <awight>	 Which URL?  I see https://maps.wikimedia.org/osm-intl/info.json is responding correctly from the browser.
[12:52:11] <wikibugs>	 (PS1) Awight: Revert "Enable kartographer external data parse time fetch for all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/884496 (https://phabricator.wikimedia.org/T323113)
[12:53:41] <wikibugs>	 (PS5) Jbond: Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff)
[12:53:43] <wikibugs>	 (PS1) Jbond: Puppetfile: fix whitespace issue on puppetfile [puppet] - https://gerrit.wikimedia.org/r/884880
[12:53:47] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by awight@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/884496 (https://phabricator.wikimedia.org/T323113) (owner: Awight)
[12:54:00] <wikibugs>	 (PS2) Awight: Revert "Enable kartographer external data parse time fetch for all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/884496 (https://phabricator.wikimedia.org/T323113)
[12:54:06] <claime>	 awight: They're the service checks 
[12:54:07] <wikibugs>	 (CR) Jbond: [C: +2] Puppetfile: fix whitespace issue on puppetfile [puppet] - https://gerrit.wikimedia.org/r/884880 (owner: Jbond)
[12:54:09] <wikibugs>	 (CR) TrainBranchBot: "Approved by awight@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/884496 (https://phabricator.wikimedia.org/T323113) (owner: Awight)
[12:54:11] <wikibugs>	 (PS3) EoghanGaffney: Send vrts httpd logs to kafka for ingestion to logstash [puppet] - https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759)
[12:54:54] <wikibugs>	 (Merged) jenkins-bot: Revert "Enable kartographer external data parse time fetch for all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/884496 (https://phabricator.wikimedia.org/T323113) (owner: Awight)
[12:55:12] <logmsgbot>	 !log awight@deploy1002 Started scap: Backport for [[gerrit:884496|Revert "Enable kartographer external data parse time fetch for all wikis" (T323113)]]
[12:55:13] <logmsgbot>	 !log awight@deploy1002 scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki=aawiki --force-version "1.40.0-wmf.20" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.2oaGSEpQR1"' returned non-zero exit status 255. (duration: 00m 00s)
[12:55:17] <stashbot>	 T323113: [Epic] Move geoshape expansion to Kartographer parse-time - https://phabricator.wikimedia.org/T323113
[12:55:19] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[12:55:43] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[12:55:58] <wikibugs>	 (CR) WMDE-Fisch: [C: +1] "Just for bookkeeping." [mediawiki-config] - https://gerrit.wikimedia.org/r/884496 (https://phabricator.wikimedia.org/T323113) (owner: Awight)
[12:56:25] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[12:56:44] <wikibugs>	 (CR) Jbond: "lgtm see question" [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff)
[12:56:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P43481 and previous config saved to /var/cache/conftool/dbconfig/20230130-125648-ladsgroup.json
[12:57:02] <awight>	 dancy: ^ odd scap backport issue in the logs above
[12:57:08] <wikibugs>	 (PS1) Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - https://gerrit.wikimedia.org/r/884881
[12:57:15] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[12:57:17] <wikibugs>	 (CR) Jbond: redfish: store all manager info for later use (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/836757 (owner: Jbond)
[12:57:43] <wikibugs>	 (CR) Jbond: redfish: store all manager info for later use (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/836757 (owner: Jbond)
[12:58:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[12:58:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:58:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast3004.wikimedia.org
[12:58:16] <wikibugs>	 SRE, Infrastructure-Foundations: Repurpose bast3004 as ganeti node - https://phabricator.wikimedia.org/T325361 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast3004.wikimedia.org` - bast3004.wikimedia.org (**WARN**)   - Downtimed host on Icinga/Alertmanager...
[12:58:18] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) >>! In T327919#8565480, @Papaul wrote: > @cmooney this looks good to me just one question. Is...
[12:59:28] <taavi>	 _joe_: Krinkle: https://gerrit.wikimedia.org/r/c/operations/puppet/+/880561 might be related with the scap errors above
[12:59:30] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[13:00:14] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[13:00:18] <wikibugs>	 (PS1) Muehlenhoff: Remove bast3004/bast6001 [puppet] - https://gerrit.wikimedia.org/r/884882 (https://phabricator.wikimedia.org/T325361)
[13:02:20] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove bast3004/bast6001 [puppet] - https://gerrit.wikimedia.org/r/884882 (https://phabricator.wikimedia.org/T325361) (owner: Muehlenhoff)
[13:02:49] <awight>	 claime: If this is in the "3/3 HARD" failure state, does it require manual intervention to refresh the checks?
[13:03:04] <wikibugs>	 (PS2) Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - https://gerrit.wikimedia.org/r/884881
[13:03:12] <claime>	 awight: I've already forced them once, I'll retry
[13:06:01] <wikibugs>	 (Abandoned) Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - https://gerrit.wikimedia.org/r/884881 (owner: Slyngshede)
[13:07:39] <_joe_>	 taavi: uh I did test a maintenance script :/
[13:08:50] <_joe_>	 sorry I was at lunch
[13:09:02] <_joe_>	 awight: are you waiting for a fix?
[13:09:53] <awight>	 _joe_: No worries.  I have a half-deployed revert but mostly just happy to hear that my breakage is limited to maps.
[13:10:09] <claime>	 I'm trying to debug rn but I can't find the icinga checks
[13:10:11] <awight>	 So don't rush, but please do ping me when the script is fixed.
[13:10:27] <claime>	 I have one that uses service-checker-swagger
[13:10:47] <awight>	 claime: In case it's helpful, https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=maps1005&service=kartotherian+endpoints+health
[13:10:47] <wikibugs>	 (PS1) Giuseppe Lavagetto: Revert "mediawiki: Add auto_prepend_file to PHP config_cli" [puppet] - https://gerrit.wikimedia.org/r/884497
[13:10:56] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Revert "mediawiki: Add auto_prepend_file to PHP config_cli" [puppet] - https://gerrit.wikimedia.org/r/884497 (owner: Giuseppe Lavagetto)
[13:11:52] <_joe_>	 puppet is running
[13:11:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P43482 and previous config saved to /var/cache/conftool/dbconfig/20230130-131155-ladsgroup.json
[13:12:06] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:12:43] <_joe_>	 awight: green light!
[13:13:04] <claime>	 Ah right, the revert didn't finish, that's why we're broken
[13:13:05] <_joe_>	 and apologies again for the breakage, I should've tested a deployment
[13:13:06] <claime>	 ok
[13:13:16] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) (owner: Jbond)
[13:13:32] <_joe_>	 awight: you can re-deploy whenever you want
[13:13:41] <awight>	 _joe_: thanks
[13:14:03] <awight>	 claime: This revert is related, but in a different component.  kartotherian should have recovered already.
[13:14:37] <logmsgbot>	 !log awight@deploy1002 Started scap: Backport for [[gerrit:884496|Revert "Enable kartographer external data parse time fetch for all wikis" (T323113)]]
[13:14:41] <stashbot>	 T323113: [Epic] Move geoshape expansion to Kartographer parse-time - https://phabricator.wikimedia.org/T323113
[13:14:52] <wikibugs>	 (CR) Ayounsi: "LGTM! To be deployed after sending communication to sre-at-large@ (or public ops list) as it can impact people/apps flows (even though it " [puppet] - https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) (owner: Jbond)
[13:15:03] <wikibugs>	 (CR) Ayounsi: [C: +1] P:environment: roll out no proxy config to all hosts [puppet] - https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) (owner: Jbond)
[13:16:12] <logmsgbot>	 !log awight@deploy1002 awight: Backport for [[gerrit:884496|Revert "Enable kartographer external data parse time fetch for all wikis" (T323113)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[13:17:24] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:44] <claime>	 [2023-01-30T12:47:32.575Z] ERROR: kartotherian/580 on maps1005: Unable to create source "osm" Source "osm-pbf" is disabled, possibly due to loading errors (err.levelPath=error)
[13:17:46] <claime>	     Err: Source "osm-pbf" is disabled, possibly due to loading errors
[13:17:50] <claime>	 On the maps servers
[13:18:04] <awight>	 claime: Thanks, we think we found the cause
[13:19:08] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:27] <wikibugs>	 (PS2) JMeybohm: KubernetesAPIErrorRate: make alert v1.23 compatible [alerts] - https://gerrit.wikimedia.org/r/883539 (https://phabricator.wikimedia.org/T322919) (owner: Jelto)
[13:20:51] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@5c58f8f] (codfw): Disable traffic mirroring from codfw to eqiad
[13:21:14] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@5c58f8f] (codfw): Disable traffic mirroring from codfw to eqiad (duration: 00m 22s)
[13:21:39] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@5c58f8f] (eqiad): Disable traffic mirroring from codfw to eqiad
[13:21:51] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@5c58f8f] (eqiad): Disable traffic mirroring from codfw to eqiad (duration: 00m 11s)
[13:22:58] <nemo-yiannis>	 claime: This ^ deployment should fix the error you mention
[13:22:59] <wikibugs>	 (03Restored) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede)
[13:23:11] <logmsgbot>	 !log awight@deploy1002 Finished scap: Backport for [[gerrit:884496|Revert "Enable kartographer external data parse time fetch for all wikis" (T323113)]] (duration: 08m 34s)
[13:23:16] <stashbot>	 T323113: [Epic] Move geoshape expansion to Kartographer parse-time - https://phabricator.wikimedia.org/T323113
[13:23:27] <wikibugs>	 (03PS3) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881
[13:23:37] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] "Nice, thanks! ❤️" [alerts] - 10https://gerrit.wikimedia.org/r/883539 (https://phabricator.wikimedia.org/T322919) (owner: 10Jelto)
[13:24:05] <claime>	 nemo-yiannis: Does it need a service restart to take effect?
[13:24:16] <nemo-yiannis>	 I think scap did it already
[13:24:30] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:24:40] <claime>	 Not for maps1005 at least    Loaded: loaded (/lib/systemd/system/kartotherian.service; enabled; vendor preset: enabled)
[13:24:42] <claime>	    Active: active (running) since Mon 2023-01-30 12:46:06 UTC; 37min ago
[13:24:47] <wikibugs>	 (03Merged) 10jenkins-bot: KubernetesAPIErrorRate: make alert v1.23 compatible [alerts] - 10https://gerrit.wikimedia.org/r/883539 (https://phabricator.wikimedia.org/T322919) (owner: 10Jelto)
[13:25:05] <nemo-yiannis>	 true i just checked the actual config
[13:25:07] <nemo-yiannis>	 let me think
[13:25:07] <claime>	 (that's kartotherian.service)
[13:27:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43483 and previous config saved to /var/cache/conftool/dbconfig/20230130-132701-ladsgroup.json
[13:27:06] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[13:28:00] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@5c58f8f] (eqiad): Disable traffic mirroring from codfw to eqiad
[13:28:02] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:29:13] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@5c58f8f] (eqiad): Disable traffic mirroring from codfw to eqiad (duration: 01m 13s)
[13:29:20] <godog>	 !log bounce logstash on logstash1025 -- GC unhappy causing kafka lag
[13:29:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:35] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[13:29:48] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@5c58f8f] (codfw): Disable traffic mirroring from codfw to eqiad
[13:30:25] <icinga-wm>	 RECOVERY - Kartotherian LVS eqiad on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian
[13:30:25] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[13:30:25] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[13:30:25] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[13:30:26] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[13:30:27] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[13:30:45] <awight>	 claime: Would you say this qualifies for an incident report?  I'm happy to write one if so.
[13:30:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete template [puppet] - 10https://gerrit.wikimedia.org/r/884876 (owner: 10Muehlenhoff)
[13:31:11] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@5c58f8f] (codfw): Disable traffic mirroring from codfw to eqiad (duration: 01m 23s)
[13:31:21] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:27] <icinga-wm>	 RECOVERY - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian
[13:31:27] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps2008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[13:31:27] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[13:31:27] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps2007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[13:31:28] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[13:31:29] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[13:31:32] <claime>	 What was the actual production impact? (I am not very aware of what karthoterian does)
[13:31:58] <claime>	 kartotherian*
[13:32:41] <_joe_>	 uncached map tiles were unavailable to users
[13:33:13] <claime>	 So I'd say yes, it's an incident, especially since it lasted ~1h
[13:33:34] <awight>	 +1 yes I think a lot of people saw broken maps today
[13:33:55] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39302/console" [puppet] - 10https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney)
[13:34:19] <claime>	 https://grafana.wikimedia.org/goto/mqZt4GAVz?orgId=1 would agree
[13:34:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Repurpose bast3004 as ganeti node - https://phabricator.wikimedia.org/T325361 (10MoritzMuehlenhoff)
[13:35:37] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39303/console" [puppet] - 10https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney)
[13:35:43] <claime>	 awight: 12:24:52 / 13:30 GMT for the incident window
[13:36:12] <awight>	 ty!
[13:36:26] * claime afk lunch
[13:36:33] <awight>	 I'll start it a bit earlier just because there was a smaller thing I broke with a side deployment :-/
[13:36:38] <claime>	 ack
[13:36:49] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39304/console" [puppet] - 10https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney)
[13:37:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:38:35] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39305/console" [puppet] - 10https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney)
[13:40:05] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:41:30] <wikibugs>	 (03PS4) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881
[13:41:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede)
[13:42:15] <wikibugs>	 10SRE, 10serviceops, 10wdwb-tech: Migrate wikibase/termbox to newer Node.js version - https://phabricator.wikimedia.org/T328295 (10Lucas_Werkmeister_WMDE)
[13:42:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:43:08] <wikibugs>	 (03PS1) 10Jaime Nuche: jenkins: use Scap3 deployment for releases instances [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909)
[13:43:33] <wikibugs>	 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Lucas_Werkmeister_WMDE) Hm, I notice there’s no corresponding `nodejs14-devel` image in the [Docker registry](https://docker-registry.wikimedia.org/), only `nodejs14-slim` (and same for `node...
[13:43:37] <wikibugs>	 (03PS5) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881
[13:43:53] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:43:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:44:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:44:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P43484 and previous config saved to /var/cache/conftool/dbconfig/20230130-134406-ladsgroup.json
[13:44:10] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[13:47:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:47:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:48:05] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:37] <wikibugs>	 (03PS6) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881
[13:48:39] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm, left one little comment in the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney)
[13:50:06] <wikibugs>	 (03PS7) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881
[13:50:43] <wikibugs>	 (03PS4) 10EoghanGaffney: Send vrts httpd logs to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759)
[13:51:42] <wikibugs>	 (03PS8) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881
[13:52:09] <wikibugs>	 (03CR) 10Jelto: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney)
[13:52:43] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39309/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede)
[13:53:29] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Send vrts httpd logs to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney)
[13:55:51] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:56:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[13:56:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[13:56:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43485 and previous config saved to /var/cache/conftool/dbconfig/20230130-135632-ladsgroup.json
[13:56:36] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[13:56:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P43486 and previous config saved to /var/cache/conftool/dbconfig/20230130-135659-ladsgroup.json
[13:57:40] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Revert "Remove references to mediawiki.Uri" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884500 (https://phabricator.wikimedia.org/T328143)
[13:58:07] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Revert "Rewrite mw.libs.ve.getTargetDataFromHref with URL API" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884501 (https://phabricator.wikimedia.org/T328143)
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1400).
[14:00:05] <jouncebot>	 sbailey and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:14] <Lucas_WMDE>	 o/
[14:00:33] <MatmaRex>	 hi
[14:00:33] <sbailey>	 I am here :-)
[14:00:34] <Lucas_WMDE>	 are we okay to deploy? I saw some alerts earlier
[14:01:28] <Lucas_WMDE>	 ok, looks like the kartotherian stuff is fine again
[14:01:54] <Lucas_WMDE>	 I’ll assume it’s okay to deploy unless someone tells me otherwise :)
[14:02:32] <Lucas_WMDE>	 let’s start with the reverts, those will take a while in CI
[14:02:44] <MatmaRex>	 thanks
[14:02:59] <Lucas_WMDE>	 hm, they’re not merged on master yet
[14:03:08] <Lucas_WMDE>	 but most of the jobs in zuul are done and green
[14:03:16] <Lucas_WMDE>	 so let’s +2 them
[14:03:25] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "Remove references to mediawiki.Uri" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884500 (https://phabricator.wikimedia.org/T328143) (owner: 10Bartosz Dziewoński)
[14:03:29] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "Rewrite mw.libs.ve.getTargetDataFromHref with URL API" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884501 (https://phabricator.wikimedia.org/T328143) (owner: 10Bartosz Dziewoński)
[14:04:17] <awight>	 Lucas_WMDE: +1 kartotherian should be stable again
[14:04:25] <Lucas_WMDE>	 ok thanks
[14:04:35] <sbailey>	 Ah, looks like my patch 884090 (a config patch) is missing a default case. Seeing if I can fix that now.
[14:05:33] <Lucas_WMDE>	 sbailey: the variables also have an extra indentation level compared to their surroundings
[14:06:11] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:22] <sbailey>	 Arrg, one thing to note also is the extension does provide a default of false, so is that also required here?
[14:06:25] <MatmaRex>	 Lucas_WMDE: yeah sorry about that. i wasn't planning on doing this when i woke up today :)
[14:06:35] <Lucas_WMDE>	 ^^
[14:06:51] <Lucas_WMDE>	 sbailey: probably better to be explicit and specify the default, I think
[14:07:04] <Lucas_WMDE>	 I assume this is a temporary setting that will be removed at some point anyway
[14:07:06] <MatmaRex>	 they are just reverts though, so they should be safe (and it works locally)
[14:07:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43487 and previous config saved to /var/cache/conftool/dbconfig/20230130-140710-ladsgroup.json
[14:07:16] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[14:08:44] <sbailey>	 Should the default be false and the group0 true? 
[14:09:14] <Lucas_WMDE>	 yeah, I think so
[14:09:30] <wikibugs>	 (03PS1) 10Jaime Nuche: jenkins: enable Scap3 deployment for active releases instance [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909)
[14:09:48] <wikibugs>	 (03PS1) 10Elukey: ml-services: update docker images for revscoring model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/884892 (https://phabricator.wikimedia.org/T325528)
[14:11:29] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:11:52] <wikibugs>	 (03PS3) 10Sbailey: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612)
[14:12:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P43488 and previous config saved to /var/cache/conftool/dbconfig/20230130-141203-ladsgroup.json
[14:12:32] <wikibugs>	 (03CR) 10Sbailey: "Set default value and fixed indentation, stupid IDE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey)
[14:13:13] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:13:30] <sbailey>	 I think I got 884090 fixed up
[14:13:56] <Lucas_WMDE>	 the indentation is still off, sorry
[14:14:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:14:05] <Lucas_WMDE>	 but the default looks good to me
[14:14:47] <wikibugs>	 (03CR) 10Jbond: C:varnish: Rate limit hotlinking dry-run (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond)
[14:14:49] <wikibugs>	 (03PS4) 10Sbailey: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612)
[14:15:06] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: update docker images for revscoring model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/884892 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey)
[14:15:13] <Lucas_WMDE>	 the last line of each block (“],”) shouldn’t be indented either
[14:15:14] <sbailey>	 ok, now the indentation is fixed
[14:15:49] <wikibugs>	 (03PS5) 10Sbailey: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612)
[14:15:54] <sbailey>	 Whack a mole with the IDE
[14:16:31] <sbailey>	 It is 6am my time so a bit fuzzy
[14:16:35] <Lucas_WMDE>	 ok, now it looks good to me
[14:16:51] <Lucas_WMDE>	 but MatmaRex’ backports are almost done in CI so let’s just do those first
[14:17:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884500 (https://phabricator.wikimedia.org/T328143) (owner: 10Bartosz Dziewoński)
[14:17:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884501 (https://phabricator.wikimedia.org/T328143) (owner: 10Bartosz Dziewoński)
[14:17:13] <sbailey>	 sounds good
[14:17:36] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T328022
[14:17:40] <stashbot>	 T328022: Switchover s4 master (db2110 -> db2140) - https://phabricator.wikimedia.org/T328022
[14:17:59] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T328022
[14:18:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2140 with weight 0 T328022', diff saved to https://phabricator.wikimedia.org/P43489 and previous config saved to /var/cache/conftool/dbconfig/20230130-141822-root.json
[14:18:29] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:18:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T328022
[14:18:44] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Remove references to mediawiki.Uri" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884500 (https://phabricator.wikimedia.org/T328143) (owner: 10Bartosz Dziewoński)
[14:19:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:19:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T328022
[14:19:39] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2140 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/883519 (https://phabricator.wikimedia.org/T328022) (owner: 10Gerrit maintenance bot)
[14:19:41] <wikibugs>	 (03PS5) 10Jelto: sre.gitlab.upgrade: check current and target version [cookbooks] - 10https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569)
[14:19:44] <wikibugs>	 (03PS6) 10Sbailey: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612)
[14:21:03] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Rewrite mw.libs.ve.getTargetDataFromHref with URL API" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884501 (https://phabricator.wikimedia.org/T328143) (owner: 10Bartosz Dziewoński)
[14:21:19] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:884500|Revert "Remove references to mediawiki.Uri" (T328143)]], [[gerrit:884501|Revert "Rewrite mw.libs.ve.getTargetDataFromHref with URL API" (T328143)]]
[14:21:23] <stashbot>	 T328143: Machine Translation is broken when content has a link - https://phabricator.wikimedia.org/T328143
[14:22:03] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P43490 and previous config saved to /var/cache/conftool/dbconfig/20230130-142216-ladsgroup.json
[14:22:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 matmarex and lucaswerkmeister-wmde: Backport for [[gerrit:884500|Revert "Remove references to mediawiki.Uri" (T328143)]], [[gerrit:884501|Revert "Rewrite mw.libs.ve.getTargetDataFromHref with URL API" (T328143)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[14:23:12] <Lucas_WMDE>	 MatmaRex: can you test the reverts?
[14:24:18] <MatmaRex>	 yeah
[14:24:22] <MatmaRex>	 looking
[14:24:26] <Lucas_WMDE>	 ok
[14:25:00] <Lucas_WMDE>	 for a second I thought we might do a “can you?” “yes.” “will you?” “yes.” “…” routine :P
[14:25:23] <wikibugs>	 (03PS9) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881
[14:25:25] <MatmaRex>	 bah i can't. this is silly
[14:25:28] <MatmaRex>	 Access to XMLHttpRequest at 'https://cxserver.wikimedia.org/v2/page/fr/pl/Coquille_Saint-Jacques' from origin 'https://pl.wikipedia.org' has been blocked by CORS policy: Request header field x-wikimedia-debug is not allowed by Access-Control-Allow-Headers in preflight response.
[14:25:38] <Lucas_WMDE>	 blerghl
[14:25:44] <Lucas_WMDE>	 that’s annoying
[14:25:46] <MatmaRex>	 very annoying
[14:25:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede)
[14:25:53] <Lucas_WMDE>	 there’s probably a phab task for it
[14:26:03] <MatmaRex>	 i wonder if there's some easy way to hack around that
[14:26:16] <taavi>	 just disable CORS, what could go wrong?
[14:26:30] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update docker images for revscoring model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/884892 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey)
[14:26:49] <Lucas_WMDE>	 there it is https://phabricator.wikimedia.org/T252826
[14:26:56] <MatmaRex>	 well if you disable it, then the thing won't work at all, it needs CORS to work
[14:26:59] <Lucas_WMDE>	 we can probably just sync this? it’s a revert, should be relatively safe…
[14:27:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P43491 and previous config saved to /var/cache/conftool/dbconfig/20230130-142708-ladsgroup.json
[14:27:14] <MatmaRex>	 i think it's safe, santhosh said it worked locally for him
[14:27:23] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:28] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] wikireplicas: drop views for pagetriage_log [puppet] - 10https://gerrit.wikimedia.org/r/884454 (https://phabricator.wikimedia.org/T325519) (owner: 10Majavah)
[14:27:30] <Lucas_WMDE>	 oh nevermind, the task I linked is for rest / query service / whatever
[14:27:33] <Lucas_WMDE>	 but similar at least
[14:27:38] <Lucas_WMDE>	 ok, syncing
[14:27:39] <MatmaRex>	 VE itself works fine on mwdebug
[14:27:41] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: check current and target version [cookbooks] - 10https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[14:27:43] <MatmaRex>	 i only can't test CX
[14:27:58] <wikibugs>	 (03PS1) 10Btullis: Revert changes to the maven proxy configuration that didn't work [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884896 (https://phabricator.wikimedia.org/T318926)
[14:28:31] <MatmaRex>	 i'll file a bug for this afterwards
[14:28:45] <Lucas_WMDE>	 thanks
[14:29:44] <wikibugs>	 (03Merged) 10jenkins-bot: sre.gitlab.upgrade: check current and target version [cookbooks] - 10https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[14:29:48] <wikibugs>	 (03PS2) 10Btullis: Revert changes to the maven proxy configuration that didn't work [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884896 (https://phabricator.wikimedia.org/T318926)
[14:30:20] <wikibugs>	 (03CR) 10Btullis: [V: 03+2 C: 03+2] Revert changes to the maven proxy configuration that didn't work [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884896 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[14:30:26] <wikibugs>	 (03PS2) 10Matthias Mullie: Fix URL construction [extensions/SearchVue] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/875909
[14:30:32] <wikibugs>	 (03PS1) 10JMeybohm: Switch staging.svc.eqiad.wmnet to point to codfw k8s [dns] - 10https://gerrit.wikimedia.org/r/884900 (https://phabricator.wikimedia.org/T327664)
[14:32:01] <wikibugs>	 (03Abandoned) 10Matthias Mullie: Fix URL construction [extensions/SearchVue] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/875909 (owner: 10Matthias Mullie)
[14:32:26] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[14:33:27] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:884500|Revert "Remove references to mediawiki.Uri" (T328143)]], [[gerrit:884501|Revert "Rewrite mw.libs.ve.getTargetDataFromHref with URL API" (T328143)]] (duration: 12m 07s)
[14:33:31] <stashbot>	 T328143: Machine Translation is broken when content has a link - https://phabricator.wikimedia.org/T328143
[14:34:01] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey)
[14:34:33] <sbailey>	 :-)
[14:34:48] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey)
[14:35:03] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:884090|Enable Linter write namespace, tag and template from core, group0 (T299612)]]
[14:35:05] <MatmaRex>	 (i filed https://phabricator.wikimedia.org/T328310)
[14:35:07] <stashbot>	 T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612
[14:36:07] <Lucas_WMDE>	 thanks MatmaRex 
[14:36:09] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:36:09] <MatmaRex>	 Lucas_WMDE: my reverts are live, right? thanks
[14:36:17] <Lucas_WMDE>	 they should be, yeha
[14:36:17] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q3): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (10fnegri)
[14:36:18] <Lucas_WMDE>	 *yeah
[14:36:30] <MatmaRex>	 yeah. things are working as expected now
[14:36:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and sbailey: Backport for [[gerrit:884090|Enable Linter write namespace, tag and template from core, group0 (T299612)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[14:36:53] <wikibugs>	 (03PS1) 10JMeybohm: Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664)
[14:37:03] <Lucas_WMDE>	 sbailey: can you test the change on mwdebug?
[14:37:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm)
[14:37:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P43492 and previous config saved to /var/cache/conftool/dbconfig/20230130-143723-ladsgroup.json
[14:38:01] <_joe_>	 jouncebot: now and next5
[14:38:01] <jouncebot>	 For the next 0 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1400)
[14:38:04] <wikibugs>	 (03PS2) 10JMeybohm: Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664)
[14:38:12] <sbailey>	 This is run from a job queue, so no, but I can look at the database and see if the columns are being populated
[14:38:27] <_joe_>	 Lucas_WMDE: can you ping me when you're done, if you're doing the deployments?
[14:38:27] <Lucas_WMDE>	 ok, but probably only after it’s synced everywhere then
[14:38:31] <Lucas_WMDE>	 _joe_: sure
[14:38:33] <sbailey>	 yes
[14:38:37] <_joe_>	 thanks <3
[14:38:38] <Lucas_WMDE>	 ok
[14:38:41] <wikibugs>	 (03CR) 10Jaime Nuche: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[14:38:44] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[14:38:48] <wikibugs>	 (03CR) 10Jaime Nuche: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[14:38:53] <Lucas_WMDE>	 I’ll just quickly check that nothing is broken
[14:38:55] <wikibugs>	 (03CR) 10Jaime Nuche: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[14:38:59] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: split wdqs SLIs in a new group [puppet] - 10https://gerrit.wikimedia.org/r/884906 (https://phabricator.wikimedia.org/T328306)
[14:39:30] <Lucas_WMDE>	 hm, https://test.wikidata.org/wiki/Special:LintErrors?namespace=8&titlesearch=&exactmatch=1 gives me “namespace and/or pagename not found or malformed” o_O
[14:39:37] <Lucas_WMDE>	 but it’s the same with or without x-wikimedia-debug
[14:39:56] <Lucas_WMDE>	 ok, https://test.wikidata.org/wiki/Special:LintErrors?namespace=0&titlesearch=A&exactmatch= works
[14:40:15] <Lucas_WMDE>	 let’s sync then
[14:41:14] <sbailey>	 yes, last time there were two straggling DBs that missed the column additions; that was resolved.
[14:41:14] <sbailey>	 Maybe there are more stragglers, though Amir did a report that verified all were updated
[14:41:19] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "thanks, sgtm for near term fix" [puppet] - 10https://gerrit.wikimedia.org/r/884906 (https://phabricator.wikimedia.org/T328306) (owner: 10Filippo Giunchedi)
[14:41:27] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P43493 and previous config saved to /var/cache/conftool/dbconfig/20230130-144213-ladsgroup.json
[14:43:13] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3051.esams.wmnet with OS bullseye
[14:43:18] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp3051.esams.wmnet with OS bullseye
[14:43:22] <sbailey>	 If a db missed the addition of linter_namespace, linter_tag and linter_template, the code will error :-(
[14:43:22] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39319/console" [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm)
[14:44:07] * claime back
[14:46:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:884090|Enable Linter write namespace, tag and template from core, group0 (T299612)]] (duration: 11m 11s)
[14:46:19] <stashbot>	 T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612
[14:46:44] <Lucas_WMDE>	 _joe_: I’m done, assuming there are no errors from the last deployment
[14:46:56] * Lucas_WMDE sees /tmp/joetest in logwatch ^^
[14:47:07] <_joe_>	 Lucas_WMDE: erheh ahem
[14:47:08] <_joe_>	 cough
[14:47:23] <wikibugs>	 (03CR) 10Herron: [C: 03+2] admin: Add abhas to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/883933 (https://phabricator.wikimedia.org/T328015) (owner: 10Clément Goubert)
[14:47:40] <moritzm>	 !log updating puppetdb 7 hosts to 7.12.1 T321783
[14:47:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:44] <stashbot>	 T321783: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783
[14:50:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: split wdqs SLIs in a new group [puppet] - 10https://gerrit.wikimedia.org/r/884906 (https://phabricator.wikimedia.org/T328306) (owner: 10Filippo Giunchedi)
[14:51:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] "This patch was reverted because scap got the following error:" [puppet] - 10https://gerrit.wikimedia.org/r/880561 (https://phabricator.wikimedia.org/T253547) (owner: 10Krinkle)
[14:52:03] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:52:26] <sbailey>	 testing linter errors on testwiki
[14:52:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43494 and previous config saved to /var/cache/conftool/dbconfig/20230130-145229-ladsgroup.json
[14:52:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[14:52:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[14:52:34] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[14:54:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2140 to s4 primary T328022', diff saved to https://phabricator.wikimedia.org/P43495 and previous config saved to /var/cache/conftool/dbconfig/20230130-145421-root.json
[14:54:25] <stashbot>	 T328022: Switchover s4 master (db2110 -> db2140) - https://phabricator.wikimedia.org/T328022
[14:55:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2110 T328022', diff saved to https://phabricator.wikimedia.org/P43496 and previous config saved to /var/cache/conftool/dbconfig/20230130-145508-root.json
[14:56:54] <sbailey>	 It is working in testwiki, new errors are being recorded :-)
[14:57:23] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:57:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 (10cmooney) p:05Triage→03Low
[14:57:48] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10herron) 05In progress→03Resolved Hi @Abhas, the requested access has been provisioned and will fully propagate across the fleet within 30 minutes.  A...
[14:58:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43497 and previous config saved to /var/cache/conftool/dbconfig/20230130-145759-root.json
[14:58:22] <wikibugs>	 (03PS1) 10Cathal Mooney: Expose additional link information to Homer templates in wmf-netbox.py [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313)
[14:58:36] <Lucas_WMDE>	 sbailey: yay \o/
[14:59:45] <wikibugs>	 (03PS1) 10EoghanGaffney: Send rsyslog output for vrts apache logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/884909 (https://phabricator.wikimedia.org/T321759)
[14:59:58] <sbailey>	 :-), looking at Quarry now for group0
[15:00:24] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to 'cn=nda or cn=wmf' for ekalkst - https://phabricator.wikimedia.org/T328145 (10herron) p:05Triage→03Medium
[15:00:59] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:01:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[15:01:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[15:01:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43498 and previous config saved to /var/cache/conftool/dbconfig/20230130-150132-ladsgroup.json
[15:01:37] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[15:01:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Send rsyslog output for vrts apache logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/884909 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney)
[15:01:49] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.289 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:03:49] <wikibugs>	 (03PS2) 10EoghanGaffney: Send rsyslog output for vrts apache logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/884909 (https://phabricator.wikimedia.org/T321759)
[15:04:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10ssingh) Thanks @jbond for the patch and help! I can confirm that:  ` sudo cookbook -vvvv  -c /hom...
[15:04:48] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3051.esams.wmnet with reason: host reimage
[15:07:56] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3051.esams.wmnet with reason: host reimage
[15:08:05] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:12:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43499 and previous config saved to /var/cache/conftool/dbconfig/20230130-151228-ladsgroup.json
[15:12:32] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[15:13:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43500 and previous config saved to /var/cache/conftool/dbconfig/20230130-151304-root.json
[15:13:09] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm, but a second review from observability would be great :)" [puppet] - 10https://gerrit.wikimedia.org/r/884909 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney)
[15:13:59] <marostegui>	 !log Retrospective: Starting s4 codfw failover from db2110 to db2140 - T328022
[15:14:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:03] <stashbot>	 T328022: Switchover s4 master (db2110 -> db2140) - https://phabricator.wikimedia.org/T328022
[15:16:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff)
[15:16:43] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:19:11] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:19:39] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui)
[15:22:41] <wikibugs>	 (03PS3) 10Jbond: redfish: remove dell specific name from Redfish class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749
[15:22:43] <wikibugs>	 (03PS3) 10Jbond: redfish: store all manager info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757
[15:23:19] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:25:22] <wikibugs>	 (03CR) 10Jelto: "one question in line" [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm)
[15:26:32] <wikibugs>	 (03PS3) 10JMeybohm: Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664)
[15:26:34] <wikibugs>	 (03PS1) 10JMeybohm: Drop profile::ci::kubernetes_config [puppet] - 10https://gerrit.wikimedia.org/r/884915
[15:26:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm)
[15:27:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P43501 and previous config saved to /var/cache/conftool/dbconfig/20230130-152734-ladsgroup.json
[15:28:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43502 and previous config saved to /var/cache/conftool/dbconfig/20230130-152809-root.json
[15:29:47] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2029.codfw.wmnet with OS bullseye
[15:29:55] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2029.codfw.wmnet with OS bullseye
[15:29:55] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service Slyngshede In setup https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:29:57] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service Slyngshede In setup https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:20] <wikibugs>	 (03PS4) 10JMeybohm: Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664)
[15:31:11] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:45] <wikibugs>	 (03CR) 10JMeybohm: Switch the active staging cluster to codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm)
[15:31:52] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3051.esams.wmnet with OS bullseye
[15:31:58] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp3051.esams.wmnet with OS bullseye completed: - cp3051 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[15:32:52] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh)
[15:33:39] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39320/console" [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm)
[15:34:53] <wikibugs>	 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10colewhite)
[15:35:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/884909 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney)
[15:36:19] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:38:03] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:38:31] <wikibugs>	 (03PS4) 10Jbond: redfish: remove dell specific name from Redfish class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749
[15:41:35] <wikibugs>	 (03CR) 10Jelto: "I found one more kubestagemaster.svc.eqiad.wmnet in releases configuration:" [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm)
[15:42:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P43503 and previous config saved to /var/cache/conftool/dbconfig/20230130-154241-ladsgroup.json
[15:43:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43504 and previous config saved to /var/cache/conftool/dbconfig/20230130-154314-root.json
[15:43:17] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:13] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "Ya weird that it doesn't work!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884896 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[15:46:12] <wikibugs>	 (03PS5) 10JMeybohm: Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664)
[15:46:31] <wikibugs>	 (03CR) 10JMeybohm: Switch the active staging cluster to codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm)
[15:47:49] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39321/console" [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm)
[15:48:40] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2029.codfw.wmnet with reason: host reimage
[15:50:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:51:19] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2029.codfw.wmnet with reason: host reimage
[15:52:07] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:53:52] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: feat: add json payload capability [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280)
[15:54:06] <wikibugs>	 (03PS5) 10Jbond: redfish: remove dell specific name from Redfish class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749
[15:54:08] <wikibugs>	 (03PS4) 10Jbond: redfish: store all manager info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757
[15:54:10] <wikibugs>	 (03PS1) 10Jbond: redfish: fix generation test [software/spicerack] - 10https://gerrit.wikimedia.org/r/884921
[15:54:26] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5026.eqsin.wmnet with OS bullseye
[15:54:36] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5026.eqsin.wmnet with OS bullseye
[15:55:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] feat: add json payload capability [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos)
[15:55:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:57:25] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:38] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "+1 but one Q/naming nit." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey)
[15:57:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] redfish: fix generation test [software/spicerack] - 10https://gerrit.wikimedia.org/r/884921 (owner: 10Jbond)
[15:57:45] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "No worries if not." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey)
[15:57:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43505 and previous config saved to /var/cache/conftool/dbconfig/20230130-155747-ladsgroup.json
[15:57:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[15:57:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[15:57:52] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[15:57:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[15:57:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[15:57:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] redfish: store all manager info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond)
[15:58:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43506 and previous config saved to /var/cache/conftool/dbconfig/20230130-155802-ladsgroup.json
[15:58:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43507 and previous config saved to /var/cache/conftool/dbconfig/20230130-155819-root.json
[15:59:25] <wikibugs>	 (03PS1) 10Andrew Bogott: Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155)
[15:59:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3050.esams.wmnet
[15:59:58] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp3050.esams.wmnet
[16:01:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott)
[16:03:27] <moritzm>	 !log upgrading idp-test to latest Java security update
[16:03:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:47] <wikibugs>	 (03PS2) 10Andrew Bogott: Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155)
[16:03:54] <sukhe>	 !log racreset cp3050.esams.wmnet: firmware cookbook iDRAC upgrade test
[16:03:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:01] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Machine-Learning-Team, 10Patch-For-Review: httpbb with HTTP POSTs and json payload - https://phabricator.wikimedia.org/T328280 (10isarantopoulos) In the patch above I convert the dictionary passed in `form_body` field to json if there is the header `Content-Type...
[16:05:00] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5026.eqsin.wmnet with OS bullseye
[16:05:10] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5026.eqsin.wmnet with OS bullseye executed with errors: - cp5026 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[16:05:33] <wikibugs>	 (03CR) 10Elukey: wmf-config: add new revision-score streams for EventGate main (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey)
[16:05:42] <icinga-wm>	 RECOVERY - snapshot of s3 in codfw on backupmon1001 is OK: Last snapshot for s3 at codfw (db2139) taken on 2023-01-30 12:16:40 (1170 GiB, +0.7 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[16:05:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott)
[16:06:14] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5026.eqsin.wmnet with OS bullseye
[16:06:21] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5026.eqsin.wmnet with OS bullseye
[16:08:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43508 and previous config saved to /var/cache/conftool/dbconfig/20230130-160829-ladsgroup.json
[16:08:35] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[16:10:00] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3050.esams.wmnet,service=cdn
[16:10:00] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3050.esams.wmnet,service=ats-be
[16:10:19] <wikibugs>	 (03CR) 10Elukey: feat: add json payload capability (031 comment) [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos)
[16:10:35] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3050.esams.wmnet
[16:10:47] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp3050.esams.wmnet
[16:11:05] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3050.esams.wmnet
[16:13:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43509 and previous config saved to /var/cache/conftool/dbconfig/20230130-161324-root.json
[16:15:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:16:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[16:16:44] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2029.codfw.wmnet with OS bullseye
[16:16:50] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2029.codfw.wmnet with OS bullseye completed: - cp2029 (**WARN**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[16:17:04] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp3050.esams.wmnet
[16:17:34] <wikibugs>	 (PS2) Ilias Sarantopoulos: feat: add json payload capability [software/httpbb] - https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280)
[16:18:13] <wikibugs>	 (CR) Ottomata: [C: +1] wmf-config: add new revision-score streams for EventGate main (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: Elukey)
[16:19:17] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (MoritzMuehlenhoff)
[16:21:25] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3050.esams.wmnet with OS bullseye
[16:21:32] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp3050.esams.wmnet with OS bullseye
[16:22:45] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=cdn
[16:22:45] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=ats-be
[16:22:45] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3051.esams.wmnet,service=cdn
[16:22:46] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3051.esams.wmnet,service=ats-be
[16:23:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P43510 and previous config saved to /var/cache/conftool/dbconfig/20230130-162336-ladsgroup.json
[16:24:44] <wikibugs>	 (PS2) Elukey: wmf-config: add new revision-score streams for EventGate main [mediawiki-config] - https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768)
[16:24:52] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1084.eqiad.wmnet
[16:25:21] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4043.ulsfo.wmnet with OS bullseye
[16:25:27] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4043.ulsfo.wmnet with OS bullseye
[16:25:30] <wikibugs>	 (CR) Ilias Sarantopoulos: feat: add json payload capability (1 comment) [software/httpbb] - https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: Ilias Sarantopoulos)
[16:25:35] <wikibugs>	 (CR) Elukey: wmf-config: add new revision-score streams for EventGate main (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: Elukey)
[16:25:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:25:50] <wikibugs>	 (CR) Clément Goubert: [C: +1] httpbb: Enable --retry_on_timeout so intermittent latency doesn't alert [puppet] - https://gerrit.wikimedia.org/r/884388 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[16:26:52] <wikibugs>	 (PS1) Btullis: Revert "Increase the presto cluster size to 15 hosts again" [puppet] - https://gerrit.wikimedia.org/r/884928
[16:27:30] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[16:29:32] <wikibugs>	 SRE, Infrastructure Security, Infrastructure-Foundations: Updated Java security policy in OpenJDK 11.0.18 - https://phabricator.wikimedia.org/T328331 (MoritzMuehlenhoff)
[16:30:04] <jouncebot>	 jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1630).
[16:30:14] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1084.eqiad.wmnet
[16:30:25] <wikibugs>	 (CR) Btullis: [C: +2] Revert "Increase the presto cluster size to 15 hosts again" [puppet] - https://gerrit.wikimedia.org/r/884928 (owner: Btullis)
[16:30:38] <wikibugs>	 (CR) Ottomata: [C: +1] flink(-operator): Update to JRE 11.0.16 [docker-images/production-images] - https://gerrit.wikimedia.org/r/884351 (owner: JMeybohm)
[16:30:56] <wikibugs>	 (CR) Ottomata: [C: +1] "Either Brian or I will build these and deploy soon." [docker-images/production-images] - https://gerrit.wikimedia.org/r/884351 (owner: JMeybohm)
[16:31:08] <wikibugs>	 SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (MPhamWMF)
[16:31:21] <wikibugs>	 SRE, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (MPhamWMF)
[16:35:22] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4043.ulsfo.wmnet with OS bullseye
[16:35:27] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4043.ulsfo.wmnet with OS bullseye executed with errors: - cp4043 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[16:35:41] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4043.ulsfo.wmnet with OS bullseye
[16:35:48] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4043.ulsfo.wmnet with OS bullseye
[16:37:04] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:38:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P43511 and previous config saved to /var/cache/conftool/dbconfig/20230130-163842-ladsgroup.json
[16:38:59] <wikibugs>	 (CR) Elukey: feat: add json payload capability (1 comment) [software/httpbb] - https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: Ilias Sarantopoulos)
[16:39:37] <wikibugs>	 (PS6) Jbond: redfish: Move delli specific functionality to dell class [software/spicerack] - https://gerrit.wikimedia.org/r/836749
[16:39:39] <wikibugs>	 (PS5) Jbond: redfish: store all manager info for later use [software/spicerack] - https://gerrit.wikimedia.org/r/836757
[16:39:53] <wikibugs>	 (Abandoned) Jbond: redfish: fix generation test [software/spicerack] - https://gerrit.wikimedia.org/r/884921 (owner: Jbond)
[16:40:42] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:41:06] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (bd808) #Toolhub does not have a working Kubernetes deployment outside of eqiad ({T288685}). Who should I work with to try and preve...
[16:41:36] <wikibugs>	 (CR) Ottomata: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: Elukey)
[16:42:34] <wikibugs>	 (CR) CI reject: [V: -1] redfish: store all manager info for later use [software/spicerack] - https://gerrit.wikimedia.org/r/836757 (owner: Jbond)
[16:43:29] <wikibugs>	 (CR) CI reject: [V: -1] redfish: Move delli specific functionality to dell class [software/spicerack] - https://gerrit.wikimedia.org/r/836749 (owner: Jbond)
[16:44:42] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3050.esams.wmnet with reason: host reimage
[16:44:45] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage
[16:44:48] <wikibugs>	 (CR) Ilias Sarantopoulos: feat: add json payload capability (1 comment) [software/httpbb] - https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: Ilias Sarantopoulos)
[16:46:03] <wikibugs>	 (CR) Klausman: [C: +1] admin_ng: add SANs to the inference endpoints for mlserve staging [deployment-charts] - https://gerrit.wikimedia.org/r/883964 (https://phabricator.wikimedia.org/T327302) (owner: Elukey)
[16:46:53] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 (cmooney)
[16:46:59] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (cmooney)
[16:48:01] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3050.esams.wmnet with reason: host reimage
[16:48:54] <wikibugs>	 SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (LSobanski)
[16:50:04] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage
[16:51:20] <wikibugs>	 (PS7) Jbond: redfish: Move dell specific functionality to dell class [software/spicerack] - https://gerrit.wikimedia.org/r/836749
[16:52:32] <wikibugs>	 (CR) Andrew Bogott: "I will fix the linter issue but here's pcc results:" [puppet] - https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: Andrew Bogott)
[16:53:36] <wikibugs>	 (PS3) Andrew Bogott: Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155)
[16:53:42] <wikibugs>	 SRE, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Jelto)
[16:53:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43512 and previous config saved to /var/cache/conftool/dbconfig/20230130-165348-ladsgroup.json
[16:53:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[16:53:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[16:53:53] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[16:53:58] <wikibugs>	 (CR) CI reject: [V: -1] Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: Andrew Bogott)
[16:53:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43513 and previous config saved to /var/cache/conftool/dbconfig/20230130-165359-ladsgroup.json
[16:54:30] <wikibugs>	 (CR) Arturo Borrero Gonzalez: Rabbitmq: use OpenStack bpo packages for rabbit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: Andrew Bogott)
[16:54:47] <wikibugs>	 (CR) CI reject: [V: -1] redfish: Move dell specific functionality to dell class [software/spicerack] - https://gerrit.wikimedia.org/r/836749 (owner: Jbond)
[16:54:51] <wikibugs>	 SRE, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Jelto)
[16:56:09] <wikibugs>	 SRE-OnFire, Discovery-Search (Current work), Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (MPhamWMF)
[16:56:20] <wikibugs>	 SRE, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (LSobanski)
[16:56:32] <wikibugs>	 (PS8) Jbond: redfish: Move dell specific functionality to dell class [software/spicerack] - https://gerrit.wikimedia.org/r/836749
[16:56:36] <wikibugs>	 (PS6) Jbond: redfish: store all manager info for later use [software/spicerack] - https://gerrit.wikimedia.org/r/836757
[16:56:37] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4043.ulsfo.wmnet with reason: host reimage
[16:56:40] <wikibugs>	 SRE-OnFire, Discovery-Search (Current work), Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (Gehel)
[16:59:23] <wikibugs>	 (PS7) Jbond: redfish: store all OOB info for later use [software/spicerack] - https://gerrit.wikimedia.org/r/836757
[16:59:26] <wikibugs>	 (CR) CI reject: [V: -1] redfish: store all OOB info for later use [software/spicerack] - https://gerrit.wikimedia.org/r/836757 (owner: Jbond)
[16:59:40] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4043.ulsfo.wmnet with reason: host reimage
[16:59:44] <wikibugs>	 (CR) CI reject: [V: -1] redfish: Move dell specific functionality to dell class [software/spicerack] - https://gerrit.wikimedia.org/r/836749 (owner: Jbond)
[17:02:50] <wikibugs>	 SRE, Infrastructure Security, Infrastructure-Foundations: Updated Java security policy in OpenJDK 11.0.18 - https://phabricator.wikimedia.org/T328331 (MoritzMuehlenhoff)
[17:03:33] <wikibugs>	 (CR) CI reject: [V: -1] redfish: store all OOB info for later use [software/spicerack] - https://gerrit.wikimedia.org/r/836757 (owner: Jbond)
[17:04:09] <wikibugs>	 (PS8) Jbond: redfish: store all OOB info for later use [software/spicerack] - https://gerrit.wikimedia.org/r/836757
[17:04:25] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service Slyngshede In setup. Downtimed https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:04:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43514 and previous config saved to /var/cache/conftool/dbconfig/20230130-170437-ladsgroup.json
[17:04:43] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[17:06:56] <wikibugs>	 (PS4) Andrew Bogott: Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155)
[17:07:48] <wikibugs>	 (CR) CI reject: [V: -1] redfish: store all OOB info for later use [software/spicerack] - https://gerrit.wikimedia.org/r/836757 (owner: Jbond)
[17:08:18] <wikibugs>	 (CR) Bking: [C: +1] flink(-operator): Update to JRE 11.0.16 [docker-images/production-images] - https://gerrit.wikimedia.org/r/884351 (owner: JMeybohm)
[17:09:38] <wikibugs>	 (PS2) Bking: flink(-operator): Update to JRE 11.0.16 [docker-images/production-images] - https://gerrit.wikimedia.org/r/884351 (owner: JMeybohm)
[17:10:04] <wikibugs>	 (CR) Bking: [V: +2] flink(-operator): Update to JRE 11.0.16 [docker-images/production-images] - https://gerrit.wikimedia.org/r/884351 (owner: JMeybohm)
[17:10:07] <wikibugs>	 (CR) Bking: [V: +2 C: +2] flink(-operator): Update to JRE 11.0.16 [docker-images/production-images] - https://gerrit.wikimedia.org/r/884351 (owner: JMeybohm)
[17:10:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:10:54] <wikibugs>	 (CR) Ebernhardson: Create scap deployment source for search airflow v2 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/883678 (https://phabricator.wikimedia.org/T327970) (owner: Ebernhardson)
[17:11:51] <wikibugs>	 (PS5) Andrew Bogott: Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155)
[17:11:53] <wikibugs>	 (CR) Jbond: redfish: Move dell specific functionality to dell class (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/836749 (owner: Jbond)
[17:12:13] <wikibugs>	 (PS1) Hokwelum: The rsync module have been changed from download.kiwix.org to wmf.download.kiwix.org, See phab ticket for more information [puppet] - https://gerrit.wikimedia.org/r/884965
[17:12:25] <wikibugs>	 SRE, Traffic-Icebox: varnish warnings: Invalid conf pair: lg_dirty_mult/lg_chunk - https://phabricator.wikimedia.org/T253379 (BCornwall) Open→Resolved a:BCornwall This has already been removed on 2022-11-11 via:  Commit: 9943816a2ee487128f77c18cd2b104ebe1c0cd50 Change-Id: Ib55afb0acc28eab197c...
[17:12:34] <wikibugs>	 (CR) CI reject: [V: -1] The rsync module have been changed from download.kiwix.org to wmf.download.kiwix.org, See phab ticket for more information [puppet] - https://gerrit.wikimedia.org/r/884965 (owner: Hokwelum)
[17:12:42] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3050.esams.wmnet with OS bullseye
[17:12:47] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp3050.esams.wmnet with OS bullseye completed: - cp3050 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[17:14:54] <wikibugs>	 (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/output/884922/39324/" [puppet] - https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: Andrew Bogott)
[17:15:01] <wikibugs>	 (PS2) Hokwelum: Change kiwix rsync module [puppet] - https://gerrit.wikimedia.org/r/884965
[17:15:21] <wikibugs>	 (CR) CI reject: [V: -1] Change kiwix rsync module [puppet] - https://gerrit.wikimedia.org/r/884965 (owner: Hokwelum)
[17:15:26] <wikibugs>	 SRE, PyBal, Traffic-Icebox: Add graceful-restart capability to PyBal - https://phabricator.wikimedia.org/T246788 (BCornwall) Given the intention of moving away from LVS, is this still a feature we want implemented? i.e. is it worth pursuing this when LVS may be replaced in a few years?
[17:19:08] <wikibugs>	 (PS3) Hokwelum: Change kiwix rsync module [puppet] - https://gerrit.wikimedia.org/r/884965 (https://phabricator.wikimedia.org/T260223)
[17:19:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P43515 and previous config saved to /var/cache/conftool/dbconfig/20230130-171944-ladsgroup.json
[17:20:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:21:50] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5026.eqsin.wmnet with OS bullseye
[17:21:56] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5026.eqsin.wmnet with OS bullseye completed: - cp5026 (**PASS**)   - Removed from Puppet and PuppetDB if present   -...
[17:22:04] <inflatador>	 !log bking@build2001 rebuilding docker images for 884351
[17:22:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:33] <wikibugs>	 (PS9) Jbond: redfish: Move dell specific functionality to dell class [software/spicerack] - https://gerrit.wikimedia.org/r/836749
[17:22:35] <wikibugs>	 (PS9) Jbond: redfish: store all OOB info for later use [software/spicerack] - https://gerrit.wikimedia.org/r/836757
[17:24:02] <inflatador>	 !log bking@build2001 rebuilding docker images for 884351 complete
[17:24:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:49] <wikibugs>	 (PS1) Ottomata: [WIP] Add dse-k8s-services/mediawiki-page-content-change-enrichment helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305)
[17:26:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[17:26:51] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[17:27:04] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4043.ulsfo.wmnet with OS bullseye
[17:27:10] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4043.ulsfo.wmnet with OS bullseye completed: - cp4043 (**WARN**)   - Removed from Puppet and PuppetDB if present   -...
[17:28:29] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[17:31:05] <wikibugs>	 (CR) Ottomata: Configure search platform airflow 2 instance (2 comments) [puppet] - https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) (owner: Ebernhardson)
[17:31:09] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4043.ulsfo.wmnet
[17:31:44] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[17:31:49] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4051.ulsfo.wmnet with OS bullseye
[17:32:03] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4051.ulsfo.wmnet with OS bullseye
[17:33:01] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:33:12] <wikibugs>	 (PS1) Legoktm: Support new style of table of contents [extensions/GlobalUserPage] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884929 (https://phabricator.wikimedia.org/T327942)
[17:33:45] <icinga-wm>	 PROBLEM - IPMI Sensor Status on mw2330 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:34:06] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3050.esams.wmnet,service=cdn
[17:34:07] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3050.esams.wmnet,service=ats-be
[17:34:42] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[17:34:44] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[17:34:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P43516 and previous config saved to /var/cache/conftool/dbconfig/20230130-173450-ladsgroup.json
[17:35:13] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[17:35:59] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:36:13] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:36:35] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5026.eqsin.wmnet,service=cdn
[17:36:36] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5026.eqsin.wmnet,service=ats-be
[17:40:47] <icinga-wm>	 PROBLEM - IPMI Sensor Status on mw2332 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:41:32] <wikibugs>	 (PS1) Jbond: redfish: add system_manager info [software/spicerack] - https://gerrit.wikimedia.org/r/884978
[17:43:13] <wikibugs>	 (CR) Jbond: "ready for review" [software/spicerack] - https://gerrit.wikimedia.org/r/836749 (owner: Jbond)
[17:43:25] <wikibugs>	 (CR) Jbond: "ready for review" [software/spicerack] - https://gerrit.wikimedia.org/r/836757 (owner: Jbond)
[17:43:26] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4051.ulsfo.wmnet with OS bullseye
[17:43:31] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4051.ulsfo.wmnet with OS bullseye executed with errors: - cp4051 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[17:43:53] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4051.ulsfo.wmnet with OS bullseye
[17:43:58] <wikibugs>	 (CR) Jbond: "ready for review" [software/spicerack] - https://gerrit.wikimedia.org/r/884978 (owner: Jbond)
[17:44:00] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4051.ulsfo.wmnet with OS bullseye
[17:45:10] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: Andrew Bogott)
[17:45:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:46:58] <wikibugs>	 (PS1) Jdlrobson: Fix grid blowout with limited width turned off [skins/Vector] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884930 (https://phabricator.wikimedia.org/T327423)
[17:49:21] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3052.esams.wmnet
[17:49:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43517 and previous config saved to /var/cache/conftool/dbconfig/20230130-174957-ladsgroup.json
[17:50:02] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[17:50:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:51:11] <icinga-wm>	 PROBLEM - IPMI Sensor Status on maps2009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:52:14] <sukhe>	 ^ seems to be codfw rack B6
[17:52:18] <wikibugs>	 (PS3) Urbanecm: [Growth] Remove wgGERecentChangesUnstarredMenteesFilterEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/884427
[17:52:24] <sukhe>	 maps2009, mw2330, etc.
[17:52:27] <wikibugs>	 (CR) Urbanecm: [C: +2] [Growth] Remove wgGERecentChangesUnstarredMenteesFilterEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/884427 (owner: Urbanecm)
[17:52:37] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/884427 (owner: Urbanecm)
[17:53:13] <wikibugs>	 (Merged) jenkins-bot: [Growth] Remove wgGERecentChangesUnstarredMenteesFilterEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/884427 (owner: Urbanecm)
[17:53:30] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:884427|[Growth] Remove wgGERecentChangesUnstarredMenteesFilterEnabled]]
[17:53:38] <RhinosF1>	 sukhe: probably worth mentioning in -dcops so everything else doesn’t drown it out
[17:53:55] <sukhe>	 RhinosF1: yeah, going to file a task, sometimes there are recoveries so was waiting a bit
[17:54:49] <RhinosF1>	 Cool :)
[17:56:44] <wikibugs>	 (CR) Andrew Bogott: [C: +1] Change kiwix rsync module [puppet] - https://gerrit.wikimedia.org/r/884965 (https://phabricator.wikimedia.org/T260223) (owner: Hokwelum)
[17:57:03] <wikibugs>	 SRE, ops-codfw, DC-Ops: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] - https://phabricator.wikimedia.org/T328343 (ssingh)
[17:57:13] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Change kiwix rsync module [puppet] - https://gerrit.wikimedia.org/r/884965 (https://phabricator.wikimedia.org/T260223) (owner: Hokwelum)
[17:57:20] <wikibugs>	 SRE, ops-codfw, DC-Ops: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] - https://phabricator.wikimedia.org/T328343 (ssingh) p:Triage→Medium
[17:57:21] <wikibugs>	 (CR) Hashar: [C: +1] jenkins: add hieradata config for Scap3-based deployments [puppet] - https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche)
[17:57:52] <wikibugs>	 SRE, serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (akosiaris) It is intentional indeed. `-devel` because obsolete. More information in T306996#7912881 and overall that task.
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1800)
[18:00:05] <jouncebot>	 ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1800)
[18:01:29] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:884427|[Growth] Remove wgGERecentChangesUnstarredMenteesFilterEnabled]] (duration: 07m 59s)
[18:04:19] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4051.ulsfo.wmnet with reason: host reimage
[18:05:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:06:16] <wikibugs>	 SRE, Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (BCornwall)
[18:06:54] <wikibugs>	 (PS1) Bking: flink-k8s-operator: bump internal version [deployment-charts] - https://gerrit.wikimedia.org/r/884983 (https://phabricator.wikimedia.org/T324576)
[18:07:26] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4051.ulsfo.wmnet with reason: host reimage
[18:07:29] <icinga-wm>	 PROBLEM - IPMI Sensor Status on mw2326 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:08:51] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp3052.esams.wmnet
[18:10:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3052.esams.wmnet with OS bullseye
[18:10:26] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye
[18:10:47] <icinga-wm>	 RECOVERY - IPMI Sensor Status on mw2332 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:13:08] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs 2023-01-31 - https://phabricator.wikimedia.org/T327404 (Papaul) Postponing the PDU maintenance for 2023-02-02 for possible bad weather in Dallas tomorrow.
[18:13:27] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs 2023-02-02 - https://phabricator.wikimedia.org/T327404 (Papaul)
[18:19:03] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3052.esams.wmnet with OS bullseye
[18:19:04] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:19:09] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye executed with errors: - cp3052 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[18:19:29] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3052.esams.wmnet with OS bullseye
[18:19:35] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye
[18:20:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:21:37] <icinga-wm>	 RECOVERY - IPMI Sensor Status on maps2009 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:22:59] <wikibugs>	 (CR) Hashar: [C: -1] "Configuration bits for the release Jenkins should be moved up to profile::releases::mediawiki  . And later on the CI Jenkins will have its" [puppet] - https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche)
[18:23:58] <wikibugs>	 (CR) Hashar: "From the parent change it should be done using hiera configuration by setting the `jenkins::use_scap3_deployment` flag in the `hiera/hosts" [puppet] - https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche)
[18:26:01] <icinga-wm>	 PROBLEM - IPMI Sensor Status on kubernetes2009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:27:21] <icinga-wm>	 PROBLEM - IPMI Sensor Status on mw2334 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:29:13] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database
[18:29:15] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database
[18:31:08] <icinga-wm>	 RECOVERY - snapshot of s2 in codfw on backupmon1001 is OK: Last snapshot for s2 at codfw (db2097) taken on 2023-01-30 17:17:18 (836 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[18:32:26] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[18:33:48] <icinga-wm>	 RECOVERY - IPMI Sensor Status on mw2330 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:34:20] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4051.ulsfo.wmnet with OS bullseye
[18:34:27] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4051.ulsfo.wmnet with OS bullseye completed: - cp4051 (**WARN**)   - Removed from Puppet and PuppetDB if present   -...
[18:37:24] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3052.esams.wmnet']
[18:37:37] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3052.esams.wmnet']
[18:37:52] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp3052.esams.wmnet with OS bullseye
[18:38:00] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye executed with errors: - cp3052 (**FAIL**)   - Removed from Puppet and PuppetDB if p...
[18:38:07] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3052.esams.wmnet with OS bullseye
[18:38:13] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye
[18:41:31] <wikibugs>	 (CR) Kosta Harlan: GrowthExperiments: Update campaign configuration (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T790650) (owner: Gergő Tisza)
[18:43:28] <wikibugs>	 (PS3) Ebernhardson: Configure search platform airflow 2 instance [puppet] - https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970)
[18:43:30] <wikibugs>	 (CR) Ebernhardson: Configure search platform airflow 2 instance (2 comments) [puppet] - https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) (owner: Ebernhardson)
[18:44:31] <wikibugs>	 (CR) Kosta Harlan: GrowthExperiments: Update campaign configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T790650) (owner: Gergő Tisza)
[18:44:33] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39326/console" [puppet] - https://gerrit.wikimedia.org/r/884881 (owner: Slyngshede)
[18:45:24] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3052.esams.wmnet with OS bullseye
[18:45:31] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye executed with errors: - cp3052 (**FAIL**)   - Removed from Puppet and PuppetDB if p...
[18:45:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3052.esams.wmnet']
[18:46:04] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp3052.esams.wmnet']
[18:46:08] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3052.esams.wmnet']
[18:46:39] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp3052.esams.wmnet']
[18:50:46] <wikibugs>	 (CR) RLazarus: [C: -1] "Thanks for the patch! We ought to support this in httpbb -- the only reason it's not there already is that we haven't needed it yet." [software/httpbb] - https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: Ilias Sarantopoulos)
[18:52:33] <wikibugs>	 (PS1) Jbond: redfish: add upload/update methods [software/spicerack] - https://gerrit.wikimedia.org/r/884989
[18:53:07] <wikibugs>	 (PS10) Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - https://gerrit.wikimedia.org/r/884881
[18:55:51] <wikibugs>	 (CR) CI reject: [V: -1] redfish: add upload/update methods [software/spicerack] - https://gerrit.wikimedia.org/r/884989 (owner: Jbond)
[18:56:40] <icinga-wm>	 RECOVERY - IPMI Sensor Status on kubernetes2009 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:56:56] <wikibugs>	 (CR) Ottomata: [C: +1] flink-k8s-operator: bump internal version [deployment-charts] - https://gerrit.wikimedia.org/r/884983 (https://phabricator.wikimedia.org/T324576) (owner: Bking)
[18:58:02] <icinga-wm>	 RECOVERY - IPMI Sensor Status on mw2334 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:00:26] <wikibugs>	 (Abandoned) Jforrester: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/869255 (owner: PipelineBot)
[19:00:33] <wikibugs>	 (Abandoned) Jforrester: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/869260 (owner: PipelineBot)
[19:00:52] <wikibugs>	 (Abandoned) Jforrester: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/826957 (owner: PipelineBot)
[19:01:00] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3052.esams.wmnet with OS bullseye
[19:01:03] <wikibugs>	 (PS2) Jforrester: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/872978 (owner: PipelineBot)
[19:01:06] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye
[19:05:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:10:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:15:37] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4051.ulsfo.wmnet
[19:15:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:16:36] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4044.ulsfo.wmnet with OS bullseye
[19:16:43] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[19:16:46] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4044.ulsfo.wmnet with OS bullseye
[19:18:10] <wikibugs>	 SRE, PyBal, Traffic-Icebox: Add graceful-restart capability to PyBal - https://phabricator.wikimedia.org/T246788 (ayounsi) It's fine to close this task as long as BFD and graceful-shutdown are on the roadmap for the new L4LB. Not directly related to LVS but the task description on {T328338} explains...
[19:19:07] <wikibugs>	 (CR) Dzahn: [C: +2] "per https://debmonitor.wikimedia.org/packages/atftpd it's only installed on install* machines and per sudo cumin 'C:role::installserver' '" [puppet] - https://gerrit.wikimedia.org/r/884310 (https://phabricator.wikimedia.org/T327867) (owner: Muehlenhoff)
[19:19:42] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:22:08] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3052.esams.wmnet with reason: host reimage
[19:25:21] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3052.esams.wmnet with reason: host reimage
[19:26:44] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4044.ulsfo.wmnet with OS bullseye
[19:26:50] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4044.ulsfo.wmnet with OS bullseye executed with errors: - cp4044 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[19:26:59] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4044.ulsfo.wmnet with OS bullseye
[19:27:05] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4044.ulsfo.wmnet with OS bullseye
[19:31:42] <wikibugs>	 SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (mpopov)
[19:32:22] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:32:32] <wikibugs>	 (PS1) Jbond: rotate-snmp: convert to cookbook classes and use secrets for passwords [cookbooks] - https://gerrit.wikimedia.org/r/884996
[19:33:27] <wikibugs>	 (CR) Bking: [C: +2] flink-k8s-operator: bump internal version [deployment-charts] - https://gerrit.wikimedia.org/r/884983 (https://phabricator.wikimedia.org/T324576) (owner: Bking)
[19:34:22] <wikibugs>	 SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (mpopov) I suppose we could also have aliases for the bastion hosts so instead of connecting to `bast3006` users can specify `bast-esams` (which would actually be a huge improvement) but geolocating w...
[19:34:31] <wikibugs>	 (CR) CI reject: [V: -1] rotate-snmp: convert to cookbook classes and use secrets for passwords [cookbooks] - https://gerrit.wikimedia.org/r/884996 (owner: Jbond)
[19:35:47] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[19:36:12] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[19:37:23] <wikibugs>	 SRE, Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (BCornwall)
[19:43:11] <wikibugs>	 (PS1) Jbond: reposync: switch from copy_tree to copytree [software/spicerack] - https://gerrit.wikimedia.org/r/884998
[19:43:55] <wikibugs>	 SRE, Prod-Kubernetes, PyBal, Traffic, serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (ayounsi)
[19:44:01] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops: Calico and BFD - https://phabricator.wikimedia.org/T328338 (ayounsi)
[19:44:19] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2033.codfw.wmnet with OS bullseye
[19:44:26] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2033.codfw.wmnet with OS bullseye
[19:45:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:47:14] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage
[19:48:26] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3052.esams.wmnet with OS bullseye
[19:48:36] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye completed: - cp3052 (**PASS**)   - Removed from Puppet and PuppetDB if present   -...
[19:50:18] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage
[19:50:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:51:34] <wikibugs>	 SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (mpopov)
[19:52:44] <wikibugs>	 SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (mpopov)
[19:53:03] <wikibugs>	 (PS9) Samtar: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[19:53:51] <wikibugs>	 (CR) Samtar: "(reset my CR, T310974#8368960 is stalling afaict)" [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[19:56:02] <wikibugs>	 (PS1) Zabe: slwiki: Raise AF emergency disable treshold+count [mediawiki-config] - https://gerrit.wikimedia.org/r/885002
[19:57:47] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3052.esams.wmnet,service=cdn
[19:57:47] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3052.esams.wmnet,service=ats-be
[19:58:11] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[19:58:50] <wikibugs>	 SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (RhinosF1) >>! In T328361#8571672, @mpopov wrote: > I suppose we could also have aliases for the bastion hosts so instead of connecting to `bast3006` users can specify `bast-esams` (which would actual...
[20:00:29] <wikibugs>	 (PS2) Zabe: slwiki: Raise AF emergency disable treshold+count [mediawiki-config] - https://gerrit.wikimedia.org/r/885002
[20:00:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:00:49] <wikibugs>	 (PS2) Ottomata: [WIP] Add dse-k8s-services/mediawiki-page-content-change-enrichment helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305)
[20:02:11] <wikibugs>	 (PS3) Ottomata: Add dse-k8s-services/mediawiki-page-content-change-enrichment helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305)
[20:02:13] <wikibugs>	 SRE, Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (BCornwall)
[20:03:21] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2033.codfw.wmnet with reason: host reimage
[20:03:30] <wikibugs>	 SRE, Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (BCornwall) @Vgutierrez I've confirmed the remaining services use TLSv1.2+ except for ldap-codfw1dev and ldap-labtest. I'm having a little trouble accessing those servers - are they still...
[20:05:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:06:21] <wikibugs>	 (CR) CI reject: [V: -1] Add dse-k8s-services/mediawiki-page-content-change-enrichment helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305) (owner: Ottomata)
[20:06:34] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2033.codfw.wmnet with reason: host reimage
[20:11:45] <wikibugs>	 (PS3) Urbanecm: slwiki: Raise AF emergency disable treshold+count [mediawiki-config] - https://gerrit.wikimedia.org/r/885002 (https://phabricator.wikimedia.org/T328366) (owner: Zabe)
[20:11:50] <wikibugs>	 (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/885002 (https://phabricator.wikimedia.org/T328366) (owner: Zabe)
[20:12:31] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4044.ulsfo.wmnet with OS bullseye
[20:12:37] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4044.ulsfo.wmnet with OS bullseye completed: - cp4044 (**PASS**)   - Removed from Puppet and PuppetDB if present   -...
[20:12:45] <wikibugs>	 SRE, Traffic-Icebox: Add more detailed instructions to the "sec-advice" page - https://phabricator.wikimedia.org/T241309 (BCornwall) Open→Declined As there's already a link to the browser recommendation wikitech page, there's no need to duplicate efforts.
[20:12:49] <wikibugs>	 SRE, Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (BCornwall) Open→In progress
[20:12:55] <wikibugs>	 SRE, Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (BCornwall) a:BCornwall
[20:13:55] <wikibugs>	 (CR) Zabe: [C: +2] slwiki: Raise AF emergency disable treshold+count [mediawiki-config] - https://gerrit.wikimedia.org/r/885002 (https://phabricator.wikimedia.org/T328366) (owner: Zabe)
[20:14:38] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4044.ulsfo.wmnet
[20:14:48] <wikibugs>	 (Merged) jenkins-bot: slwiki: Raise AF emergency disable treshold+count [mediawiki-config] - https://gerrit.wikimedia.org/r/885002 (https://phabricator.wikimedia.org/T328366) (owner: Zabe)
[20:15:38] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[20:15:40] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye
[20:15:46] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4052.ulsfo.wmnet with OS bullseye
[20:16:01] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:885002|slwiki: Raise AF emergency disable treshold+count (T328366)]]
[20:17:39] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:885002|slwiki: Raise AF emergency disable treshold+count (T328366)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:23:34] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:885002|slwiki: Raise AF emergency disable treshold+count (T328366)]] (duration: 07m 32s)
[20:25:57] <wikibugs>	 (PS1) Majavah: hieradata: drop ldap-labtest acme-chief cert [puppet] - https://gerrit.wikimedia.org/r/885026
[20:26:15] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2033.codfw.wmnet with OS bullseye
[20:26:21] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2033.codfw.wmnet with OS bullseye completed: - cp2033 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[20:26:42] <wikibugs>	 (PS3) Gergő Tisza: GrowthExperiments: Update campaign configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T790650)
[20:26:45] <wikibugs>	 (CR) Ottomata: [C: +2] "Merging to test deployment, skipping the helmfile lint error.  Something must be wrong with a .Values.kafka_brokers fixture for this helmf" [deployment-charts] - https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305) (owner: Ottomata)
[20:27:07] <wikibugs>	 (CR) CI reject: [V: -1] GrowthExperiments: Update campaign configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T790650) (owner: Gergő Tisza)
[20:29:11] <wikibugs>	 SRE, Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (taavi) >>! In T238518#8571817, @BCornwall wrote: > @Vgutierrez I've confirmed the remaining services use TLSv1.2+ except for ldap-codfw1dev and ldap-labtest. I'm having a little trouble...
[20:29:27] <wikibugs>	 (CR) CI reject: [V: -1] Add dse-k8s-services/mediawiki-page-content-change-enrichment helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305) (owner: Ottomata)
[20:30:33] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] Add dse-k8s-services/mediawiki-page-content-change-enrichment helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305) (owner: Ottomata)
[20:35:32] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bullseye
[20:35:38] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4052.ulsfo.wmnet with OS bullseye executed with errors: - cp4052 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[20:35:55] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye
[20:36:01] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4052.ulsfo.wmnet with OS bullseye
[20:36:13] <wikibugs>	 SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (mpopov) > You’d then get a scary warning about a key mismatch when the server was changed. >  > Surely, this host doesn’t exist anymore is a clearer error.  Oh you're right! That's a great point, tha...
[20:45:35] <wikibugs>	 (PS1) Ottomata: mediawiki-page-content-change-enrichment - bump image version to v1.0.3 [deployment-charts] - https://gerrit.wikimedia.org/r/885032 (https://phabricator.wikimedia.org/T325305)
[20:46:53] <wikibugs>	 (PS2) Ottomata: mediawiki-page-content-change-enrichment - bump image version to v1.0.3 [deployment-charts] - https://gerrit.wikimedia.org/r/885032 (https://phabricator.wikimedia.org/T325305)
[20:49:05] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] mediawiki-page-content-change-enrichment - bump image version to v1.0.3 [deployment-charts] - https://gerrit.wikimedia.org/r/885032 (https://phabricator.wikimedia.org/T325305) (owner: Ottomata)
[20:50:42] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[20:51:25] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[20:55:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:56:36] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[20:59:41] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T2100).
[21:00:05] <jouncebot>	 tgr, musikanimal, legoktm, jdlrobson, and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:12] <musikanimal>	 o/
[21:00:20] <urbanecm>	 o/
[21:00:23] <urbanecm>	 i can deploy today
[21:00:40] <wikibugs>	 (PS3) Urbanecm: InitialiseSettings: add zhwiki to wgPageAssessmentsSubprojects [mediawiki-config] - https://gerrit.wikimedia.org/r/884474 (https://phabricator.wikimedia.org/T326387) (owner: MusikAnimal)
[21:00:44] <wikibugs>	 (CR) Urbanecm: [C: +2] InitialiseSettings: add zhwiki to wgPageAssessmentsSubprojects [mediawiki-config] - https://gerrit.wikimedia.org/r/884474 (https://phabricator.wikimedia.org/T326387) (owner: MusikAnimal)
[21:00:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:01:02] <tgr_>	 o/
[21:01:07] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/884474 (https://phabricator.wikimedia.org/T326387) (owner: MusikAnimal)
[21:01:24] <wikibugs>	 (Merged) jenkins-bot: InitialiseSettings: add zhwiki to wgPageAssessmentsSubprojects [mediawiki-config] - https://gerrit.wikimedia.org/r/884474 (https://phabricator.wikimedia.org/T326387) (owner: MusikAnimal)
[21:01:34] <urbanecm>	 hi tgr_, CI seems to dislike the campaigns patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/884153/). can you check please?
[21:01:38] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:884474|InitialiseSettings: add zhwiki to wgPageAssessmentsSubprojects (T326387)]]
[21:01:39] <legoktm>	 hi I'm here
[21:01:46] <stashbot>	 T326387: Deploy PageAssessments to Chinese Wikipedia - https://phabricator.wikimedia.org/T326387
[21:01:56] <urbanecm>	 hi legoktm
[21:02:10] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=cdn
[21:02:10] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=ats-be
[21:02:17] <Jdlrobson>	 present
[21:02:21] <wikibugs>	 (PS2) Urbanecm: Support new style of table of contents [extensions/GlobalUserPage] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884929 (https://phabricator.wikimedia.org/T327942) (owner: Legoktm)
[21:02:34] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[21:02:37] <wikibugs>	 (CR) Urbanecm: [C: +2] Support new style of table of contents [extensions/GlobalUserPage] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884929 (https://phabricator.wikimedia.org/T327942) (owner: Legoktm)
[21:02:49] <wikibugs>	 (CR) Urbanecm: [C: +2] Fix grid blowout with limited width turned off [skins/Vector] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884930 (https://phabricator.wikimedia.org/T327423) (owner: Jdlrobson)
[21:03:21] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and musikanimal: Backport for [[gerrit:884474|InitialiseSettings: add zhwiki to wgPageAssessmentsSubprojects (T326387)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[21:03:33] <urbanecm>	 musikanimal: pulled onto mwdebug1001, let me know how it works :)
[21:03:45] <musikanimal>	 will do! might take me a few mins, sorry I wasn't prepared
[21:03:55] <urbanecm>	 sure
[21:04:14] <wikibugs>	 (PS4) Gergő Tisza: GrowthExperiments: Update campaign configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T321370)
[21:04:29] <wikibugs>	 (Merged) jenkins-bot: Support new style of table of contents [extensions/GlobalUserPage] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884929 (https://phabricator.wikimedia.org/T327942) (owner: Legoktm)
[21:04:38] <tgr_>	 urbanecm: oops, sorry. last minute changes. Should be fixed now.
[21:04:47] <urbanecm>	 np, it happens.
[21:08:35] <musikanimal>	 so this has to only do with data storage. That must persist across prod and the debug servers, right? We don't have a separate db for mwdebug* ?
[21:08:43] <urbanecm>	 indeed
[21:09:57] <musikanimal>	 okay. Well my issue is I can't find an example... I need an article that uses a "task force" in addition to a WikiProject. Might take me another 5-10 minutes... unfortunately Whatlinkshere isn't giving good results because the task force template is also used by normal WikiProjects
[21:10:19] <urbanecm>	 what's a task force? maybe i can help?
[21:10:30] <wikibugs>	 (CR) Gergő Tisza: GrowthExperiments: Update campaign configuration (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T321370) (owner: Gergő Tisza)
[21:10:44] <wikibugs>	 (PS1) Ottomata: mw-page-content-change-enrichment - Disable kafka egress [deployment-charts] - https://gerrit.wikimedia.org/r/885035 (https://phabricator.wikimedia.org/T325305)
[21:11:05] <musikanimal>	 a task force is a subset of a WikiProject. https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Guide/Task_forces is the enwiki documentation, zhwiki does the same thing
[21:11:36] <musikanimal>	 I'm running some queries on prod to try to find an example. The WikiProject name would have a slash in it (as it is a "subproject" of the WikiProject, so to speak)
[21:11:51] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2019.codfw.wmnet: Replace Cassandra keys & certs - eevans@cumin1001
[21:12:16] <icinga-wm>	 PROBLEM - IPMI Sensor Status on mw2333 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[21:12:23] <urbanecm>	 ty
[21:13:40] <urbanecm>	 musikanimal: does https://zh.wikipedia.org/wiki/WikiProject:%E7%94%B5%E5%AD%90%E6%B8%B8%E6%88%8F/%E5%8F%B2%E5%85%8B%E5%A8%81%E5%B0%94%E8%89%BE%E5%B0%BC%E5%85%8B%E6%96%AF work?
[21:14:00] <musikanimal>	 possibly
[21:14:15] <musikanimal>	 https://zh.wikipedia.org/wiki/Talk:%E5%90%89%E6%99%AE%E6%81%B0%E5%85%8B%E6%B8%85%E7%9C%9F%E5%AF%BA for sure uses a task force, but I'm not seeing the flag being set in the db after I do a null edit :(
[21:14:41] <urbanecm>	 might be because it uses a job?
[21:14:54] <musikanimal>	 it usually populates immediately if I do a null edit
[21:14:58] <urbanecm>	 ah
[21:14:58] <musikanimal>	 but I could be testing this wrong
[21:15:19] <urbanecm>	 since the site doesn't break, i can sync and let you and your team figure out what's happening later?
[21:15:21] <musikanimal>	 so I'm like 99% sure the patch is harmless. Page assessments aren't even being used right now by anything
[21:15:23] <musikanimal>	 yeah
[21:15:27] <urbanecm>	 okay, syncing
[21:15:29] <musikanimal>	 let's just move forward :)
[21:15:31] <musikanimal>	 thanks!
[21:16:18] <wikibugs>	 (PS2) Urbanecm: Enable WelcomeSurvey at viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/883301 (https://phabricator.wikimedia.org/T325376) (owner: Gergő Tisza)
[21:16:24] <wikibugs>	 (CR) Urbanecm: [C: +2] Enable WelcomeSurvey at viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/883301 (https://phabricator.wikimedia.org/T325376) (owner: Gergő Tisza)
[21:17:11] <wikibugs>	 (Merged) jenkins-bot: Enable WelcomeSurvey at viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/883301 (https://phabricator.wikimedia.org/T325376) (owner: Gergő Tisza)
[21:17:13] <wikibugs>	 (CR) Ottomata: [C: +2] mw-page-content-change-enrichment - Disable kafka egress [deployment-charts] - https://gerrit.wikimedia.org/r/885035 (https://phabricator.wikimedia.org/T325305) (owner: Ottomata)
[21:17:40] <urbanecm>	 tgr_: should we copy the messages from https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/884152 on wiki? or is it not important enough for the initial rollout?
[21:17:51] <urbanecm>	 it's marked as soft depend-on, so that's why i'm asking
[21:17:51] <wikibugs>	 (Merged) jenkins-bot: Fix grid blowout with limited width turned off [skins/Vector] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884930 (https://phabricator.wikimedia.org/T327423) (owner: Jdlrobson)
[21:18:23] <tgr_>	 Not important, the real rollout is when something starts to reference this in landing page URLs.
[21:18:40] <urbanecm>	 ah, makes sense. i'll go ahead then.
[21:20:40] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[21:21:29] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:884474|InitialiseSettings: add zhwiki to wgPageAssessmentsSubprojects (T326387)]] (duration: 19m 51s)
[21:21:34] <stashbot>	 T326387: Deploy PageAssessments to Chinese Wikipedia - https://phabricator.wikimedia.org/T326387
[21:21:41] <urbanecm>	 musikanimal: all live now.
[21:21:48] <musikanimal>	 thank you!
[21:21:48] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:883301|Enable WelcomeSurvey at viwiki (T325376)]], [[gerrit:884930|Fix grid blowout with limited width turned off (T327423)]], [[gerrit:884929|Support new style of table of contents (T327942)]]
[21:21:50] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2019.codfw.wmnet: Replace Cassandra keys & certs - eevans@cumin1001
[21:21:56] <urbanecm>	 np
[21:21:57] <stashbot>	 T327423: Horizontal scrolling when content contains extra wide elements when limited width is disabled and page tools is enabled - https://phabricator.wikimedia.org/T327423
[21:21:57] <stashbot>	 T325376: Welcome survey: communication and deployment to Vietnamese Wikipedia - https://phabricator.wikimedia.org/T325376
[21:21:58] <stashbot>	 T327942: Table of contents displays wrong on global user pages on Vector 2022 - https://phabricator.wikimedia.org/T327942
[21:23:27] <logmsgbot>	 !log urbanecm@deploy1002 tgr and urbanecm and jdlrobson and legoktm: Backport for [[gerrit:883301|Enable WelcomeSurvey at viwiki (T325376)]], [[gerrit:884930|Fix grid blowout with limited width turned off (T327423)]], [[gerrit:884929|Support new style of table of contents (T327942)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[21:23:44] <legoktm>	 testing
[21:23:48] <urbanecm>	 thanks
[21:23:53] <urbanecm>	 tgr_: Jdlrobson: please test too ^^
[21:24:09] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2020.codfw.wmnet: Replace Cassandra keys & certs - eevans@cumin1001
[21:24:10] <tgr_>	 works
[21:24:13] <urbanecm>	 ty
[21:24:23] <Jdlrobson>	 urbanecm: looking!
[21:24:24] <legoktm>	 urbanecm: lgtm! thanks
[21:24:28] <urbanecm>	 thanks!
[21:24:38] <legoktm>	 (verified with https://test.wikipedia.org/wiki/User:Legoktm?useskin=vector-2022)
[21:25:08] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2030.codfw.wmnet with OS bullseye
[21:25:16] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2030.codfw.wmnet with OS bullseye
[21:25:31] <Jdlrobson>	 urbanecm: LGTM
[21:25:37] <urbanecm>	 arlolra: hi, are you around for your MW core / https://gerrit.wikimedia.org/r/c/884138/ backport? looks like a no-op just adding some profiling, but still wouldn't like to deploy it alone :))
[21:25:43] <arlolra>	 yup
[21:25:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:25:50] <urbanecm>	 thanks Jdlrobson, deploying
[21:25:59] <wikibugs>	 (CR) Urbanecm: [C: +2] Try to determine what's adding to Parsoid init times [core] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884138 (https://phabricator.wikimedia.org/T328201) (owner: Arlolra)
[21:26:22] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 227.24 ms
[21:26:38] <urbanecm>	 arlolra: will you want to test it at a debug server? or should i just sync?
[21:26:39] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS bullseye
[21:26:45] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4052.ulsfo.wmnet with OS bullseye completed: - cp4052 (**WARN**)   - Removed from Puppet and PuppetDB if present   -...
[21:26:54] <arlolra>	 urbanecm: I can try a quick test
[21:27:03] <urbanecm>	 okay, i'll ping you when ready
[21:27:15] <wikibugs>	 SRE-OnFire, Sustainability (Incident Followup): 2023-01-10 eqsin network outage - https://phabricator.wikimedia.org/T328354 (andrea.denisse)
[21:27:42] <wikibugs>	 (PS5) Urbanecm: GrowthExperiments: Update campaign configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T321370) (owner: Gergő Tisza)
[21:27:48] <wikibugs>	 (CR) Urbanecm: [C: +2] GrowthExperiments: Update campaign configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T321370) (owner: Gergő Tisza)
[21:29:01] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: Update campaign configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T321370) (owner: Gergő Tisza)
[21:30:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:31:41] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:883301|Enable WelcomeSurvey at viwiki (T325376)]], [[gerrit:884930|Fix grid blowout with limited width turned off (T327423)]], [[gerrit:884929|Support new style of table of contents (T327942)]] (duration: 09m 52s)
[21:31:48] <stashbot>	 T327423: Horizontal scrolling when content contains extra wide elements when limited width is disabled and page tools is enabled - https://phabricator.wikimedia.org/T327423
[21:31:49] <stashbot>	 T325376: Welcome survey: communication and deployment to Vietnamese Wikipedia - https://phabricator.wikimedia.org/T325376
[21:31:49] <stashbot>	 T327942: Table of contents displays wrong on global user pages on Vector 2022 - https://phabricator.wikimedia.org/T327942
[21:31:51] <urbanecm>	 legoktm: tgr_: Jdlrobson: all live :)
[21:32:09] <legoktm>	 perfect :D
[21:32:25] <tgr_>	 thx!
[21:32:32] <wikibugs>	 (CR) Cwhite: [C: +1] Send rsyslog output for vrts apache logs to kafka/logstash [puppet] - https://gerrit.wikimedia.org/r/884909 (https://phabricator.wikimedia.org/T321759) (owner: EoghanGaffney)
[21:33:34] <wikibugs>	 (CR) Cwhite: [C: +1] "Sounds good to me!" [alerts] - https://gerrit.wikimedia.org/r/884349 (https://phabricator.wikimedia.org/T202307) (owner: Herron)
[21:33:59] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet
[21:33:59] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:884153|GrowthExperiments: Update campaign configuration (T321370)]]
[21:34:05] <stashbot>	 T321370: Thank You Pages: custom account creation pages for sv, it, ja, fr, nl - https://phabricator.wikimedia.org/T321370
[21:34:18] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2020.codfw.wmnet: Replace Cassandra keys & certs - eevans@cumin1001
[21:34:35] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[21:34:47] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: Andrew Bogott)
[21:34:53] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[21:35:41] <logmsgbot>	 !log urbanecm@deploy1002 tgr and urbanecm: Backport for [[gerrit:884153|GrowthExperiments: Update campaign configuration (T321370)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[21:35:56] <urbanecm>	 tgr_: second patch's available for testing, can you check?
[21:36:41] <wikibugs>	 (PS1) Urbanecm: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/885008
[21:36:57] <tgr_>	 urbanecm: it works
[21:37:04] <urbanecm>	 great, syncing
[21:40:27] <wikibugs>	 (CR) Urbanecm: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/885008 (owner: Urbanecm)
[21:41:16] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/885008 (owner: Urbanecm)
[21:41:40] <Jdlrobson>	 Ack! thanks urbanecm !
[21:41:44] <urbanecm>	 no problem
[21:42:47] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:884153|GrowthExperiments: Update campaign configuration (T321370)]] (duration: 08m 47s)
[21:42:51] <stashbot>	 T321370: Thank You Pages: custom account creation pages for sv, it, ja, fr, nl - https://phabricator.wikimedia.org/T321370
[21:42:54] <urbanecm>	 tgr_: and live
[21:43:12] <tgr_>	 thanks!
[21:43:43] <wikibugs>	 (Merged) jenkins-bot: Try to determine what's adding to Parsoid init times [core] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884138 (https://phabricator.wikimedia.org/T328201) (owner: Arlolra)
[21:43:53] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2030.codfw.wmnet with reason: host reimage
[21:44:12] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 72 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:44:20] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:884138|Try to determine what's adding to Parsoid init times (T328201)]], [[gerrit:885008|Update interwiki cache]]
[21:44:26] <stashbot>	 T328201: Investigate increase in slow parses - https://phabricator.wikimedia.org/T328201
[21:46:03] <logmsgbot>	 !log urbanecm@deploy1002 arlolra and urbanecm: Backport for [[gerrit:884138|Try to determine what's adding to Parsoid init times (T328201)]], [[gerrit:885008|Update interwiki cache]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[21:46:13] <urbanecm>	 arlolra: your patch's at mwdebug1001, as promised :)
[21:46:24] <arlolra>	 alrighty
[21:47:05] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2030.codfw.wmnet with reason: host reimage
[21:48:13] <wikibugs>	 (CR) Cwhite: [V: -1 C: -1] rsyslog: allow subject name validation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[21:50:17] <wikibugs>	 SRE, Traffic, Patch-For-Review: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (Jcross) Hi @BBlack and @Vgutierrez - could you please provide an update or some guidance around your expected timeline for this? Please let us know if anything else is required on our end...
[21:50:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:50:55] <arlolra>	 urbanecm: ok, let's proceed
[21:50:59] <urbanecm>	 okay, doing
[21:51:51] <wikibugs>	 (PS2) Cwhite: role, profile: remove elasticsearch role and supporting profile [puppet] - https://gerrit.wikimedia.org/r/879889
[21:54:26] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:55:14] <wikibugs>	 (PS1) Dreamy Jazz: Disable write old for CheckUserLog reason field for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/885041 (https://phabricator.wikimedia.org/T233004)
[21:55:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:56:44] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:884138|Try to determine what's adding to Parsoid init times (T328201)]], [[gerrit:885008|Update interwiki cache]] (duration: 12m 24s)
[21:56:49] <stashbot>	 T328201: Investigate increase in slow parses - https://phabricator.wikimedia.org/T328201
[21:56:52] <urbanecm>	 arlolra: and, live
[21:56:58] <arlolra>	 thank you
[21:57:15] <urbanecm>	 no problem
[21:57:21] <urbanecm>	 i think we're done with the window
[22:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: Your horoscope predicts another unfortunate Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T2200).
[22:08:26] <wikibugs>	 (CR) RLazarus: [C: +2] httpbb: Enable --retry_on_timeout so intermittent latency doesn't alert [puppet] - https://gerrit.wikimedia.org/r/884388 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[22:11:18] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2030.codfw.wmnet with OS bullseye
[22:11:25] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2030.codfw.wmnet with OS bullseye completed: - cp2030 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[22:13:12] <icinga-wm>	 RECOVERY - IPMI Sensor Status on mw2333 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[22:16:44] <wikibugs>	 (CR) Cwhite: [C: +2] role, profile: remove elasticsearch role and supporting profile [puppet] - https://gerrit.wikimedia.org/r/879889 (owner: Cwhite)
[22:19:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:19:47] <wikibugs>	 SRE, Traffic, Data Pipelines (Sprint 07): Document Impact of Jan 8&9 Traffic Data Loss - https://phabricator.wikimedia.org/T326658 (odimitrijevic) Pinging @KOfori @BBlack. Please see question above.
[22:20:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:22:06] <icinga-wm>	 PROBLEM - IPMI Sensor Status on mw2329 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[22:22:42] <wikibugs>	 SRE, Traffic-Icebox: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (BCornwall) Open→Stalled @ayounsi which commit implemented this? I'm not seeing any reference anywhere
[22:25:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:32:26] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[22:35:30] <icinga-wm>	 PROBLEM - IPMI Sensor Status on mw2332 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[22:36:11] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=cdn
[22:36:12] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=ats-be
[22:36:44] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[22:37:37] <AaronSchulz>	 volans, marostegui: do you know why rpl_semi_sync_master_wait_no_slave is 0 ?
[22:38:42] <wikibugs>	 SRE, Traffic-Icebox: Make DNS operations resilient against predictable failures - https://phabricator.wikimedia.org/T239711 (BCornwall) Open→Stalled @BBlack This ticket is quite broad: Can we split any remaining actionable into sub-tickets? From what I'm understanding, new tickets could be:  * Re...
[22:38:57] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp3053.esams.wmnet with OS bullseye
[22:39:03] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp3053.esams.wmnet with OS bullseye
[22:45:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:50:00] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3053.esams.wmnet with OS bullseye
[22:50:06] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp3053.esams.wmnet with OS bullseye executed with errors: - cp3053 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[22:55:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:58:58] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet
[23:00:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:04:43] <wikibugs>	 SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (Dzahn) Here is a crude shell script from the past trying to solve this problem. No warranty :)  https://people.wikimedia.org/~dzahn/bastion.sh.txt
[23:06:14] <wikibugs>	 (PS1) Sbailey: Enable Linter write namespace, tag and template for group0 and group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/885046 (https://phabricator.wikimedia.org/T299612)
[23:07:05] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp3053.esams.wmnet with OS bullseye
[23:07:11] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp3053.esams.wmnet with OS bullseye
[23:09:04] <wikibugs>	 (CR) Sbailey: "Group 0 went smoothly, onto group 1" [mediawiki-config] - https://gerrit.wikimedia.org/r/885046 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey)
[23:10:54] <wikibugs>	 (PS1) Gergő Tisza: Document the '+' pattern for specifying wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/885048
[23:16:31] <urbanecm>	 jouncebot: nowandnext
[23:16:31] <jouncebot>	 For the next 0 hour(s) and 43 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T2200)
[23:16:32] <jouncebot>	 In 3 hour(s) and 43 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T0300)
[23:16:53] <wikibugs>	 (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/885048 (owner: Gergő Tisza)
[23:21:44] <wikibugs>	 (CR) Dzahn: [C: +2] etherpad: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884396 (https://phabricator.wikimedia.org/T327974) (owner: Dzahn)
[23:23:31] <wikibugs>	 (PS1) Dzahn: etherpad: fix typo in blackbox::check class parameter name [puppet] - https://gerrit.wikimedia.org/r/885050
[23:26:07] <wikibugs>	 (CR) Dzahn: [C: +2] etherpad: fix typo in blackbox::check class parameter name [puppet] - https://gerrit.wikimedia.org/r/885050 (owner: Dzahn)
[23:26:49] <wikibugs>	 (PS1) Dreamy Jazz: Remove redundant definition of wgCheckUserEnableSpecialInvestigate [mediawiki-config] - https://gerrit.wikimedia.org/r/885051
[23:29:49] <logmsgbot>	 !log brett@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp3053.esams.wmnet with OS bullseye
[23:29:55] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp3053.esams.wmnet with OS bullseye executed with errors: - cp3053 (**FAIL**)   - Removed from Puppet and PuppetDB if p...
[23:30:13] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp5027.eqsin.wmnet with OS bullseye
[23:30:21] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp5027.eqsin.wmnet with OS bullseye
[23:45:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:50:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable