[00:03:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:08:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:09:12] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [00:12:42] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [00:17:24] PROBLEM - MariaDB Replica Lag: s4 on db2099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1029.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:17:44] PROBLEM - MariaDB Replica Lag: s1 on db1140 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1055.92 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:18:28] PROBLEM - MariaDB Replica Lag: s3 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1099.86 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:41:26] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [00:43:10] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [02:02:56] RECOVERY - 
MariaDB Replica Lag: s1 on db1140 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:10:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:19:48] RECOVERY - MariaDB Replica Lag: s3 on db1102 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:20:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [02:38:36] RECOVERY - MariaDB Replica Lag: s4 on db2099 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:44:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:27:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [03:32:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:29:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [05:30:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [05:30:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [05:30:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [05:30:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T318605)', diff saved to https://phabricator.wikimedia.org/P43439 and previous config saved to /var/cache/conftool/dbconfig/20230130-053033-ladsgroup.json [05:30:37] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [06:07:13] (CR) Winston Sung: [C: +1] Update cxserver to 2023-01-23-123356-production [deployment-charts] - https://gerrit.wikimedia.org/r/882791 (https://phabricator.wikimedia.org/T129470) (owner: KartikMistry) [06:11:43] (CR) Kevin Bazira: [C: +1] wmf-config: add new revision-score streams for EventGate main [mediawiki-config] - https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: Elukey) [06:13:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [06:13:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [06:14:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P43440 and previous config saved to /var/cache/conftool/dbconfig/20230130-061401-ladsgroup.json [06:14:06] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [06:15:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [06:15:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [06:15:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2140 (T318605)', diff saved to https://phabricator.wikimedia.org/P43441 and previous config saved to /var/cache/conftool/dbconfig/20230130-061534-ladsgroup.json [06:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:32:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [06:34:03] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:34:24] !log dbmaint Schema change on s6 eqiad T328086 [06:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:28] T328086: Drop cul_user and 
cul_user_text from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328086 [06:35:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T328022 [06:35:31] T328022: Switchover s4 master (db2110 -> db2140) - https://phabricator.wikimedia.org/T328022 [06:36:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T328022 [06:36:27] Amir1: Any chances you can stop your maintenance on s4? [06:36:36] I need to switchover s4 codfw master for the switches upgrade [06:36:37] marostegui: only with bribe [06:36:49] Amir1: If you stop it I promise you I won't bring s4 down [06:36:55] is that good enough?? [06:36:56] sold [06:37:18] \o/ [06:37:29] Amir1: it shouldn't take long :) [06:37:45] let me know once done [06:37:48] will do [06:37:54] can I restart replication on db2140? [06:37:58] actually I'm running alter table on one of them, would that impact it? [06:38:08] will it take long? [06:38:16] let me take a look [06:38:33] is it running on db2140? [06:38:43] yeah, it's externallinks [06:38:54] yeah, so I need it to get finished before I can proceed [06:39:51] marostegui: it's an alter table, if you kill it it's fine, I'll restart it, just ping me once done. Would that be okay? 
[06:39:55] no no [06:39:57] it is fine [06:39:58] I can wait [06:40:11] let me see how long it'll take [06:41:13] !log dbmaint Schema change on s8 eqiad T328086 [06:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:17] I have some bad news, it's going to take 6 hours [06:41:17] T328086: Drop cul_user and cul_user_text from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328086 [06:41:25] Amir1: that is ok [06:43:05] !log dbmaint Schema change on s7 eqiad T328086 [06:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:37] !log dbmaint Schema change on s2 eqiad T328086 [06:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:22] (PS1) Marostegui: site.pp: Move db1195 to m2 [puppet] - https://gerrit.wikimedia.org/r/884722 (https://phabricator.wikimedia.org/T327995) [06:51:03] !log dbmaint Schema change on s5 eqiad T328086 [06:51:06] (CR) Marostegui: [C: +2] site.pp: Move db1195 to m2 [puppet] - https://gerrit.wikimedia.org/r/884722 (https://phabricator.wikimedia.org/T327995) (owner: Marostegui) [06:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:07] T328086: Drop cul_user and cul_user_text from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328086 [06:52:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P43443 and previous config saved to /var/cache/conftool/dbconfig/20230130-065247-ladsgroup.json [06:52:51] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [06:55:44] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:55:56] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: 
https://wikitech.wikimedia.org/wiki/HAProxy [06:56:07] ^ me [06:58:00] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:58:30] SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Marostegui) [06:58:34] SRE, ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T328135 (Marostegui) [06:58:50] SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Marostegui) The task gets generated fine, but still a bit unreadable as show on T328135 Leaving this task open until @MoritzMuehlenhoff takes a look... [06:59:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T318605)', diff saved to https://phabricator.wikimedia.org/P43444 and previous config saved to /var/cache/conftool/dbconfig/20230130-065943-ladsgroup.json [06:59:48] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:59:48] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [07:01:16] !log dbmaint Schema change on s4 eqiad T328086 [07:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:20] T328086: Drop cul_user and cul_user_text from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328086 [07:02:27] !log dbmaint Schema change on s1 eqiad T328086 [07:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:48] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:05:14] !log dbmaint Schema change on s3 eqiad T328086 [07:05:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:22] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:05:34] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:05:36] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:05:38] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:07:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P43445 and previous config saved to /var/cache/conftool/dbconfig/20230130-070753-ladsgroup.json [07:10:08] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:10:31] !log dbmaint Schema change on s8 eqiad T328236 [07:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:34] T328236: Add default value to cul_reason on WMF wikis - https://phabricator.wikimedia.org/T328236 [07:11:00] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:11:28] !log dbmaint Schema change on s5 eqiad T328236 [07:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:30] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:14:32] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:14:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved 
to https://phabricator.wikimedia.org/P43446 and previous config saved to /var/cache/conftool/dbconfig/20230130-071450-ladsgroup.json [07:16:55] !log dbmaint Schema change on s6 eqiad T328236 [07:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:59] T328236: Add default value to cul_reason on WMF wikis - https://phabricator.wikimedia.org/T328236 [07:17:52] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:17:57] !log dbmaint Schema change on s4 eqiad T328236 [07:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:47] !log dbmaint Schema change on s1 eqiad T328236 [07:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P43447 and previous config saved to /var/cache/conftool/dbconfig/20230130-072300-ladsgroup.json [07:25:26] !log dbmaint Schema change on s1 eqiad T328236 [07:25:28] !log dbmaint Schema change on s2 eqiad T328236 [07:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:32] T328236: Add default value to cul_reason on WMF wikis - https://phabricator.wikimedia.org/T328236 [07:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:29] !log dbmaint Schema change on s7 eqiad T328236 [07:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P43448 and previous config saved to /var/cache/conftool/dbconfig/20230130-072956-ladsgroup.json [07:32:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:38:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T318605)', diff saved to https://phabricator.wikimedia.org/P43449 and previous config saved to /var/cache/conftool/dbconfig/20230130-073806-ladsgroup.json [07:38:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [07:38:11] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [07:38:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [07:38:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T318605)', diff saved to https://phabricator.wikimedia.org/P43450 and previous config saved to /var/cache/conftool/dbconfig/20230130-073827-ladsgroup.json [07:45:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T318605)', diff saved to https://phabricator.wikimedia.org/P43451 and previous config saved to /var/cache/conftool/dbconfig/20230130-074502-ladsgroup.json [07:45:08] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [07:46:14] (CR) Kosta Harlan: [C: +1] Enable WelcomeSurvey at viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/883301 (https://phabricator.wikimedia.org/T325376) (owner: Gergő Tisza) [07:48:39] T327867!log installing install2004 [07:48:40] T327867: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 [07:50:15] !log installing install2004 T327867 [07:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:22] !log phedenskog@deploy1002 Started deploy [performance/navtiming@bfbd6d7]: (no justification provided) [07:54:28] !log phedenskog@deploy1002 Finished deploy 
[performance/navtiming@bfbd6d7]: (no justification provided) (duration: 00m 05s) [08:00:05] Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:00:15] no gerrit patches :) [08:00:46] * zabe is going to deploy a sec patch [08:01:04] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T318605)', diff saved to https://phabricator.wikimedia.org/P43452 and previous config saved to /var/cache/conftool/dbconfig/20230130-081011-ladsgroup.json [08:10:21] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [08:19:21] !log zabe: Deployed security patch for T278365 [08:23:02] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:25:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P43454 and previous config saved to /var/cache/conftool/dbconfig/20230130-082517-ladsgroup.json [08:28:30] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:28:44] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:29:23] (PS1) Jcrespo: Add the "very_stale" HTML style as a red label [software/pampinus] - https://gerrit.wikimedia.org/r/884820 [08:30:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db2105.codfw.wmnet with reason: Maintenance [08:30:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [08:30:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P43455 and previous config saved to /var/cache/conftool/dbconfig/20230130-083034-ladsgroup.json [08:30:39] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [08:39:13] (CR) Ladsgroup: [C: -1] Enable Linter write namespace, tag and template from core, group0 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey) [08:40:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P43456 and previous config saved to /var/cache/conftool/dbconfig/20230130-084024-ladsgroup.json [08:40:43] (CR) Marostegui: "I am trying to think a good way to deploy this safely. 
The change looks good, but maybe we should disable puppet on all databases, get thi" [puppet] - https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup) [08:42:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P43457 and previous config saved to /var/cache/conftool/dbconfig/20230130-084213-ladsgroup.json [08:42:17] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [08:48:46] !log installing install1004 T327867 [08:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:50] T327867: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 [08:51:06] (CR) Ladsgroup: mariadb: Centralize and change wikiadmin user grants (1 comment) [puppet] - https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup) [08:53:00] (CR) Marostegui: mariadb: Centralize and change wikiadmin user grants (1 comment) [puppet] - https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup) [08:55:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T318605)', diff saved to https://phabricator.wikimedia.org/P43458 and previous config saved to /var/cache/conftool/dbconfig/20230130-085530-ladsgroup.json [08:55:35] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [08:56:59] (CR) Giuseppe Lavagetto: [C: -1] "LGTM, minus the fact you also need to add the services to the allowed_listeners list below (see comment inline)" [puppet] - https://gerrit.wikimedia.org/r/838182 (https://phabricator.wikimedia.org/T143553) (owner: Ebernhardson) [08:57:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to 
https://phabricator.wikimedia.org/P43459 and previous config saved to /var/cache/conftool/dbconfig/20230130-085719-ladsgroup.json [08:57:52] (PS3) Giuseppe Lavagetto: docker_registry_ha: remove unused cache::nodes ref [puppet] - https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) (owner: BBlack) [09:12:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P43460 and previous config saved to /var/cache/conftool/dbconfig/20230130-091225-ladsgroup.json [09:17:35] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39297/console" [puppet] - https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) (owner: BBlack) [09:18:18] (CR) Clément Goubert: "LGTM, question inline," [deployment-charts] - https://gerrit.wikimedia.org/r/884360 (owner: Giuseppe Lavagetto) [09:19:14] (CR) Filippo Giunchedi: [C: +1] logstash: remove rate of ingestion percent change compared to yesterday alert [alerts] - https://gerrit.wikimedia.org/r/884349 (https://phabricator.wikimedia.org/T202307) (owner: Herron) [09:19:32] (CR) Giuseppe Lavagetto: [C: +1] docker_registry_ha: remove unused cache::nodes ref [puppet] - https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) (owner: BBlack) [09:19:49] (CR) Filippo Giunchedi: [C: +2] prometheus: remove recording rule for CPU benchmark. 
[puppet] - https://gerrit.wikimedia.org/r/881632 (https://phabricator.wikimedia.org/T321398) (owner: Phedenskog) [09:22:14] (PS2) Awight: Enable kartographer external data parse time fetch for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879559 (https://phabricator.wikimedia.org/T326317) (owner: Svantje Lilienthal) [09:23:18] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (Clement_Goubert) [09:23:31] (CR) Clément Goubert: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/883933 (https://phabricator.wikimedia.org/T328015) (owner: Clément Goubert) [09:25:01] (CR) Giuseppe Lavagetto: [C: +1] "Hi, and thanks for taking this on. In fact, we have a task dedicated to this problem, https://phabricator.wikimedia.org/T292818, and I'm w" [deployment-charts] - https://gerrit.wikimedia.org/r/865654 (owner: Awight) [09:25:28] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (Clement_Goubert) a:Clement_Goubert→herron Handing off to this week's Clinic Duty SRE. @herron you should just have to merge the CR and create the... 
[09:27:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P43461 and previous config saved to /var/cache/conftool/dbconfig/20230130-092732-ladsgroup.json [09:27:33] (CR) WMDE-Fisch: [C: +1] Enable kartographer external data parse time fetch for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879559 (https://phabricator.wikimedia.org/T326317) (owner: Svantje Lilienthal) [09:27:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [09:27:37] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [09:27:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [09:28:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43462 and previous config saved to /var/cache/conftool/dbconfig/20230130-092804-ladsgroup.json [09:28:39] (CR) Filippo Giunchedi: "A recommendation inline re: readability, otherwise LGTM" [alerts] - https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) (owner: Giuseppe Lavagetto) [09:29:01] !log disabling puppet on dbprov2004 to reorganize partitions T327155 [09:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:04] T327155: Setup dbprov1004 an dbprov2004 as an expansion of the dbprov (database provisioning) cluster, in preparation of binlog backups backup implementation - https://phabricator.wikimedia.org/T327155 [09:32:26] (CR) Filippo Giunchedi: [C: +1] sre-mediawiki: port the other prometheus-based alerts [alerts] - https://gerrit.wikimedia.org/r/883950 (owner: Giuseppe Lavagetto) [09:38:11] (CR) Klausman: [C: +1] Add new fake pems for the mlserve's pki intermediates 
[labs/private] - https://gerrit.wikimedia.org/r/883632 (https://phabricator.wikimedia.org/T327767) (owner: Elukey) [09:38:17] (PS3) Awight: Enable kartographer external data parse time fetch for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879559 (https://phabricator.wikimedia.org/T326317) (owner: Svantje Lilienthal) [09:39:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43463 and previous config saved to /var/cache/conftool/dbconfig/20230130-093941-ladsgroup.json [09:39:46] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [09:39:51] (PS3) Btullis: Update the spark images to remove upstream support for the webhook [docker-images/production-images] - https://gerrit.wikimedia.org/r/864770 (https://phabricator.wikimedia.org/T318926) [09:40:04] (CR) Btullis: Update the spark images to remove upstream support for the webhook (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/864770 (https://phabricator.wikimedia.org/T318926) (owner: Btullis) [09:40:48] (CR) Btullis: [V: +2 C: +2] Update the spark images to remove upstream support for the webhook [docker-images/production-images] - https://gerrit.wikimedia.org/r/864770 (https://phabricator.wikimedia.org/T318926) (owner: Btullis) [09:44:51] (CR) Klausman: [C: +1] admin_ng: update ml-serve-codfw's settings for k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767) (owner: Elukey) [09:47:01] (CR) Klausman: role::ml_k8s::staging: upgrade cluster settings for k8s 1.23 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/884034 (https://phabricator.wikimedia.org/T327767) (owner: Elukey) [09:47:08] (CR) Giuseppe Lavagetto: [C: +1] "The change LGTM; the check_all_memcached.php nagios check is now unused and I'll remove 
it. I'll rebase this patch on top of that change a" [puppet] - https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) (owner: Reedy) [09:47:25] Is there anything usual happening with the SSH bastions? I'm having no luck logging in through bast1003 or bast3006. [09:47:42] SRE, Wikimedia-Mailing-lists: Upgrade lists.wikimedia.org to next Mailman/hyperkitty/postorius versions - https://phabricator.wikimedia.org/T286217 (Ladsgroup) Mailman really doesn't have an owner yet. Kunal and I did just the upgrade from 2 to 3 due its severe limitations and security issues. I have way... [09:48:11] *unusual [09:49:03] awight: not afaict, I'm using bast3006 and it works [09:49:32] awight: are you getting some error messages? what does your ssh config look like? [09:49:36] Thanks for the confirmation! Something's happening now but *very* slowly, it must be my network. [09:49:44] (and I'm finally in) [09:49:57] (CR) Giuseppe Lavagetto: [C: -1] "If we let puppet pick systemd as the agent, then we also need to probably change the restart command to be a systemd-driven reload." 
[puppet] - https://gerrit.wikimedia.org/r/869199 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff) [09:51:03] (CR) Giuseppe Lavagetto: [V: +2 C: +2] wmfdebug 0.0.6: Include the wmf-certificates package [docker-images/production-images] - https://gerrit.wikimedia.org/r/875439 (owner: Ahmon Dancy) [09:51:22] (CR) TrainBranchBot: [C: +2] "Approved by awight@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/879559 (https://phabricator.wikimedia.org/T326317) (owner: Svantje Lilienthal) [09:52:05] (Merged) jenkins-bot: Enable kartographer external data parse time fetch for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879559 (https://phabricator.wikimedia.org/T326317) (owner: Svantje Lilienthal) [09:52:13] !log push pfw policies - T328085 [09:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:23] !log awight@deploy1002 Started scap: Backport for [[gerrit:879559|Enable kartographer external data parse time fetch for all wikis (T326317)]] [09:52:26] T326317: Deploy geoshape expansion to wikis - https://phabricator.wikimedia.org/T326317 [09:54:05] !log awight@deploy1002 lilients and awight: Backport for [[gerrit:879559|Enable kartographer external data parse time fetch for all wikis (T326317)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [09:54:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P43464 and previous config saved to /var/cache/conftool/dbconfig/20230130-095447-ladsgroup.json [09:59:24] (CR) Klausman: [C: +1] wmf-config: add new revision-score streams for EventGate main [mediawiki-config] - https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: Elukey) [09:59:29] SRE, ops-codfw, cloud-services-team (Kanban): 
Rack new cloud-dev servers in same rack - https://phabricator.wikimedia.org/T267662 (ayounsi) [10:00:17] !log awight@deploy1002 Finished scap: Backport for [[gerrit:879559|Enable kartographer external data parse time fetch for all wikis (T326317)]] (duration: 07m 53s) [10:00:21] T326317: Deploy geoshape expansion to wikis - https://phabricator.wikimedia.org/T326317 [10:02:58] SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (ayounsi) I don't have any issue with that. Cabling is at your discretion. [10:04:06] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 49544 [10:08:29] ACKNOWLEDGEMENT - snapshot of s2 in codfw on backupmon1001 is CRITICAL: snapshot for s2 at codfw (db2097) taken more than 3 days ago: Most recent backup 2023-01-26 00:09:59 Jcrespo rerunning after refactoring issues - The acknowledgement expires at: 2023-01-31 07:05:37. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:08:29] ACKNOWLEDGEMENT - snapshot of s3 in codfw on backupmon1001 is CRITICAL: snapshot for s3 at codfw (db2139) taken more than 3 days ago: Most recent backup 2023-01-25 11:41:40 Jcrespo rerunning after refactoring issues - The acknowledgement expires at: 2023-01-31 07:05:37. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:09:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P43465 and previous config saved to /var/cache/conftool/dbconfig/20230130-100954-ladsgroup.json [10:11:16] (CR) Jelto: [C: +1] "lgtm, I'll import the actual secrets to private puppet in a moment" [labs/private] - https://gerrit.wikimedia.org/r/884325 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche) [10:11:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 49544 [10:15:30] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 14593 [10:16:31] (PS1) Jcrespo: bacula: Increase the maximum number of volumes on es-rw backups to 250 [puppet] - https://gerrit.wikimedia.org/r/884831 (https://phabricator.wikimedia.org/T313582) [10:16:42] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast4003.wikimedia.org [10:17:18] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts bast4003.wikimedia.org [10:17:21] (PS2) Jcrespo: bacula: Increase the maximum number of volumes on es-rw backups to 250 [puppet] - https://gerrit.wikimedia.org/r/884831 (https://phabricator.wikimedia.org/T313582) [10:17:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 14593 [10:19:38] (CR) Marostegui: [C: +1] bacula: Increase the maximum number of volumes on es-rw backups to 250 [puppet] - https://gerrit.wikimedia.org/r/884831 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo) [10:19:47] (PS1) Muehlenhoff: Remove previos bastions from bastion_host list [puppet] - https://gerrit.wikimedia.org/r/884832 (https://phabricator.wikimedia.org/T324974) [10:20:13] (CR) Jcrespo: [C: +2] bacula: Increase the 
maximum number of volumes on es-rw backups to 250 [puppet] - https://gerrit.wikimedia.org/r/884831 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo) [10:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:21:01] (PS1) FNegri: P:wmcs::services: simplify toolsdb pinning [puppet] - https://gerrit.wikimedia.org/r/884833 (https://phabricator.wikimedia.org/T328273) [10:21:24] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:24] (CR) Muehlenhoff: [C: +2] Remove previos bastions from bastion_host list [puppet] - https://gerrit.wikimedia.org/r/884832 (https://phabricator.wikimedia.org/T324974) (owner: Muehlenhoff) [10:25:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43466 and previous config saved to /var/cache/conftool/dbconfig/20230130-102500-ladsgroup.json [10:25:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [10:25:05] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [10:25:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [10:25:20] (CR) Jelto: [C: +2] "I added the files to private puppet" [labs/private] - https://gerrit.wikimedia.org/r/884325 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche) [10:26:35] (PS1) Muehlenhoff: Update Cumin alias for bastion canary [puppet] - https://gerrit.wikimedia.org/r/884834 
[10:27:18] (CR) Jcrespo: [C: +2] "update command needs to be run after deploy so it gets sent from the director to the storage daemons." [puppet] - https://gerrit.wikimedia.org/r/884831 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo) [10:28:50] (CR) Muehlenhoff: [C: +2] Update Cumin alias for bastion canary [puppet] - https://gerrit.wikimedia.org/r/884834 (owner: Muehlenhoff) [10:29:02] SRE, Platform Engineering, serviceops, Patch-For-Review, Performance-Team (Radar): Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 (jijiki) [10:29:46] SRE, serviceops, Patch-For-Review, Performance-Team (Radar): Phase out nutcracker from mediawiki servers - https://phabricator.wikimedia.org/T277183 (jijiki) Open→Resolved This work is done [10:30:50] (PS1) Ladsgroup: Enable write both for externallinks except s4, s7, s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/884837 (https://phabricator.wikimedia.org/T321662) [10:30:53] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast4003.wikimedia.org [10:31:39] (CR) Krinkle: Remove nutcracker from cloudweb hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) (owner: Majavah) [10:32:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [10:33:17] (CR) Krinkle: Remove nutcracker from cloudweb hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) (owner: Majavah) [10:33:54] (CR) Krinkle: Remove nutcracker from cloudweb hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) (owner: 
Majavah) [10:34:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:34:26] jouncebot: nowandnext [10:34:26] No deployments scheduled for the next 0 hour(s) and 25 minute(s) [10:34:26] In 0 hour(s) and 25 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1100) [10:34:33] good [10:34:38] (CR) Ladsgroup: [C: +2] Enable write both for externallinks except s4, s7, s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/884837 (https://phabricator.wikimedia.org/T321662) (owner: Ladsgroup) [10:35:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [10:35:22] (Merged) jenkins-bot: Enable write both for externallinks except s4, s7, s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/884837 (https://phabricator.wikimedia.org/T321662) (owner: Ladsgroup) [10:35:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [10:35:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43467 and previous config saved to /var/cache/conftool/dbconfig/20230130-103540-ladsgroup.json [10:35:44] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [10:36:07] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:36:18] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:884837|Enable write both for externallinks except s4, s7, s8 (T321662)]] [10:36:20] (CR) Jelto: [V: +2 C: +2] jenkins: add secrets for releasing instance [labs/private] 
- https://gerrit.wikimedia.org/r/884325 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche) [10:36:22] T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662 [10:37:56] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:884837|Enable write both for externallinks except s4, s7, s8 (T321662)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [10:38:20] (PS4) Thiemo Kreuz (WMDE): Remove some unused LAMP config [deployment-charts] - https://gerrit.wikimedia.org/r/865654 (https://phabricator.wikimedia.org/T292818) (owner: Awight) [10:38:57] (CR) Arturo Borrero Gonzalez: [C: +1] P:wmcs::services: simplify toolsdb pinning [puppet] - https://gerrit.wikimedia.org/r/884833 (https://phabricator.wikimedia.org/T328273) (owner: FNegri) [10:40:14] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast4003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:43:12] okay, tested in a wiki in s5, s6 and s1, the replication didn't break [10:43:16] moving forward [10:46:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast4003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:46:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:46:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast4003.wikimedia.org [10:46:59] SRE, Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast4003.wikimedia.org` - bast4003.wikimedia.org 
(**PASS**) - Downtimed host on Icinga/Alertmanager... [10:47:06] SRE, serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Lucas_Werkmeister_WMDE) >>! In T306995#8128358, @Michael wrote: > Glancing at the repository, I'm not sure if there is anything that you need from us to migrate `wikibase/termbox` on Wikidata... [10:47:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43468 and previous config saved to /var/cache/conftool/dbconfig/20230130-104735-ladsgroup.json [10:47:40] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [10:48:42] SRE, Data-Engineering, GrowthExperiments-ImpactModule, Growth-Team (Current Sprint), MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data ponit - https://phabricator.wikimedia.org/T324675 (kostajh) I'm writing th... 
[10:49:29] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:884837|Enable write both for externallinks except s4, s7, s8 (T321662)]] (duration: 13m 10s) [10:49:33] T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662 [10:51:48] SRE-tools, Infrastructure-Foundations, Machine-Learning-Team: httpbb with HTTP POSTs and json payload - https://phabricator.wikimedia.org/T328280 (elukey) [10:54:37] (CR) FNegri: [C: +2] P:wmcs::services: simplify toolsdb pinning [puppet] - https://gerrit.wikimedia.org/r/884833 (https://phabricator.wikimedia.org/T328273) (owner: FNegri) [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1100) [11:01:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install4002.wikimedia.org [11:01:34] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:02:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P43470 and previous config saved to /var/cache/conftool/dbconfig/20230130-110241-ladsgroup.json [11:03:39] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install4002.wikimedia.org - jmm@cumin2002" [11:04:35] (PS1) Muehlenhoff: Remove bast4003 [puppet] - https://gerrit.wikimedia.org/r/884845 (https://phabricator.wikimedia.org/T324974) [11:04:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install4002.wikimedia.org - jmm@cumin2002" [11:04:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:04:39] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install4002.wikimedia.org on all recursors [11:04:43] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install4002.wikimedia.org on all recursors [11:05:21] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host htmldumper1001.eqiad.wmnet [11:06:05] (PS3) Ladsgroup: mariadb: Centralize and change wikiadmin user grants [puppet] - https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802) [11:06:10] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Centralize and change wikiadmin user grants (1 comment) [puppet] - https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup) [11:09:07] !log phedenskog@deploy1002 Started deploy [performance/navtiming@4e5ff3f]: (no justification provided) [11:09:12] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@4e5ff3f]: (no justification provided) (duration: 00m 05s) [11:11:59] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host htmldumper1001.eqiad.wmnet [11:12:19] (CR) Muehlenhoff: [C: +2] Remove bast4003 [puppet] - https://gerrit.wikimedia.org/r/884845 (https://phabricator.wikimedia.org/T324974) (owner: Muehlenhoff) [11:17:21] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:24] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1003.eqiad.wmnet [11:17:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P43471 and previous config saved to /var/cache/conftool/dbconfig/20230130-111748-ladsgroup.json [11:19:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install4002.wikimedia.org [11:22:29] (PS1) Muehlenhoff: Add install4002 [puppet] - https://gerrit.wikimedia.org/r/884854 [11:24:33] !log 
ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1003.eqiad.wmnet [11:24:55] (CR) Muehlenhoff: [C: +2] Add install4002 [puppet] - https://gerrit.wikimedia.org/r/884854 (owner: Muehlenhoff) [11:27:33] (CR) Jbond: [C: +1] httpd: Let Puppet pick the init provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/869199 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff) [11:27:41] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 132, down: 43, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:28:10] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1004.eqiad.wmnet [11:31:13] (PS1) Ladsgroup: Drop unused wikiuser2 password [labs/private] - https://gerrit.wikimedia.org/r/884856 [11:32:06] (CR) Muehlenhoff: httpd: Let Puppet pick the init provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/869199 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff) [11:32:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43472 and previous config saved to /var/cache/conftool/dbconfig/20230130-113254-ladsgroup.json [11:32:55] (PS5) Superpes15: Create additional namespaces on shn.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) [11:32:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:32:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [11:32:59] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [11:33:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [11:33:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [11:33:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [11:33:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43473 and previous config saved to /var/cache/conftool/dbconfig/20230130-113319-ladsgroup.json [11:33:57] (CR) Marostegui: [C: +1] Drop unused wikiuser2 password [labs/private] - https://gerrit.wikimedia.org/r/884856 (owner: Ladsgroup) [11:35:03] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:12] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1004.eqiad.wmnet [11:35:28] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1005.eqiad.wmnet [11:35:38] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) (owner: Jelto) [11:40:13] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:31] !log dropping old wikiadmin user (T326802) [11:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:35] T326802: Rotate wikiuser and wikiadmin passwords - https://phabricator.wikimedia.org/T326802 [11:42:22] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1005.eqiad.wmnet [11:42:57] !log installing install4002 T327867 [11:43:00] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:01] T327867: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 [11:43:23] (CR) Ladsgroup: [V: +2 C: +2] Drop unused wikiuser2 password [labs/private] - https://gerrit.wikimedia.org/r/884856 (owner: Ladsgroup) [11:44:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43474 and previous config saved to /var/cache/conftool/dbconfig/20230130-114424-ladsgroup.json [11:44:29] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [11:48:05] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 42473 [11:49:42] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 42473 [11:49:42] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast6001.wikimedia.org [11:51:42] (CR) Jbond: phabricator: change phd home dir to /var/lib/phd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar) [11:54:01] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:56:00] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast6001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:57:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast6001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:57:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:57:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast6001.wikimedia.org [11:57:29] 
SRE, Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast6001.wikimedia.org` - bast6001.wikimedia.org (**PASS**) - Downtimed host on Icinga/Alertmanager... [11:58:21] (PS1) EoghanGaffney: Send vrts httpd logs to kafka for ingestion to logstash [puppet] - https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) [11:59:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P43475 and previous config saved to /var/cache/conftool/dbconfig/20230130-115930-ladsgroup.json [12:04:24] (PS2) EoghanGaffney: Send vrts httpd logs to kafka for ingestion to logstash [puppet] - https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) [12:04:59] SRE, Infrastructure-Foundations: Repurpose bast3004 as ganeti node - https://phabricator.wikimedia.org/T325361 (MoritzMuehlenhoff) p:Triage→Medium [12:06:48] Hi, there is a sec patch on wikifeeds waiting for deployment. Is it OK to deploy now ? 
[12:07:06] (Abandoned) Muehlenhoff: Move ssh-key-ldap-lookup to profile::base::labs [puppet] - https://gerrit.wikimedia.org/r/880883 (owner: Muehlenhoff) [12:07:13] (Abandoned) Muehlenhoff: Remove ldap::client::utils [puppet] - https://gerrit.wikimedia.org/r/880884 (owner: Muehlenhoff) [12:07:35] (CR) Muehlenhoff: "recheck" [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff) [12:11:34] jouncebot: now [12:11:34] No deployments scheduled for the next 1 hour(s) and 48 minute(s) [12:11:52] nemo-yiannis: I think you can probably deploy [12:12:21] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast3004.wikimedia.org [12:12:41] thanks Lucas_WMDE [12:12:59] (since I’m not seeing any objections ^^) [12:13:02] (CR) Jgiannelos: [C: +2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/884110 (owner: PipelineBot) [12:14:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P43476 and previous config saved to /var/cache/conftool/dbconfig/20230130-121437-ladsgroup.json [12:16:45] (CR) JMeybohm: [C: +2] kubernetes: Increase inotify limits [puppet] - https://gerrit.wikimedia.org/r/884305 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [12:18:17] (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/884110 (owner: PipelineBot) [12:18:43] (CR) Jbond: P:installserver::proxy: Add global whitelist and list mappings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond) [12:22:59] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:23:05] !log awight@deploy1002 Started deploy [kartotherian/deploy@42a07d3]: Disable traffic mirroring from codfw to eqiad [12:24:34] (PS1) Muehlenhoff: Remove obsolete template [puppet] - 
https://gerrit.wikimedia.org/r/884876 [12:25:13] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:25:49] !log awight@deploy1002 Finished deploy [kartotherian/deploy@42a07d3]: Disable traffic mirroring from codfw to eqiad (duration: 02m 44s) [12:26:48] SRE, CommRel-Specialists-Support, serviceops, Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Clement_Goubert) [12:27:19] PROBLEM - kartotherian endpoints health on maps2006 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [12:27:19] /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso [12:27:19] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:27:49] PROBLEM - kartotherian endpoints health on maps1007 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 
(expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [12:27:49] /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso [12:27:49] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:27:57] PROBLEM - kartotherian endpoints health on maps2009 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [12:27:57] /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso [12:27:57] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 
(expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:28:15] PROBLEM - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 4 [12:28:15] cting: 200): /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geol [12:28:15] eojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geopoint?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [12:28:33] PROBLEM - kartotherian endpoints health on maps1010 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [12:28:33] 
/private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso [12:28:33] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:28:33] PROBLEM - kartotherian endpoints health on maps1008 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [12:28:34] /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso [12:28:34] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:28:39] PROBLEM - kartotherian endpoints health on maps2010 is CRITICAL: /{src}/{z}/{x}/{y}.{format} 
(Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [12:28:39] /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso [12:28:39] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:28:45] (CR) Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib (1 comment) [puppet] - https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) (owner: Jbond) [12:29:26] PROBLEM - kartotherian endpoints health on maps1006 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [12:29:26] /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 
404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso [12:29:26] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:29:33] awight: ^ expected? [12:29:40] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: Add auto_prepend_file to PHP config_cli [puppet] - https://gerrit.wikimedia.org/r/880561 (https://phabricator.wikimedia.org/T253547) (owner: Krinkle) [12:29:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43477 and previous config saved to /var/cache/conftool/dbconfig/20230130-122943-ladsgroup.json [12:29:44] (CR) Jbond: "will abandon this change as it no longer seems necessary" [puppet] - https://gerrit.wikimedia.org/r/866594 (owner: Jbond) [12:29:44] PROBLEM - kartotherian endpoints health on maps2008 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [12:29:44] /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 
200): /geoline?getgeojso [12:29:44] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:29:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [12:29:48] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [12:29:48] (Abandoned) Jbond: blackbox::check::http: change expiry check value from days to seconds [puppet] - https://gerrit.wikimedia.org/r/866594 (owner: Jbond) [12:29:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [12:30:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43478 and previous config saved to /var/cache/conftool/dbconfig/20230130-123004-ladsgroup.json [12:30:18] (CR) Jbond: convrt-ssds: update cookbook to reimage ms-be with new partition schema (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: Jbond) [12:33:57] (CR) Arturo Borrero Gonzalez: [C: +2] P:toolforge::grid: install python3-mwparserfromhell [puppet] - https://gerrit.wikimedia.org/r/882220 (https://phabricator.wikimedia.org/T327600) (owner: Majavah) [12:35:14] PROBLEM - Kartotherian LVS eqiad on kartotherian.svc.eqiad.wmnet is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 4 [12:35:14] 
cting: 200): /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geol [12:35:14] eojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geopoint?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [12:35:14] PROBLEM - kartotherian endpoints health on maps2005 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{src}/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting [12:35:15] /private-info/info.json (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /geoline?getgeojso [12:35:15] {ids} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200): /geoshape?getgeojson=1&ids={ids} (Untitled test) is CRITICAL: Test Untitled 
test https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:41:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43479 and previous config saved to /var/cache/conftool/dbconfig/20230130-124142-ladsgroup.json [12:41:47] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [12:44:17] awight: Ping? Are the above Kartotherian errors related to Finished deploy [kartotherian/deploy@42a07d3]: Disable traffic mirroring from codfw to eqiad (duration: 02m 44s) ? [12:44:20] (PS4) Jbond: Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff) [12:44:41] claime: Yes definitely the fault of this deployment. I'll roll back now. [12:44:50] (CR) CI reject: [V: -1] Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff) [12:44:52] (PS4) Winston Sung: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035) [12:44:55] (PS5) Winston Sung: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035) [12:45:08] !log awight@deploy1002 Started deploy [kartotherian/deploy@5c58f8f]: Roll back kartotherian [12:46:35] !log awight@deploy1002 Finished deploy [kartotherian/deploy@5c58f8f]: Roll back kartotherian (duration: 01m 27s) [12:48:04] PROBLEM - kartotherian endpoints health on maps1006 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:48:04] PROBLEM - kartotherian 
endpoints health on maps1005 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:48:04] PROBLEM - kartotherian endpoints health on maps2007 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [12:48:52] claime: Rolled back now--are the alerts any healthier? [12:49:27] No more 301s but I'm still getting 400 for osm-intl/info.json [12:49:49] Which URL? I see https://maps.wikimedia.org/osm-intl/info.json is responding correctly from the browser. [12:52:11] (PS1) Awight: Revert "Enable kartographer external data parse time fetch for all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/884496 (https://phabricator.wikimedia.org/T323113) [12:53:41] (PS5) Jbond: Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff) [12:53:43] (PS1) Jbond: Puppetfile: fix whitespace issue on puppetfile [puppet] - https://gerrit.wikimedia.org/r/884880 [12:53:47] (CR) TrainBranchBot: [C: +2] "Approved by awight@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/884496 (https://phabricator.wikimedia.org/T323113) (owner: Awight) [12:54:00] (PS2) Awight: Revert "Enable kartographer external data parse time fetch for all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/884496 (https://phabricator.wikimedia.org/T323113) [12:54:06] awight: They're the service checks [12:54:07] (CR) Jbond: [C: +2] Puppetfile: fix whitespace issue on puppetfile [puppet] - https://gerrit.wikimedia.org/r/884880 (owner: Jbond) [12:54:09] (CR) TrainBranchBot: "Approved by awight@deploy1002 using scap 
backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/884496 (https://phabricator.wikimedia.org/T323113) (owner: Awight) [12:54:11] (PS3) EoghanGaffney: Send vrts httpd logs to kafka for ingestion to logstash [puppet] - https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) [12:54:54] (Merged) jenkins-bot: Revert "Enable kartographer external data parse time fetch for all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/884496 (https://phabricator.wikimedia.org/T323113) (owner: Awight) [12:55:12] !log awight@deploy1002 Started scap: Backport for [[gerrit:884496|Revert "Enable kartographer external data parse time fetch for all wikis" (T323113)]] [12:55:13] !log awight@deploy1002 scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki=aawiki --force-version "1.40.0-wmf.20" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.2oaGSEpQR1"' returned non-zero exit status 255. (duration: 00m 00s) [12:55:17] T323113: [Epic] Move geoshape expansion to Kartographer parse-time - https://phabricator.wikimedia.org/T323113 [12:55:19] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:55:43] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:55:58] (CR) WMDE-Fisch: [C: +1] "Just for bookkeeping." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/884496 (https://phabricator.wikimedia.org/T323113) (owner: Awight) [12:56:25] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [12:56:44] (CR) Jbond: "lgtm see question" [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff) [12:56:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P43481 and previous config saved to /var/cache/conftool/dbconfig/20230130-125648-ladsgroup.json [12:57:02] dancy: ^ odd scap backport issue in the logs above [12:57:08] (PS1) Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - https://gerrit.wikimedia.org/r/884881 [12:57:15] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [12:57:17] (CR) Jbond: redfish: store all manager info for later use (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/836757 (owner: Jbond) [12:57:43] (CR) Jbond: redfish: store all manager info for later use (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/836757 (owner: Jbond) [12:58:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:58:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:58:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast3004.wikimedia.org [12:58:16] SRE, Infrastructure-Foundations: Repurpose bast3004 as ganeti node - https://phabricator.wikimedia.org/T325361 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast3004.wikimedia.org` - bast3004.wikimedia.org (**WARN**) - Downtimed host on Icinga/Alertmanager... 
[12:58:18] SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) >>! In T327919#8565480, @Papaul wrote: > @cmooney this looks good to me just one question. Is... [12:59:28] _joe_: Krinkle: https://gerrit.wikimedia.org/r/c/operations/puppet/+/880561 might be related with the scap errors above [12:59:30] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:00:14] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:00:18] (PS1) Muehlenhoff: Remove bast3004/bast6001 [puppet] - https://gerrit.wikimedia.org/r/884882 (https://phabricator.wikimedia.org/T325361) [13:02:20] (CR) Muehlenhoff: [C: +2] Remove bast3004/bast6001 [puppet] - https://gerrit.wikimedia.org/r/884882 (https://phabricator.wikimedia.org/T325361) (owner: Muehlenhoff) [13:02:49] claime: If this is in the "3/3 HARD" failure state, does it require manual intervention to refresh the checks? [13:03:04] (PS2) Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - https://gerrit.wikimedia.org/r/884881 [13:03:12] awight: I've already forced them once, I'll retry [13:06:01] (Abandoned) Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - https://gerrit.wikimedia.org/r/884881 (owner: Slyngshede) [13:07:39] <_joe_> taavi: uh I did test a maintenance script :/ [13:08:50] <_joe_> sorry I was at lunch [13:09:02] <_joe_> awight: are you waiting for a fix? [13:09:53] _joe_: No worries. I have a half-deployed revert but mostly just happy to hear that my breakage is limited to maps. [13:10:09] I'm trying to debug rn but I can't find the Icinga checks [13:10:11] So don't rush, but please do ping me when the script is fixed. 
[13:10:27] I have one that uses service-checker-swagger [13:10:47] claime: In case it's helpful, https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=maps1005&service=kartotherian+endpoints+health [13:10:47] (PS1) Giuseppe Lavagetto: Revert "mediawiki: Add auto_prepend_file to PHP config_cli" [puppet] - https://gerrit.wikimedia.org/r/884497 [13:10:56] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Revert "mediawiki: Add auto_prepend_file to PHP config_cli" [puppet] - https://gerrit.wikimedia.org/r/884497 (owner: Giuseppe Lavagetto) [13:11:52] <_joe_> puppet is running [13:11:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P43482 and previous config saved to /var/cache/conftool/dbconfig/20230130-131155-ladsgroup.json [13:12:06] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:43] <_joe_> awight: green light! [13:13:04] Ah right, the revert didn't finish that's why we're broken [13:13:05] <_joe_> and apologies again for the breakage, I should've tested a deployment [13:13:06] ok [13:13:16] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) (owner: Jbond) [13:13:32] <_joe_> awight: you can re-deploy whenever you want [13:13:41] _joe_: thanks [13:14:03] claime: This revert is related, but in a different component. kartotherian should have recovered already. [13:14:37] !log awight@deploy1002 Started scap: Backport for [[gerrit:884496|Revert "Enable kartographer external data parse time fetch for all wikis" (T323113)]] [13:14:41] T323113: [Epic] Move geoshape expansion to Kartographer parse-time - https://phabricator.wikimedia.org/T323113 [13:14:52] (CR) Ayounsi: "LGTM! 
To be deployed after sending communication to sre-at-large@ (or public ops list) as it can impact people/apps flows (even though it " [puppet] - https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) (owner: Jbond) [13:15:03] (CR) Ayounsi: [C: +1] P:environment: roll out no proxy config to all hosts [puppet] - https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) (owner: Jbond) [13:16:12] !log awight@deploy1002 awight: Backport for [[gerrit:884496|Revert "Enable kartographer external data parse time fetch for all wikis" (T323113)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:17:24] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:44] [2023-01-30T12:47:32.575Z] ERROR: kartotherian/580 on maps1005: Unable to create source "osm"Source "osm-pbf" is disabled, possibly due to loading errors (err.levelPath=error) [13:17:46] Err: Source "osm-pbf" is disabled, possibly due to loading errors [13:17:50] On the maps servers [13:18:04] claime: Thanks, we think we found the cause [13:19:08] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:27] (PS2) JMeybohm: KubernetesAPIErrorRate: make alert v1.23 compatible [alerts] - https://gerrit.wikimedia.org/r/883539 (https://phabricator.wikimedia.org/T322919) (owner: Jelto) [13:20:51] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@5c58f8f] (codfw): Disable traffic mirroring from codfw to eqiad [13:21:14] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@5c58f8f] (codfw): Disable traffic mirroring from codfw to eqiad (duration: 00m 22s) [13:21:39] !log 
jgiannelos@deploy1002 Started deploy [kartotherian/deploy@5c58f8f] (eqiad): Disable traffic mirroring from codfw to eqiad [13:21:51] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@5c58f8f] (eqiad): Disable traffic mirroring from codfw to eqiad (duration: 00m 11s) [13:22:58] claime: This ^ deployment should fix the error you mention [13:22:59] (Restored) Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - https://gerrit.wikimedia.org/r/884881 (owner: Slyngshede) [13:23:11] !log awight@deploy1002 Finished scap: Backport for [[gerrit:884496|Revert "Enable kartographer external data parse time fetch for all wikis" (T323113)]] (duration: 08m 34s) [13:23:16] T323113: [Epic] Move geoshape expansion to Kartographer parse-time - https://phabricator.wikimedia.org/T323113 [13:23:27] (PS3) Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - https://gerrit.wikimedia.org/r/884881 [13:23:37] (CR) JMeybohm: [C: +2] "Nice, thanks! ❤️" [alerts] - https://gerrit.wikimedia.org/r/883539 (https://phabricator.wikimedia.org/T322919) (owner: Jelto) [13:24:05] nemo-yiannis: Does it need a service restart to take effect? 
[13:24:16] I think scap did it already [13:24:30] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:40] Not for maps1005 at least Loaded: loaded (/lib/systemd/system/kartotherian.service; enabled; vendor preset: enabled) [13:24:42] Active: active (running) since Mon 2023-01-30 12:46:06 UTC; 37min ago [13:24:47] (Merged) jenkins-bot: KubernetesAPIErrorRate: make alert v1.23 compatible [alerts] - https://gerrit.wikimedia.org/r/883539 (https://phabricator.wikimedia.org/T322919) (owner: Jelto) [13:25:05] true i just checked the actual config [13:25:07] let me think [13:25:07] (that's kartotherian.service) [13:27:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43483 and previous config saved to /var/cache/conftool/dbconfig/20230130-132701-ladsgroup.json [13:27:06] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [13:28:00] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@5c58f8f] (eqiad): Disable traffic mirroring from codfw to eqiad [13:28:02] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:13] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@5c58f8f] (eqiad): Disable traffic mirroring from codfw to eqiad (duration: 01m 13s) [13:29:20] !log bounce logstash on logstash1025 -- GC unhappy causing kafka lag [13:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:35] RECOVERY - kartotherian endpoints health on maps1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:29:48] !log jgiannelos@deploy1002 Started deploy 
[kartotherian/deploy@5c58f8f] (codfw): Disable traffic mirroring from codfw to eqiad [13:30:25] RECOVERY - Kartotherian LVS eqiad on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [13:30:25] RECOVERY - kartotherian endpoints health on maps1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:30:25] RECOVERY - kartotherian endpoints health on maps1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:30:25] RECOVERY - kartotherian endpoints health on maps1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:30:26] RECOVERY - kartotherian endpoints health on maps1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:30:27] RECOVERY - kartotherian endpoints health on maps2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:30:45] claime: Would you say this qualifies for an incident report? I'm happy to write one if so. 
[13:30:45] (CR) Muehlenhoff: [C: +2] Remove obsolete template [puppet] - https://gerrit.wikimedia.org/r/884876 (owner: Muehlenhoff) [13:31:11] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@5c58f8f] (codfw): Disable traffic mirroring from codfw to eqiad (duration: 01m 23s) [13:31:21] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:27] RECOVERY - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [13:31:27] RECOVERY - kartotherian endpoints health on maps2008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:31:27] RECOVERY - kartotherian endpoints health on maps2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:31:27] RECOVERY - kartotherian endpoints health on maps2007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:31:28] RECOVERY - kartotherian endpoints health on maps2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:31:29] RECOVERY - kartotherian endpoints health on maps2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:31:32] What was the actual production impact? 
(I am not very aware of what karthoterian does) [13:31:58] kartotherian* [13:32:41] <_joe_> uncached map tiles were unavailable to users [13:33:13] So I'd say yes, it's an incident, especially since it lasted ~1h [13:33:34] +1 yes I think a lot of people saw broken maps today [13:33:55] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39302/console" [puppet] - https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: EoghanGaffney) [13:34:19] https://grafana.wikimedia.org/goto/mqZt4GAVz?orgId=1 would agree [13:34:37] SRE, Infrastructure-Foundations: Repurpose bast3004 as ganeti node - https://phabricator.wikimedia.org/T325361 (MoritzMuehlenhoff) [13:35:37] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39303/console" [puppet] - https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: EoghanGaffney) [13:35:43] awight: 12:24:52 / 13:30 GMT for the incident window [13:36:12] ty! 
[13:36:26] * claime afk lunch [13:36:33] I'll start it a bit earlier just because there was a smaller thing I broke with a side deployment :-/ [13:36:38] ack [13:36:49] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39304/console" [puppet] - https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: EoghanGaffney) [13:37:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:38:35] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39305/console" [puppet] - https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: EoghanGaffney) [13:40:05] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:30] (PS4) Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - https://gerrit.wikimedia.org/r/884881 [13:41:51] (CR) CI reject: [V: -1] P:IDM Configure OIDC and LDAP. 
[puppet] - https://gerrit.wikimedia.org/r/884881 (owner: Slyngshede) [13:42:15] SRE, serviceops, wdwb-tech: Migrate wikibase/termbox to newer Node.js version - https://phabricator.wikimedia.org/T328295 (Lucas_Werkmeister_WMDE) [13:42:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:43:08] (PS1) Jaime Nuche: jenkins: use Scap3 deployment for releases instances [puppet] - https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) [13:43:33] SRE, serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Lucas_Werkmeister_WMDE) Hm, I notice there’s no corresponding `nodejs14-devel` image in the [Docker registry](https://docker-registry.wikimedia.org/), only `nodejs14-slim` (and same for `node... [13:43:37] (PS5) Slyngshede: P:IDM Configure OIDC and LDAP. 
[puppet] - 10https://gerrit.wikimedia.org/r/884881 [13:43:53] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:44:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:44:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P43484 and previous config saved to /var/cache/conftool/dbconfig/20230130-134406-ladsgroup.json [13:44:10] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [13:47:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:47:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:48:05] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:37] (03PS6) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 [13:48:39] (03CR) 10Jelto: [C: 03+1] "lgtm, left one little comment in the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [13:50:06] (03PS7) 10Slyngshede: P:IDM Configure OIDC and LDAP. 
[puppet] - 10https://gerrit.wikimedia.org/r/884881 [13:50:43] (03PS4) 10EoghanGaffney: Send vrts httpd logs to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) [13:51:42] (03PS8) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 [13:52:09] (03CR) 10Jelto: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [13:52:43] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39309/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [13:53:29] (03CR) 10EoghanGaffney: [C: 03+2] Send vrts httpd logs to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/884860 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [13:55:51] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:56:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:56:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43485 and previous config saved to /var/cache/conftool/dbconfig/20230130-135632-ladsgroup.json [13:56:36] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [13:56:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P43486 and previous config saved to /var/cache/conftool/dbconfig/20230130-135659-ladsgroup.json [13:57:40] 
(03PS1) 10Bartosz Dziewoński: Revert "Remove references to mediawiki.Uri" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884500 (https://phabricator.wikimedia.org/T328143) [13:58:07] (03PS1) 10Bartosz Dziewoński: Revert "Rewrite mw.libs.ve.getTargetDataFromHref with URL API" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884501 (https://phabricator.wikimedia.org/T328143) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1400). [14:00:05] sbailey and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] o/ [14:00:33] hi [14:00:33] I am here :-) [14:00:34] are we okay to deploy? I saw some alerts earlier [14:01:28] ok, looks like the karthoterian stuff is fine again [14:01:54] I’ll assume it’s okay to deploy unless someone tells me otherwise :) [14:02:32] let’s start with the reverts, those will take a while in CI [14:02:44] thanks [14:02:59] hm, they’re not merged on master yet [14:03:08] but most of the jobs in zuul are done and green [14:03:16] so let’s +2 them [14:03:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "Remove references to mediawiki.Uri" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884500 (https://phabricator.wikimedia.org/T328143) (owner: 10Bartosz Dziewoński) [14:03:29] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "Rewrite mw.libs.ve.getTargetDataFromHref with URL API" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884501 (https://phabricator.wikimedia.org/T328143) (owner: 10Bartosz Dziewoński) [14:04:17] Lucas_WMDE: +1 kartotherian should be stable sgain [14:04:25] ok 
thanks [14:04:35] Ah, looks like my patch 884090 (a config patch) is missing a default case. Seeing if I can fix that now. [14:05:33] sbailey: the variables also have an extra indentation level compared to their surroundings [14:06:11] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:22] Arrg, one thing to note also is the extension does provide a default of false, so is that also required here? [14:06:25] Lucas_WMDE: yeah sorry about that. i wasn't planning on doing this when i woke up today :) [14:06:35] ^^ [14:06:51] sbailey: probably better to be explicit and specify the default, I think [14:07:04] I assume this is a temporary setting that will be removed at some point anyway [14:07:06] they are just reverts though, so they should be safe (and it works locally) [14:07:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43487 and previous config saved to /var/cache/conftool/dbconfig/20230130-140710-ladsgroup.json [14:07:16] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [14:08:44] Should the default be false and the group0 true? 
[14:09:14] yeah, I think so [14:09:30] (03PS1) 10Jaime Nuche: jenkins: enable Scap3 deployment for active releases instance [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) [14:09:48] (03PS1) 10Elukey: ml-services: update docker images for revscoring model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/884892 (https://phabricator.wikimedia.org/T325528) [14:11:29] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:52] (03PS3) 10Sbailey: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) [14:12:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P43488 and previous config saved to /var/cache/conftool/dbconfig/20230130-141203-ladsgroup.json [14:12:32] (03CR) 10Sbailey: "Set default value and fixed indentation, stupid IDE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [14:13:13] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:30] I think I got 884090 fixed up [14:13:56] the indentation is still off, sorry [14:14:03] (ProbeDown) firing: (3) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:05] but the default looks good to me [14:14:47] (03CR) 10Jbond: C:varnish: Rate limit hotlinking dry-run (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [14:14:49] (03PS4) 10Sbailey: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) [14:15:06] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: update docker images for revscoring model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/884892 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [14:15:13] the last line of each block (“],”) shouldn’t be indented either [14:15:14] ok, now the indentation is fixed [14:15:49] (03PS5) 10Sbailey: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) [14:15:54] Whack a mole with the IDE [14:16:31] It is 6am my time so a bit fuzzy [14:16:35] ok, now it looks good to me [14:16:51] but MatmaRex’ backports are almost done in CI so let’s just do those first [14:17:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884500 (https://phabricator.wikimedia.org/T328143) (owner: 10Bartosz Dziewoński) [14:17:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884501 (https://phabricator.wikimedia.org/T328143) (owner: 10Bartosz Dziewoński) [14:17:13] sounds good [14:17:36] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T328022 [14:17:40] T328022: Switchover s4 master (db2110 -> db2140) - https://phabricator.wikimedia.org/T328022 [14:17:59] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: 
Primary switchover s4 T328022 [14:18:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2140 with weight 0 T328022', diff saved to https://phabricator.wikimedia.org/P43489 and previous config saved to /var/cache/conftool/dbconfig/20230130-141822-root.json [14:18:29] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T328022 [14:18:44] (03Merged) 10jenkins-bot: Revert "Remove references to mediawiki.Uri" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884500 (https://phabricator.wikimedia.org/T328143) (owner: 10Bartosz Dziewoński) [14:19:03] (ProbeDown) firing: (3) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T328022 [14:19:39] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2140 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/883519 (https://phabricator.wikimedia.org/T328022) (owner: 10Gerrit maintenance bot) [14:19:41] (03PS5) 10Jelto: sre.gitlab.upgrade: check current and target version [cookbooks] - 10https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) [14:19:44] (03PS6) 10Sbailey: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) [14:21:03] (03Merged) 10jenkins-bot: Revert "Rewrite mw.libs.ve.getTargetDataFromHref with URL API" 
[extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884501 (https://phabricator.wikimedia.org/T328143) (owner: 10Bartosz Dziewoński) [14:21:19] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:884500|Revert "Remove references to mediawiki.Uri" (T328143)]], [[gerrit:884501|Revert "Rewrite mw.libs.ve.getTargetDataFromHref with URL API" (T328143)]] [14:21:23] T328143: Machine Translation is broken when content has a link - https://phabricator.wikimedia.org/T328143 [14:22:03] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P43490 and previous config saved to /var/cache/conftool/dbconfig/20230130-142216-ladsgroup.json [14:22:57] !log lucaswerkmeister-wmde@deploy1002 matmarex and lucaswerkmeister-wmde: Backport for [[gerrit:884500|Revert "Remove references to mediawiki.Uri" (T328143)]], [[gerrit:884501|Revert "Rewrite mw.libs.ve.getTargetDataFromHref with URL API" (T328143)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:23:12] MatmaRex: can you test the reverts? [14:24:18] yeah [14:24:22] looking [14:24:26] ok [14:25:00] for a second I thought we might do a “can you?” “yes.” “will you?” “yes.” “…” routine :P [14:25:23] (03PS9) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 [14:25:25] bah i can't. this is silly [14:25:28] Access to XMLHttpRequest at 'https://cxserver.wikimedia.org/v2/page/fr/pl/Coquille_Saint-Jacques' from origin 'https://pl.wikipedia.org' has been blocked by CORS policy: Request header field x-wikimedia-debug is not allowed by Access-Control-Allow-Headers in preflight response. 
[14:25:38] blerghl [14:25:44] that’s annoying [14:25:46] very annoying [14:25:51] (03CR) 10CI reject: [V: 04-1] P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [14:25:53] there’s probably a phab task for it [14:26:03] i wonder if there's some easy way to hack around that [14:26:16] just disable CORS, what could go wrong? [14:26:30] (03CR) 10Elukey: [C: 03+2] ml-services: update docker images for revscoring model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/884892 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [14:26:49] there it is https://phabricator.wikimedia.org/T252826 [14:26:56] well if you disable it, then the thing won't work at all, it needs CORS to work [14:26:59] we can probably just sync this? it’s a revert, should be relatively safe… [14:27:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P43491 and previous config saved to /var/cache/conftool/dbconfig/20230130-142708-ladsgroup.json [14:27:14] i think it's safe, santhosh said it worked locally for him [14:27:23] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:28] (03CR) 10Kosta Harlan: [C: 03+1] wikireplicas: drop views for pagetriage_log [puppet] - 10https://gerrit.wikimedia.org/r/884454 (https://phabricator.wikimedia.org/T325519) (owner: 10Majavah) [14:27:30] oh nevermind, the task I linked is for rest / query service / whatever [14:27:33] but similar at least [14:27:38] ok, syncing [14:27:39] VE itself works fine on mwdebug [14:27:41] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: check current and target version [cookbooks] - 10https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:27:43] i only can't test 
CX [14:27:58] (03PS1) 10Btullis: Revert changes to the maven proxy configuration that didn't work [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884896 (https://phabricator.wikimedia.org/T318926) [14:28:31] i'll file a bug for this aterwards [14:28:45] thanks [14:29:44] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: check current and target version [cookbooks] - 10https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:29:48] (03PS2) 10Btullis: Revert changes to the maven proxy configuration that didn't work [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884896 (https://phabricator.wikimedia.org/T318926) [14:30:20] (03CR) 10Btullis: [V: 03+2 C: 03+2] Revert changes to the maven proxy configuration that didn't work [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884896 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [14:30:26] (03PS2) 10Matthias Mullie: Fix URL construction [extensions/SearchVue] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/875909 [14:30:32] (03PS1) 10JMeybohm: Switch staging.svc.eqiad.wmnet to point to codfw k8s [dns] - 10https://gerrit.wikimedia.org/r/884900 (https://phabricator.wikimedia.org/T327664) [14:32:01] (03Abandoned) 10Matthias Mullie: Fix URL construction [extensions/SearchVue] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/875909 (owner: 10Matthias Mullie) [14:32:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:33:27] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:884500|Revert "Remove references to mediawiki.Uri" (T328143)]], [[gerrit:884501|Revert "Rewrite mw.libs.ve.getTargetDataFromHref with URL API" (T328143)]] (duration: 12m 07s) [14:33:31] 
T328143: Machine Translation is broken when content has a link - https://phabricator.wikimedia.org/T328143 [14:34:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [14:34:33] :-) [14:34:48] (03Merged) 10jenkins-bot: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [14:35:03] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:884090|Enable Linter write namespace, tag and template from core, group0 (T299612)]] [14:35:05] (i filed https://phabricator.wikimedia.org/T328310) [14:35:07] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [14:36:07] thanks MatmaRex [14:36:09] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:09] Lucas_WMDE: my reverts are live, right? thanks [14:36:17] they should be, yeha [14:36:17] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q3): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (10fnegri) [14:36:18] *yeah [14:36:30] yeah. 
things are working as expected now [14:36:44] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and sbailey: Backport for [[gerrit:884090|Enable Linter write namespace, tag and template from core, group0 (T299612)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:36:53] (03PS1) 10JMeybohm: Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) [14:37:03] sbailey: can you test the change on mwdebug? [14:37:13] (03CR) 10CI reject: [V: 04-1] Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [14:37:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P43492 and previous config saved to /var/cache/conftool/dbconfig/20230130-143723-ladsgroup.json [14:38:01] <_joe_> jouncebot: now and next5 [14:38:01] For the next 0 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1400) [14:38:04] (03PS2) 10JMeybohm: Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) [14:38:12] This is run from a job queue, so no, but I can look at the database and see if the colummns are being populated [14:38:27] <_joe_> Lucas_WMDE: can you ping me when you're done, if you're doing the deployments? [14:38:27] ok, but probably only after it’s synced everywhere then [14:38:31] _joe_: sure [14:38:33] yes [14:38:37] <_joe_> thanks <3 [14:38:38] ok [14:38:41] (03CR) 10Jaime Nuche: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [14:38:44] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:38:48] (03CR) 10Jaime Nuche: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [14:38:53] I’ll just quickly check that nothing is broken [14:38:55] (03CR) 10Jaime Nuche: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [14:38:59] (03PS1) 10Filippo Giunchedi: thanos: split wdqs SLIs in a new group [puppet] - 10https://gerrit.wikimedia.org/r/884906 (https://phabricator.wikimedia.org/T328306) [14:39:30] hm, https://test.wikidata.org/wiki/Special:LintErrors?namespace=8&titlesearch=&exactmatch=1 gives me “namespace and/or pagename not found or malformed” o_O [14:39:37] but it’s the same with or without x-wikimedia-debug [14:39:56] ok, https://test.wikidata.org/wiki/Special:LintErrors?namespace=0&titlesearch=A&exactmatch= works [14:40:15] let’s sync then [14:41:14] yes, last time there were two straggling db's that missed the columns add, that was resolved. 
[14:41:14] Maybe there are more stragglers, though Amir did a report that verified all were updated [14:41:19] (03CR) 10Herron: [C: 03+1] "thanks, sgtm for near term fix" [puppet] - 10https://gerrit.wikimedia.org/r/884906 (https://phabricator.wikimedia.org/T328306) (owner: 10Filippo Giunchedi) [14:41:27] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P43493 and previous config saved to /var/cache/conftool/dbconfig/20230130-144213-ladsgroup.json [14:43:13] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3051.esams.wmnet with OS bullseye [14:43:18] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp3051.esams.wmnet with OS bullseye [14:43:22] If a db missed the addition of linter_namespace and linter_tag and linter_template, the code will error :-( [14:43:22] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39319/console" [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [14:44:07] * claime back [14:46:14] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:884090|Enable Linter write namespace, tag and template from core, group0 (T299612)]] (duration: 11m 11s) [14:46:19] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [14:46:44] _joe_: I’m done, assuming there are no errors from the last deployment [14:46:56] * Lucas_WMDE sees /tmp/joetest in logwatch ^^ [14:47:07] <_joe_> Lucas_WMDE: 
erheh ahem [14:47:08] <_joe_> cough [14:47:23] (03CR) 10Herron: [C: 03+2] admin: Add abhas to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/883933 (https://phabricator.wikimedia.org/T328015) (owner: 10Clément Goubert) [14:47:40] !log updating puppetdb 7 hosts to 7.12.1 T321783 [14:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:44] T321783: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 [14:50:58] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: split wdqs SLIs in a new group [puppet] - 10https://gerrit.wikimedia.org/r/884906 (https://phabricator.wikimedia.org/T328306) (owner: 10Filippo Giunchedi) [14:51:40] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "This patch was reverted because scap got the following error:" [puppet] - 10https://gerrit.wikimedia.org/r/880561 (https://phabricator.wikimedia.org/T253547) (owner: 10Krinkle) [14:52:03] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:26] testing linter errors on testwiki [14:52:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P43494 and previous config saved to /var/cache/conftool/dbconfig/20230130-145229-ladsgroup.json [14:52:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:52:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:52:34] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [14:54:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2140 to s4 primary T328022', diff saved to https://phabricator.wikimedia.org/P43495 and previous config saved to 
/var/cache/conftool/dbconfig/20230130-145421-root.json [14:54:25] T328022: Switchover s4 master (db2110 -> db2140) - https://phabricator.wikimedia.org/T328022 [14:55:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2110 T328022', diff saved to https://phabricator.wikimedia.org/P43496 and previous config saved to /var/cache/conftool/dbconfig/20230130-145508-root.json [14:56:54] It is working in testwiki, new errors are being recorded :-) [14:57:23] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:38] 10SRE, 10Infrastructure-Foundations, 10netops: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 (10cmooney) p:05Triage→03Low [14:57:48] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10herron) 05In progress→03Resolved Hi @Abhas, the requested access has been provisioned and will fully propagate across the fleet within 30 minutes. A... 
[14:58:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43497 and previous config saved to /var/cache/conftool/dbconfig/20230130-145759-root.json [14:58:22] (03PS1) 10Cathal Mooney: Expose additional link information to Homer templates in wmf-netbox.py [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313) [14:58:36] sbailey: yay \o/ [14:59:45] (03PS1) 10EoghanGaffney: Send rsyslog output for vrts apache logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/884909 (https://phabricator.wikimedia.org/T321759) [14:59:58] :-), looking at Quarry now for group0 [15:00:24] 10SRE, 10LDAP-Access-Requests: Grant Access to 'cn=nda or cn=wmf' for ekalkst - https://phabricator.wikimedia.org/T328145 (10herron) p:05Triage→03Medium [15:00:59] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:01:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:01:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:01:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43498 and previous config saved to /var/cache/conftool/dbconfig/20230130-150132-ladsgroup.json [15:01:37] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [15:01:40] (03CR) 10CI reject: [V: 04-1] Send rsyslog output for vrts apache logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/884909 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [15:01:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.289 
second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:03:49] (03PS2) 10EoghanGaffney: Send rsyslog output for vrts apache logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/884909 (https://phabricator.wikimedia.org/T321759) [15:04:17] 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10ssingh) Thanks @jbond for the patch and help! I can confirm that: ` sudo cookbook -vvvv -c /hom... [15:04:48] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3051.esams.wmnet with reason: host reimage [15:07:56] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3051.esams.wmnet with reason: host reimage [15:08:05] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43499 and previous config saved to /var/cache/conftool/dbconfig/20230130-151228-ladsgroup.json [15:12:32] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [15:13:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43500 and previous config saved to /var/cache/conftool/dbconfig/20230130-151304-root.json [15:13:09] (03CR) 10Jelto: [C: 03+1] "lgtm, but a second review from observability would be great :)" [puppet] - 10https://gerrit.wikimedia.org/r/884909 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [15:13:59] !log Retrospective: Starting s4 codfw failover from db2110 to db2140 - T328022 [15:14:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:03] T328022: Switchover s4 master (db2110 -> db2140) - https://phabricator.wikimedia.org/T328022 [15:16:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [15:16:43] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:11] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:39] 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) [15:22:41] (03PS3) 10Jbond: redfish: remove dell specific name from Redfish class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [15:22:43] (03PS3) 10Jbond: redfish: store all manager info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [15:23:19] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:22] (03CR) 10Jelto: "one question in line" [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [15:26:32] (03PS3) 10JMeybohm: Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) [15:26:34] (03PS1) 10JMeybohm: Drop profile::ci::kubernetes_config [puppet] - 10https://gerrit.wikimedia.org/r/884915 [15:26:58] (03CR) 10CI reject: [V: 04-1] Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 
(https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [15:27:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P43501 and previous config saved to /var/cache/conftool/dbconfig/20230130-152734-ladsgroup.json [15:28:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43502 and previous config saved to /var/cache/conftool/dbconfig/20230130-152809-root.json [15:29:47] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2029.codfw.wmnet with OS bullseye [15:29:55] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2029.codfw.wmnet with OS bullseye [15:29:55] ACKNOWLEDGEMENT - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service Slyngshede In setup https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:57] ACKNOWLEDGEMENT - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service Slyngshede In setup https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:20] (03PS4) 10JMeybohm: Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) [15:31:11] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:45] (03CR) 10JMeybohm: Switch the active staging cluster to codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [15:31:52] !log sukhe@cumin2002 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3051.esams.wmnet with OS bullseye [15:31:58] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp3051.esams.wmnet with OS bullseye completed: - cp3051 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [15:32:52] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [15:33:39] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39320/console" [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [15:34:53] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10colewhite) [15:35:34] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/884909 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [15:36:19] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:03] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:31] (03PS4) 10Jbond: redfish: remove dell specific name from Redfish class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [15:41:35] (03CR) 10Jelto: "I found one more kubestagemaster.svc.eqiad.wmnet in releases configuration:" [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [15:42:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P43503 and previous config saved to /var/cache/conftool/dbconfig/20230130-154241-ladsgroup.json [15:43:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43504 and previous config saved to /var/cache/conftool/dbconfig/20230130-154314-root.json [15:43:17] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:13] (03CR) 10Ottomata: [C: 03+1] "Ya weird that it doesn't work!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884896 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [15:46:12] (03PS5) 10JMeybohm: Switch the active staging cluster to codfw [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) [15:46:31] (03CR) 10JMeybohm: Switch the active staging cluster to codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [15:47:49] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39321/console" [puppet] - 10https://gerrit.wikimedia.org/r/884905 (https://phabricator.wikimedia.org/T327664) (owner: 10JMeybohm) [15:48:40] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2029.codfw.wmnet with reason: host reimage [15:50:45] (JobUnavailable) firing: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:51:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2029.codfw.wmnet with reason: host reimage [15:52:07] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:52] (03PS1) 10Ilias Sarantopoulos: feat: add json payload capability [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) [15:54:06] (03PS5) 10Jbond: redfish: remove dell specific name from Redfish class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [15:54:08] (03PS4) 10Jbond: redfish: store all manager info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [15:54:10] (03PS1) 
10Jbond: redfish: fix generation test [software/spicerack] - 10https://gerrit.wikimedia.org/r/884921 [15:54:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5026.eqsin.wmnet with OS bullseye [15:54:36] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5026.eqsin.wmnet with OS bullseye [15:55:20] (03CR) 10CI reject: [V: 04-1] feat: add json payload capability [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos) [15:55:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:57:25] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:38] (03CR) 10Ottomata: [C: 03+1] "+1 but one Q/naming nit." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey) [15:57:43] (03CR) 10CI reject: [V: 04-1] redfish: fix generation test [software/spicerack] - 10https://gerrit.wikimedia.org/r/884921 (owner: 10Jbond) [15:57:45] (03CR) 10Ottomata: [C: 03+1] "No worries if not." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey) [15:57:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P43505 and previous config saved to /var/cache/conftool/dbconfig/20230130-155747-ladsgroup.json [15:57:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [15:57:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [15:57:52] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [15:57:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [15:57:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [15:57:59] (03CR) 10CI reject: [V: 04-1] redfish: store all manager info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [15:58:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43506 and previous config saved to /var/cache/conftool/dbconfig/20230130-155802-ladsgroup.json [15:58:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43507 and previous config saved to /var/cache/conftool/dbconfig/20230130-155819-root.json [15:59:25] (03PS1) 10Andrew Bogott: Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) [15:59:44] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3050.esams.wmnet [15:59:58] 
!log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp3050.esams.wmnet [16:01:13] (03CR) 10CI reject: [V: 04-1] Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott) [16:03:27] !log upgrading idp-test to latest Java security update [16:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:47] (03PS2) 10Andrew Bogott: Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) [16:03:54] !log racreset cp3050.esams.wmnet: firmware cookbook iDRAC upgrade test [16:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:01] 10SRE-tools, 10Infrastructure-Foundations, 10Machine-Learning-Team, 10Patch-For-Review: httpbb with HTTP POSTs and json payload - https://phabricator.wikimedia.org/T328280 (10isarantopoulos) In the patch above I convert the dictionary passed in `form_body` field to json if there is the header `Content-Type... [16:05:00] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5026.eqsin.wmnet with OS bullseye [16:05:10] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5026.eqsin.wmnet with OS bullseye executed with errors: - cp5026 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[16:05:33] (03CR) 10Elukey: wmf-config: add new revision-score streams for EventGate main (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey) [16:05:42] RECOVERY - snapshot of s3 in codfw on backupmon1001 is OK: Last snapshot for s3 at codfw (db2139) taken on 2023-01-30 12:16:40 (1170 GiB, +0.7 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:05:43] (03CR) 10CI reject: [V: 04-1] Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott) [16:06:14] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5026.eqsin.wmnet with OS bullseye [16:06:21] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5026.eqsin.wmnet with OS bullseye [16:08:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43508 and previous config saved to /var/cache/conftool/dbconfig/20230130-160829-ladsgroup.json [16:08:35] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [16:10:00] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3050.esams.wmnet,service=cdn [16:10:00] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3050.esams.wmnet,service=ats-be [16:10:19] (03CR) 10Elukey: feat: add json payload capability (031 comment) [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos) [16:10:35] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3050.esams.wmnet [16:10:47] !log sukhe@cumin2002 END (PASS) - 
Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp3050.esams.wmnet [16:11:05] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3050.esams.wmnet [16:13:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43509 and previous config saved to /var/cache/conftool/dbconfig/20230130-161324-root.json [16:15:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:16:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [16:16:44] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2029.codfw.wmnet with OS bullseye [16:16:50] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2029.codfw.wmnet with OS bullseye completed: - cp2029 (**WARN**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[16:17:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp3050.esams.wmnet [16:17:34] (03PS2) 10Ilias Sarantopoulos: feat: add json payload capability [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) [16:18:13] (03CR) 10Ottomata: [C: 03+1] wmf-config: add new revision-score streams for EventGate main (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey) [16:19:17] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [16:21:25] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3050.esams.wmnet with OS bullseye [16:21:32] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp3050.esams.wmnet with OS bullseye [16:22:45] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=cdn [16:22:45] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=ats-be [16:22:45] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3051.esams.wmnet,service=cdn [16:22:46] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3051.esams.wmnet,service=ats-be [16:23:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P43510 and previous config saved to /var/cache/conftool/dbconfig/20230130-162336-ladsgroup.json [16:24:44] (03PS2) 10Elukey: wmf-config: add new revision-score streams for EventGate main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) [16:24:52] 
!log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1084.eqiad.wmnet [16:25:21] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4043.ulsfo.wmnet with OS bullseye [16:25:27] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4043.ulsfo.wmnet with OS bullseye [16:25:30] (03CR) 10Ilias Sarantopoulos: feat: add json payload capability (031 comment) [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos) [16:25:35] (03CR) 10Elukey: wmf-config: add new revision-score streams for EventGate main (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey) [16:25:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:25:50] (03CR) 10Clément Goubert: [C: 03+1] httpbb: Enable --retry_on_timeout so intermittent latency doesn't alert [puppet] - 10https://gerrit.wikimedia.org/r/884388 (https://phabricator.wikimedia.org/T323707) (owner: 10RLazarus) [16:26:52] (03PS1) 10Btullis: Revert "Increase the presto cluster size to 15 hosts again" [puppet] - 10https://gerrit.wikimedia.org/r/884928 [16:27:30] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [16:29:32] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Updated Java security policy in OpenJDK 11.0.18 - https://phabricator.wikimedia.org/T328331 (10MoritzMuehlenhoff) [16:30:04] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1630). [16:30:14] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1084.eqiad.wmnet [16:30:25] (03CR) 10Btullis: [C: 03+2] Revert "Increase the presto cluster size to 15 hosts again" [puppet] - 10https://gerrit.wikimedia.org/r/884928 (owner: 10Btullis) [16:30:38] (03CR) 10Ottomata: [C: 03+1] flink(-operator): Update to JRE 11.0.16 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884351 (owner: 10JMeybohm) [16:30:56] (03CR) 10Ottomata: [C: 03+1] "Either Brian or I will build these and deploy soon." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884351 (owner: 10JMeybohm) [16:31:08] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10MPhamWMF) [16:31:21] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10MPhamWMF) [16:35:22] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4043.ulsfo.wmnet with OS bullseye [16:35:27] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4043.ulsfo.wmnet with OS bullseye executed with errors: - cp4043 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[16:35:41] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4043.ulsfo.wmnet with OS bullseye [16:35:48] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4043.ulsfo.wmnet with OS bullseye [16:37:04] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P43511 and previous config saved to /var/cache/conftool/dbconfig/20230130-163842-ladsgroup.json [16:38:59] (03CR) 10Elukey: feat: add json payload capability (031 comment) [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos) [16:39:37] (03PS6) 10Jbond: redfish: Move delli specific functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [16:39:39] (03PS5) 10Jbond: redfish: store all manager info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [16:39:53] (03Abandoned) 10Jbond: redfish: fix generation test [software/spicerack] - 10https://gerrit.wikimedia.org/r/884921 (owner: 10Jbond) [16:40:42] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:06] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10bd808) #Toolhub does not have a working Kubernetes deployment outside of eqiad ({T288685}). Who should I work with to try and preve... 
[16:41:36] (03CR) 10Ottomata: [C: 03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey) [16:42:34] (03CR) 10CI reject: [V: 04-1] redfish: store all manager info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [16:43:29] (03CR) 10CI reject: [V: 04-1] redfish: Move delli specific functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [16:44:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3050.esams.wmnet with reason: host reimage [16:44:45] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage [16:44:48] (03CR) 10Ilias Sarantopoulos: feat: add json payload capability (031 comment) [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos) [16:46:03] (03CR) 10Klausman: [C: 03+1] admin_ng: add SANs to the inference endpoints for mlserve staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/883964 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [16:46:53] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 (10cmooney) [16:46:59] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (10cmooney) [16:48:01] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3050.esams.wmnet with reason: host reimage [16:48:54] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10LSobanski) [16:50:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage [16:51:20] (03PS7) 10Jbond: redfish: Move dell specific functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [16:52:32] (03CR) 10Andrew Bogott: "I will fix the linter issue but here's pcc results:" [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott) [16:53:36] (03PS3) 10Andrew Bogott: Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) [16:53:42] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Jelto) [16:53:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43512 and previous config saved to /var/cache/conftool/dbconfig/20230130-165348-ladsgroup.json [16:53:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [16:53:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [16:53:53] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [16:53:58] (03CR) 10CI reject: [V: 04-1] Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott) [16:53:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43513 and previous config saved to /var/cache/conftool/dbconfig/20230130-165359-ladsgroup.json [16:54:30] (03CR) 10Arturo Borrero Gonzalez: Rabbitmq: use OpenStack bpo packages for rabbit (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott) [16:54:47] (03CR) 10CI reject: [V: 04-1] redfish: Move dell specific functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [16:54:51] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Jelto) [16:56:09] 10SRE-OnFire, 10Discovery-Search (Current work), 10Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (10MPhamWMF) [16:56:20] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10LSobanski) [16:56:32] (03PS8) 10Jbond: redfish: Move dell specific functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [16:56:36] (03PS6) 10Jbond: redfish: store all manager info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [16:56:37] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4043.ulsfo.wmnet with reason: host reimage [16:56:40] 10SRE-OnFire, 10Discovery-Search (Current work), 10Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (10Gehel) [16:59:23] (03PS7) 10Jbond: redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [16:59:26] (03CR) 10CI reject: [V: 04-1] redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [16:59:40] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4043.ulsfo.wmnet with reason: host reimage [16:59:44] (03CR) 10CI reject: [V: 04-1] redfish: Move dell specific functionality to dell class [software/spicerack] - 
10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [17:02:50] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Updated Java security policy in OpenJDK 11.0.18 - https://phabricator.wikimedia.org/T328331 (10MoritzMuehlenhoff) [17:03:33] (03CR) 10CI reject: [V: 04-1] redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [17:04:09] (03PS8) 10Jbond: redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [17:04:25] ACKNOWLEDGEMENT - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service Slyngshede In setup. Downtimed https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43514 and previous config saved to /var/cache/conftool/dbconfig/20230130-170437-ladsgroup.json [17:04:43] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [17:06:56] (03PS4) 10Andrew Bogott: Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) [17:07:48] (03CR) 10CI reject: [V: 04-1] redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [17:08:18] (03CR) 10Bking: [C: 03+1] flink(-operator): Update to JRE 11.0.16 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884351 (owner: 10JMeybohm) [17:09:38] (03PS2) 10Bking: flink(-operator): Update to JRE 11.0.16 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884351 (owner: 10JMeybohm) [17:10:04] (03CR) 10Bking: [V: 03+2] flink(-operator): Update to JRE 11.0.16 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884351 (owner: 
10JMeybohm) [17:10:07] (03CR) 10Bking: [V: 03+2 C: 03+2] flink(-operator): Update to JRE 11.0.16 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884351 (owner: 10JMeybohm) [17:10:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:10:54] (03CR) 10Ebernhardson: Create scap deployment source for search airflow v2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883678 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson) [17:11:51] (03PS5) 10Andrew Bogott: Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) [17:11:53] (03CR) 10Jbond: redfish: Move dell specific functionality to dell class (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [17:12:13] (03PS1) 10Hokwelum: The rsync module have been changed from download.kiwix.org to wmf.download.kiwix.org, See phab ticket for more information [puppet] - 10https://gerrit.wikimedia.org/r/884965 [17:12:25] 10SRE, 10Traffic-Icebox: varnish warnings: Invalid conf pair: lg_dirty_mult/lg_chunk - https://phabricator.wikimedia.org/T253379 (10BCornwall) 05Open→03Resolved a:03BCornwall This has already been removed on 2022-11-11 via: Commit: 9943816a2ee487128f77c18cd2b104ebe1c0cd50 Change-Id: Ib55afb0acc28eab197c... 
[17:12:34] (03CR) 10CI reject: [V: 04-1] The rsync module have been changed from download.kiwix.org to wmf.download.kiwix.org, See phab ticket for more information [puppet] - 10https://gerrit.wikimedia.org/r/884965 (owner: 10Hokwelum) [17:12:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3050.esams.wmnet with OS bullseye [17:12:47] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp3050.esams.wmnet with OS bullseye completed: - cp3050 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [17:14:54] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/output/884922/39324/" [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott) [17:15:01] (03PS2) 10Hokwelum: Change kiwix rsync module [puppet] - 10https://gerrit.wikimedia.org/r/884965 [17:15:21] (03CR) 10CI reject: [V: 04-1] Change kiwix rsync module [puppet] - 10https://gerrit.wikimedia.org/r/884965 (owner: 10Hokwelum) [17:15:26] 10SRE, 10PyBal, 10Traffic-Icebox: Add graceful-restart capability to PyBal - https://phabricator.wikimedia.org/T246788 (10BCornwall) Given the intention of moving away from LVS, is this still a feature we want implemented? i.e. is it worth pursuing this when LVS may be replaced in a few years? 
[17:19:08] (03PS3) 10Hokwelum: Change kiwix rsync module [puppet] - 10https://gerrit.wikimedia.org/r/884965 (https://phabricator.wikimedia.org/T260223) [17:19:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P43515 and previous config saved to /var/cache/conftool/dbconfig/20230130-171944-ladsgroup.json [17:20:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:21:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5026.eqsin.wmnet with OS bullseye [17:21:56] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5026.eqsin.wmnet with OS bullseye completed: - cp5026 (**PASS**) - Removed from Puppet and PuppetDB if present -... 
[17:22:04] !log bking@build2001 rebuilding docker images for 884351 [17:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:33] (03PS9) 10Jbond: redfish: Move dell specific functionality to dell class [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 [17:22:35] (03PS9) 10Jbond: redfish: store all OOB info for later use [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 [17:24:02] !log bking@build2001 rebuilding docker images for 884351 complete [17:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:49] (03PS1) 10Ottomata: [WIP] Add dse-k8s-services/mediawiki-page-content-change-enrichment helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305) [17:26:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:26:51] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:27:04] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4043.ulsfo.wmnet with OS bullseye [17:27:10] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4043.ulsfo.wmnet with OS bullseye completed: - cp4043 (**WARN**) - Removed from Puppet and PuppetDB if present -... 
[17:28:29] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:31:05] (03CR) 10Ottomata: Configure search platform airflow 2 instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson) [17:31:09] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4043.ulsfo.wmnet [17:31:44] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [17:31:49] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4051.ulsfo.wmnet with OS bullseye [17:32:03] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4051.ulsfo.wmnet with OS bullseye [17:33:01] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:12] (03PS1) 10Legoktm: Support new style of table of contents [extensions/GlobalUserPage] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884929 (https://phabricator.wikimedia.org/T327942) [17:33:45] PROBLEM - IPMI Sensor Status on mw2330 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:34:06] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3050.esams.wmnet,service=cdn [17:34:07] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3050.esams.wmnet,service=ats-be [17:34:42] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[17:34:44] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:34:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P43516 and previous config saved to /var/cache/conftool/dbconfig/20230130-173450-ladsgroup.json [17:35:13] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [17:35:59] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:36:13] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:36:35] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5026.eqsin.wmnet,service=cdn [17:36:36] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5026.eqsin.wmnet,service=ats-be [17:40:47] PROBLEM - IPMI Sensor Status on mw2332 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:41:32] (03PS1) 10Jbond: redfish: add system_manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 [17:43:13] (03CR) 10Jbond: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/836749 (owner: 10Jbond) [17:43:25] (03CR) 10Jbond: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/836757 (owner: 10Jbond) [17:43:26] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4051.ulsfo.wmnet with OS bullseye [17:43:31] 10SRE, 10Traffic: Upgrade Traffic 
hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4051.ulsfo.wmnet with OS bullseye executed with errors: - cp4051 (**FAIL**) - Downtimed on Icinga/Alertmanager -... [17:43:53] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4051.ulsfo.wmnet with OS bullseye [17:43:58] (03CR) 10Jbond: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/884978 (owner: 10Jbond) [17:44:00] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4051.ulsfo.wmnet with OS bullseye [17:45:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott) [17:45:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:46:58] (03PS1) 10Jdlrobson: Fix grid blowout with limited width turned off [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884930 (https://phabricator.wikimedia.org/T327423) [17:49:21] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3052.esams.wmnet [17:49:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43517 and previous config saved to /var/cache/conftool/dbconfig/20230130-174957-ladsgroup.json [17:50:02] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [17:50:45] (JobUnavailable) firing: (2) Reduced 
availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:51:11] PROBLEM - IPMI Sensor Status on maps2009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:52:14] ^ seems to be codfw rack B6 [17:52:18] (03PS3) 10Urbanecm: [Growth] Remove wgGERecentChangesUnstarredMenteesFilterEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884427 [17:52:24] maps2009, mw2330, etc. [17:52:27] (03CR) 10Urbanecm: [C: 03+2] [Growth] Remove wgGERecentChangesUnstarredMenteesFilterEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884427 (owner: 10Urbanecm) [17:52:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884427 (owner: 10Urbanecm) [17:53:13] (03Merged) 10jenkins-bot: [Growth] Remove wgGERecentChangesUnstarredMenteesFilterEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884427 (owner: 10Urbanecm) [17:53:30] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:884427|[Growth] Remove wgGERecentChangesUnstarredMenteesFilterEnabled]] [17:53:38] sukhe: probably worth mentioning in -dcops so everything else doesn’t drown out [17:53:55] RhinosF1: yeah, going to file a task, sometimes there are recoveries so was waiting a bit [17:54:49] Cool :) [17:56:44] (03CR) 10Andrew Bogott: [C: 03+1] Change kiwix rsync module [puppet] - 10https://gerrit.wikimedia.org/r/884965 (https://phabricator.wikimedia.org/T260223) (owner: 10Hokwelum) [17:57:03] 10SRE, 10ops-codfw, 10DC-Ops: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] - 
https://phabricator.wikimedia.org/T328343 (10ssingh) [17:57:13] (03CR) 10Andrew Bogott: [C: 03+2] Change kiwix rsync module [puppet] - 10https://gerrit.wikimedia.org/r/884965 (https://phabricator.wikimedia.org/T260223) (owner: 10Hokwelum) [17:57:20] 10SRE, 10ops-codfw, 10DC-Ops: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] - https://phabricator.wikimedia.org/T328343 (10ssingh) p:05Triage→03Medium [17:57:21] (03CR) 10Hashar: [C: 03+1] jenkins: add hieradata config for Scap3-based deployments [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [17:57:52] 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10akosiaris) It is intentional indeed. `-devel` because obsolete. More information in T306996#7912881 and overall that task. [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1800) [18:00:05] ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T1800) [18:01:29] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:884427|[Growth] Remove wgGERecentChangesUnstarredMenteesFilterEnabled]] (duration: 07m 59s) [18:04:19] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4051.ulsfo.wmnet with reason: host reimage [18:05:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:06:16] 10SRE, 10Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (10BCornwall) [18:06:54] (03PS1) 10Bking: flink-k8s-operator: bump internal version [deployment-charts] - 10https://gerrit.wikimedia.org/r/884983 (https://phabricator.wikimedia.org/T324576) [18:07:26] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4051.ulsfo.wmnet with reason: host reimage [18:07:29] PROBLEM - IPMI Sensor Status on mw2326 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:08:51] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp3052.esams.wmnet [18:10:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3052.esams.wmnet with OS bullseye [18:10:26] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye [18:10:47] RECOVERY - IPMI Sensor Status on mw2332 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:13:08] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs 2023-01-31 - https://phabricator.wikimedia.org/T327404 (10Papaul) Postponing the PDU maintenance for 2023-02-02 for possible bad weather in Dallas tomorrow. [18:13:27] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs 2023-02-02 - https://phabricator.wikimedia.org/T327404 (10Papaul) [18:19:03] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3052.esams.wmnet with OS bullseye [18:19:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:19:09] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye executed with errors: - cp3052 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[18:19:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3052.esams.wmnet with OS bullseye [18:19:35] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye [18:20:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:21:37] RECOVERY - IPMI Sensor Status on maps2009 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:22:59] (03CR) 10Hashar: [C: 04-1] "Configuration bits for the release Jenkins should be moved up to profile::releases::mediawiki . 
And later on the CI Jenkins will have its" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [18:23:58] (03CR) 10Hashar: "From the parent change it should be done using hiera configuration by setting the `jenkins::use_scap3_deployment` flag in the `hiera/hosts" [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [18:26:01] PROBLEM - IPMI Sensor Status on kubernetes2009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:27:21] PROBLEM - IPMI Sensor Status on mw2334 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:29:13] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [18:29:15] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [18:31:08] RECOVERY - snapshot of s2 in codfw on backupmon1001 is OK: Last snapshot for s2 at codfw (db2097) taken on 2023-01-30 17:17:18 (836 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [18:32:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [18:33:48] RECOVERY - IPMI Sensor Status on mw2330 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:34:20] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4051.ulsfo.wmnet with OS bullseye [18:34:27] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4051.ulsfo.wmnet with OS bullseye completed: - cp4051 (**WARN**) - Removed from Puppet and PuppetDB if present -... [18:37:24] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3052.esams.wmnet'] [18:37:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp3052.esams.wmnet'] [18:37:52] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp3052.esams.wmnet with OS bullseye [18:38:00] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye executed with errors: - cp3052 (**FAIL**) - Removed from Puppet and PuppetDB if p... 
[18:38:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3052.esams.wmnet with OS bullseye [18:38:13] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye [18:41:31] (03CR) 10Kosta Harlan: GrowthExperiments: Update campaign configuration (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T790650) (owner: 10Gergő Tisza) [18:43:28] (03PS3) 10Ebernhardson: Configure search platform airflow 2 instance [puppet] - 10https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) [18:43:30] (03CR) 10Ebernhardson: Configure search platform airflow 2 instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson) [18:44:31] (03CR) 10Kosta Harlan: GrowthExperiments: Update campaign configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T790650) (owner: 10Gergő Tisza) [18:44:33] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39326/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [18:45:24] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3052.esams.wmnet with OS bullseye [18:45:31] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye executed with errors: - cp3052 (**FAIL**) - Removed from Puppet and PuppetDB if p... 
[18:45:49] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3052.esams.wmnet'] [18:46:04] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp3052.esams.wmnet'] [18:46:08] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp3052.esams.wmnet'] [18:46:39] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp3052.esams.wmnet'] [18:50:46] (03CR) 10RLazarus: [C: 04-1] "Thanks for the patch! We ought to support this in httpbb -- the only reason it's not there already is that we haven't needed it yet." [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos) [18:52:33] (03PS1) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [18:53:07] (03PS10) 10Slyngshede: P:IDM Configure OIDC and LDAP. 
[puppet] - 10https://gerrit.wikimedia.org/r/884881 [18:55:51] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [18:56:40] RECOVERY - IPMI Sensor Status on kubernetes2009 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:56:56] (03CR) 10Ottomata: [C: 03+1] flink-k8s-operator: bump internal version [deployment-charts] - 10https://gerrit.wikimedia.org/r/884983 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [18:58:02] RECOVERY - IPMI Sensor Status on mw2334 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:00:26] (03Abandoned) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/869255 (owner: 10PipelineBot) [19:00:33] (03Abandoned) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/869260 (owner: 10PipelineBot) [19:00:52] (03Abandoned) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/826957 (owner: 10PipelineBot) [19:01:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp3052.esams.wmnet with OS bullseye [19:01:03] (03PS2) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/872978 (owner: 10PipelineBot) [19:01:06] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye [19:05:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:10:45] (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:15:37] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4051.ulsfo.wmnet [19:15:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:16:36] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4044.ulsfo.wmnet with OS bullseye [19:16:43] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:16:46] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4044.ulsfo.wmnet with OS bullseye [19:18:10] 10SRE, 10PyBal, 10Traffic-Icebox: Add graceful-restart capability to PyBal - https://phabricator.wikimedia.org/T246788 (10ayounsi) It's fine to close this task as long as BFD and graceful-shutdown are on the roadmap for the new L4LB. Not directly related to LVS but the task description on {T328338} explains... 
[19:19:07] (03CR) 10Dzahn: [C: 03+2] "per https://debmonitor.wikimedia.org/packages/atftpd it's only installed on install* machines and per sudo cumin 'C:role::installserver' '" [puppet] - 10https://gerrit.wikimedia.org/r/884310 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [19:19:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:08] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3052.esams.wmnet with reason: host reimage [19:25:21] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3052.esams.wmnet with reason: host reimage [19:26:44] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4044.ulsfo.wmnet with OS bullseye [19:26:50] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4044.ulsfo.wmnet with OS bullseye executed with errors: - cp4044 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[19:26:59] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4044.ulsfo.wmnet with OS bullseye [19:27:05] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4044.ulsfo.wmnet with OS bullseye [19:31:42] 10SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (10mpopov) [19:32:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:32] (03PS1) 10Jbond: rotate-snmp: convert to cookbook classes and use secrets for passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 [19:33:27] (03CR) 10Bking: [C: 03+2] flink-k8s-operator: bump internal version [deployment-charts] - 10https://gerrit.wikimedia.org/r/884983 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [19:34:22] 10SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (10mpopov) I suppose we could also have aliases for the bastion hosts so instead of connecting to `bast3006` users can specify `bast-esams` (which would actually be a huge improvement) but geolocating w... [19:34:31] (03CR) 10CI reject: [V: 04-1] rotate-snmp: convert to cookbook classes and use secrets for passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 (owner: 10Jbond) [19:35:47] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [19:36:12] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[19:37:23] 10SRE, 10Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (10BCornwall) [19:43:11] (03PS1) 10Jbond: reposync: switch from copy_tree to copytree [software/spicerack] - 10https://gerrit.wikimedia.org/r/884998 [19:43:55] 10SRE, 10Prod-Kubernetes, 10PyBal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10ayounsi) [19:44:01] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: Calico and BFD - https://phabricator.wikimedia.org/T328338 (10ayounsi) [19:44:19] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2033.codfw.wmnet with OS bullseye [19:44:26] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2033.codfw.wmnet with OS bullseye [19:45:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:47:14] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage [19:48:26] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3052.esams.wmnet with OS bullseye [19:48:36] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp3052.esams.wmnet with OS bullseye completed: - cp3052 (**PASS**) - Removed from Puppet and PuppetDB if present -... 
[19:50:18] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage [19:50:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:51:34] 10SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (10mpopov) [19:52:44] 10SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (10mpopov) [19:53:03] (03PS9) 10Samtar: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: 10Stang) [19:53:51] (03CR) 10Samtar: "(reset my CR, T310974#8368960 is stalling afaict)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: 10Stang) [19:56:02] (03PS1) 10Zabe: slwiki: Raise AF emergency disable treshold+count [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885002 [19:57:47] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3052.esams.wmnet,service=cdn [19:57:47] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3052.esams.wmnet,service=ats-be [19:58:11] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [19:58:50] 10SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (10RhinosF1) >>! In T328361#8571672, @mpopov wrote: > I suppose we could also have aliases for the bastion hosts so instead of connecting to `bast3006` users can specify `bast-esams` (which would actual... 
[20:00:29] (03PS2) 10Zabe: slwiki: Raise AF emergency disable treshold+count [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885002 [20:00:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:00:49] (03PS2) 10Ottomata: [WIP] Add dse-k8s-services/mediawiki-page-content-change-enrichment helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305) [20:02:11] (03PS3) 10Ottomata: Add dse-k8s-services/mediawiki-page-content-change-enrichment helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305) [20:02:13] 10SRE, 10Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (10BCornwall) [20:03:21] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2033.codfw.wmnet with reason: host reimage [20:03:30] 10SRE, 10Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (10BCornwall) @Vgutierrez I've confirmed the remaining services use TLSv1.2+ except for ldap-codfw1dev and ldap-labtest. I'm having a little trouble accessing those servers - are they still... 
[20:05:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:06:21] (03CR) 10CI reject: [V: 04-1] Add dse-k8s-services/mediawiki-page-content-change-enrichment helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [20:06:34] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2033.codfw.wmnet with reason: host reimage [20:11:45] (03PS3) 10Urbanecm: slwiki: Raise AF emergency disable treshold+count [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885002 (https://phabricator.wikimedia.org/T328366) (owner: 10Zabe) [20:11:50] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885002 (https://phabricator.wikimedia.org/T328366) (owner: 10Zabe) [20:12:31] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4044.ulsfo.wmnet with OS bullseye [20:12:37] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4044.ulsfo.wmnet with OS bullseye completed: - cp4044 (**PASS**) - Removed from Puppet and PuppetDB if present -... [20:12:45] 10SRE, 10Traffic-Icebox: Add more detailed instructions to the "sec-advice" page - https://phabricator.wikimedia.org/T241309 (10BCornwall) 05Open→03Declined As there's already a link to the browser recommendation wikitech page, there's no need to duplicate efforts. 
[20:12:49] 10SRE, 10Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (10BCornwall) 05Open→03In progress [20:12:55] 10SRE, 10Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (10BCornwall) a:03BCornwall [20:13:55] (03CR) 10Zabe: [C: 03+2] slwiki: Raise AF emergency disable treshold+count [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885002 (https://phabricator.wikimedia.org/T328366) (owner: 10Zabe) [20:14:38] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4044.ulsfo.wmnet [20:14:48] (03Merged) 10jenkins-bot: slwiki: Raise AF emergency disable treshold+count [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885002 (https://phabricator.wikimedia.org/T328366) (owner: 10Zabe) [20:15:38] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:15:40] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye [20:15:46] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4052.ulsfo.wmnet with OS bullseye [20:16:01] !log zabe@deploy1002 Started scap: Backport for [[gerrit:885002|slwiki: Raise AF emergency disable treshold+count (T328366)]] [20:17:39] !log zabe@deploy1002 zabe: Backport for [[gerrit:885002|slwiki: Raise AF emergency disable treshold+count (T328366)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:23:34] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:885002|slwiki: Raise AF emergency disable treshold+count (T328366)]] (duration: 07m 32s) [20:25:57] (03PS1) 10Majavah: hieradata: drop ldap-labtest acme-chier cert [puppet] - 10https://gerrit.wikimedia.org/r/885026 
[20:26:15] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2033.codfw.wmnet with OS bullseye [20:26:21] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2033.codfw.wmnet with OS bullseye completed: - cp2033 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [20:26:42] (03PS3) 10Gergő Tisza: GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T790650) [20:26:45] (03CR) 10Ottomata: [C: 03+2] "Merging to test deployment, skipping the helmfile lint error. Something must be wrong with a .Values.kafka_brokers fixture for this helmf" [deployment-charts] - 10https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [20:27:07] (03CR) 10CI reject: [V: 04-1] GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T790650) (owner: 10Gergő Tisza) [20:29:11] 10SRE, 10Traffic-Icebox: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (10taavi) >>! In T238518#8571817, @BCornwall wrote: > @Vgutierrez I've confirmed the remaining services use TLSv1.2+ except for ldap-codfw1dev and ldap-labtest. I'm having a little trouble... 
[20:29:27] (03CR) 10CI reject: [V: 04-1] Add dse-k8s-services/mediawiki-page-content-change-enrichment helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [20:30:33] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add dse-k8s-services/mediawiki-page-content-change-enrichment helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/884972 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [20:35:32] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bullseye [20:35:38] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4052.ulsfo.wmnet with OS bullseye executed with errors: - cp4052 (**FAIL**) - Downtimed on Icinga/Alertmanager -... [20:35:55] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye [20:36:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4052.ulsfo.wmnet with OS bullseye [20:36:13] 10SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (10mpopov) > You’d then get a scary warning about a key mismatch when the server was changed. > > Surely, this host doesn’t exist anymore is a clearer error. Oh you're right! That's a great point, tha... 
[20:45:35] (03PS1) 10Ottomata: mediawiki-page-content-change-enrichment - bump image version to v1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/885032 (https://phabricator.wikimedia.org/T325305) [20:46:53] (03PS2) 10Ottomata: mediawiki-page-content-change-enrichment - bump image version to v1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/885032 (https://phabricator.wikimedia.org/T325305) [20:49:05] (03CR) 10Ottomata: [V: 03+2 C: 03+2] mediawiki-page-content-change-enrichment - bump image version to v1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/885032 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [20:50:42] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:51:25] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:55:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:56:36] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [20:59:41] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T2100). [21:00:05] tgr, musikanimal, legoktm, jdlrobson, and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:12] o/ [21:00:20] o/ [21:00:23] i can deploy today [21:00:40] (03PS3) 10Urbanecm: InitialiseSettings: add zhwiki to wgPageAssessmentsSubprojects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884474 (https://phabricator.wikimedia.org/T326387) (owner: 10MusikAnimal) [21:00:44] (03CR) 10Urbanecm: [C: 03+2] InitialiseSettings: add zhwiki to wgPageAssessmentsSubprojects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884474 (https://phabricator.wikimedia.org/T326387) (owner: 10MusikAnimal) [21:00:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:01:02] o/ [21:01:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884474 (https://phabricator.wikimedia.org/T326387) (owner: 10MusikAnimal) [21:01:24] (03Merged) 10jenkins-bot: InitialiseSettings: add zhwiki to wgPageAssessmentsSubprojects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884474 (https://phabricator.wikimedia.org/T326387) (owner: 10MusikAnimal) [21:01:34] hi tgr_, CI seems to dislike the campaigns patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/884153/). can you check please? 
[21:01:38] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:884474|InitialiseSettings: add zhwiki to wgPageAssessmentsSubprojects (T326387)]] [21:01:39] hi I'm here [21:01:46] T326387: Deploy PageAssessments to Chinese Wikipedia - https://phabricator.wikimedia.org/T326387 [21:01:56] hi legoktm [21:02:10] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=cdn [21:02:10] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=ats-be [21:02:17] present [21:02:21] (03PS2) 10Urbanecm: Support new style of table of contents [extensions/GlobalUserPage] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884929 (https://phabricator.wikimedia.org/T327942) (owner: 10Legoktm) [21:02:34] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [21:02:37] (03CR) 10Urbanecm: [C: 03+2] Support new style of table of contents [extensions/GlobalUserPage] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884929 (https://phabricator.wikimedia.org/T327942) (owner: 10Legoktm) [21:02:49] (03CR) 10Urbanecm: [C: 03+2] Fix grid blowout with limited width turned off [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884930 (https://phabricator.wikimedia.org/T327423) (owner: 10Jdlrobson) [21:03:21] !log urbanecm@deploy1002 urbanecm and musikanimal: Backport for [[gerrit:884474|InitialiseSettings: add zhwiki to wgPageAssessmentsSubprojects (T326387)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:03:33] musikanimal: pulled onto mwdebug1001, let me know how it works :) [21:03:45] will do! 
might take me a few mins, sorry I wasn't prepared [21:03:55] sure [21:04:14] (03PS4) 10Gergő Tisza: GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T321370) [21:04:29] (03Merged) 10jenkins-bot: Support new style of table of contents [extensions/GlobalUserPage] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884929 (https://phabricator.wikimedia.org/T327942) (owner: 10Legoktm) [21:04:38] urbanecm: oops, sorry. last minute changes. Should be fixed now. [21:04:47] np, it happens. [21:08:35] so this has to only do with data storage. That must persist across prod and the debug servers, right? We don't have a separate db for mwdebug* ? [21:08:43] indeed [21:09:57] okay. Well my issue is I can't find an example... I need an article that uses a "task force" in addition to a WikiProject. Might take me another 5-10 minutes... unfortunately Whatlinkshere isn't giving good results because the task force template is also used by normal WikiProjects [21:10:19] what's a task force? maybe i can help? [21:10:30] (03CR) 10Gergő Tisza: GrowthExperiments: Update campaign configuration (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T321370) (owner: 10Gergő Tisza) [21:10:44] (03PS1) 10Ottomata: mw-page-content-change-enrichment - Disable kafka egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/885035 (https://phabricator.wikimedia.org/T325305) [21:11:05] a task force is a subset of a WikiProject. https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Guide/Task_forces is the enwiki documentation, zhwiki does the same thing [21:11:36] I'm running some queries on prod to try to find an example. 
The WikiProject name would have a slash in it (as it is a "subproject" of the WikiProject, so to speak) [21:11:51] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2019.codfw.wmnet: Replace Cassandra keys & certs - eevans@cumin1001 [21:12:16] PROBLEM - IPMI Sensor Status on mw2333 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:12:23] tz [21:13:40] musikanimal: does https://zh.wikipedia.org/wiki/WikiProject:%E7%94%B5%E5%AD%90%E6%B8%B8%E6%88%8F/%E5%8F%B2%E5%85%8B%E5%A8%81%E5%B0%94%E8%89%BE%E5%B0%BC%E5%85%8B%E6%96%AF work? [21:14:00] possibly [21:14:15] https://zh.wikipedia.org/wiki/Talk:%E5%90%89%E6%99%AE%E6%81%B0%E5%85%8B%E6%B8%85%E7%9C%9F%E5%AF%BA for sure uses a task force, but I'm not seeing the flag being set in the db after I do a null edit :( [21:14:41] might be because it uses a job? [21:14:54] it usually populates immediately if I do a null edit [21:14:58] ah [21:14:58] but I could be testing this wrong [21:15:19] since the site doesn't break, i can sync and let you and your team figure out what's happening later? [21:15:21] so I'm like 99% sure the patch is harmless. Page assessments aren't even being used right now by anything [21:15:23] yeah [21:15:27] okay, syncing [21:15:29] let's just move forward :) [21:15:31] thanks! 
[21:16:18] (03PS2) 10Urbanecm: Enable WelcomeSurvey at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883301 (https://phabricator.wikimedia.org/T325376) (owner: 10Gergő Tisza) [21:16:24] (03CR) 10Urbanecm: [C: 03+2] Enable WelcomeSurvey at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883301 (https://phabricator.wikimedia.org/T325376) (owner: 10Gergő Tisza) [21:17:11] (03Merged) 10jenkins-bot: Enable WelcomeSurvey at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883301 (https://phabricator.wikimedia.org/T325376) (owner: 10Gergő Tisza) [21:17:13] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrichment - Disable kafka egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/885035 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [21:17:40] tgr_: should we copy the messages from https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/884152 on wiki? or is it not important enough for the initial rollout? [21:17:51] it's marked as soft depend-on, so that's why i'm asking [21:17:51] (03Merged) 10jenkins-bot: Fix grid blowout with limited width turned off [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884930 (https://phabricator.wikimedia.org/T327423) (owner: 10Jdlrobson) [21:18:23] Not important, the real rollout is when something starts to reference this in landing page URLs. [21:18:40] ah, makes sense. i'll go ahead then. [21:20:40] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [21:21:29] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:884474|InitialiseSettings: add zhwiki to wgPageAssessmentsSubprojects (T326387)]] (duration: 19m 51s) [21:21:34] T326387: Deploy PageAssessments to Chinese Wikipedia - https://phabricator.wikimedia.org/T326387 [21:21:41] musikanimal: all live now. [21:21:48] thank you! 
[21:21:48] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:883301|Enable WelcomeSurvey at viwiki (T325376)]], [[gerrit:884930|Fix grid blowout with limited width turned off (T327423)]], [[gerrit:884929|Support new style of table of contents (T327942)]] [21:21:50] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2019.codfw.wmnet: Replace Cassandra keys & certs - eevans@cumin1001 [21:21:56] np [21:21:57] T327423: Horizontal scrolling when content contains extra wide elements when limited width is disabled and page tools is enabled - https://phabricator.wikimedia.org/T327423 [21:21:57] T325376: Welcome survey: communication and deployment to Vietnamese Wikipedia - https://phabricator.wikimedia.org/T325376 [21:21:58] T327942: Table of contents displays wrong on global user pages on Vector 2022 - https://phabricator.wikimedia.org/T327942 [21:23:27] !log urbanecm@deploy1002 tgr and urbanecm and jdlrobson and legoktm: Backport for [[gerrit:883301|Enable WelcomeSurvey at viwiki (T325376)]], [[gerrit:884930|Fix grid blowout with limited width turned off (T327423)]], [[gerrit:884929|Support new style of table of contents (T327942)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:23:44] testing [21:23:48] thanks [21:23:53] tgr_: Jdlrobson: please test too ^^ [21:24:09] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2020.codfw.wmnet: Replace Cassandra keys & certs - eevans@cumin1001 [21:24:10] works [21:24:13] ty [21:24:23] urbanecm: looking! [21:24:24] urbanecm: lgtm! thanks [21:24:28] thanks! 
[21:24:38] (verified with https://test.wikipedia.org/wiki/User:Legoktm?useskin=vector-2022) [21:25:08] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2030.codfw.wmnet with OS bullseye [21:25:16] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2030.codfw.wmnet with OS bullseye [21:25:31] urbanecm: LGTM [21:25:37] arlolra: hi, are you around for your MW core / https://gerrit.wikimedia.org/r/c/884138/ backport? looks like a no-op just adding some profiling, but still wouldn't like to deploy it alone :)) [21:25:43] yup [21:25:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:25:50] thanks Jdlrobson, deploying [21:25:59] (03CR) 10Urbanecm: [C: 03+2] Try to determine what's adding to Parsoid init times [core] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884138 (https://phabricator.wikimedia.org/T328201) (owner: 10Arlolra) [21:26:22] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 227.24 ms [21:26:38] arlolra: will you want to test it at a debug server? or should i just sync? [21:26:39] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS bullseye [21:26:45] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4052.ulsfo.wmnet with OS bullseye completed: - cp4052 (**WARN**) - Removed from Puppet and PuppetDB if present -... 
[21:26:54] urbanecm: I can try a quick test [21:27:03] okay, i'll ping you when ready [21:27:15] 10SRE-OnFire, 10Sustainability (Incident Followup): 2023-01-10 eqsin network outage - https://phabricator.wikimedia.org/T328354 (10andrea.denisse) [21:27:42] (03PS5) 10Urbanecm: GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T321370) (owner: 10Gergő Tisza) [21:27:48] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T321370) (owner: 10Gergő Tisza) [21:29:01] (03Merged) 10jenkins-bot: GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T321370) (owner: 10Gergő Tisza) [21:30:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:31:41] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:883301|Enable WelcomeSurvey at viwiki (T325376)]], [[gerrit:884930|Fix grid blowout with limited width turned off (T327423)]], [[gerrit:884929|Support new style of table of contents (T327942)]] (duration: 09m 52s) [21:31:48] T327423: Horizontal scrolling when content contains extra wide elements when limited width is disabled and page tools is enabled - https://phabricator.wikimedia.org/T327423 [21:31:49] T325376: Welcome survey: communication and deployment to Vietnamese Wikipedia - https://phabricator.wikimedia.org/T325376 [21:31:49] T327942: Table of contents displays wrong on global user pages on Vector 2022 - https://phabricator.wikimedia.org/T327942 [21:31:51] legoktm: tgr_: Jdlrobson: all live :) [21:32:09] perfect :D [21:32:25] 
thx! [21:32:32] (03CR) 10Cwhite: [C: 03+1] Send rsyslog output for vrts apache logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/884909 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [21:33:34] (03CR) 10Cwhite: [C: 03+1] "Sounds good to me!" [alerts] - 10https://gerrit.wikimedia.org/r/884349 (https://phabricator.wikimedia.org/T202307) (owner: 10Herron) [21:33:59] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet [21:33:59] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:884153|GrowthExperiments: Update campaign configuration (T321370)]] [21:34:05] T321370: Thank You Pages: custom account creation pages for sv, it, ja, fr, nl - https://phabricator.wikimedia.org/T321370 [21:34:18] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2020.codfw.wmnet: Replace Cassandra keys & certs - eevans@cumin1001 [21:34:35] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [21:34:47] (03CR) 10Andrew Bogott: [C: 03+2] Rabbitmq: use OpenStack bpo packages for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/884922 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott) [21:34:53] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [21:35:41] !log urbanecm@deploy1002 tgr and urbanecm: Backport for [[gerrit:884153|GrowthExperiments: Update campaign configuration (T321370)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:35:56] tgr_: second patch's available for testing, can you check? 
[21:36:41] (PS1) Urbanecm: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/885008
[21:36:57] urbanecm: it works
[21:37:04] great, syncing
[21:40:27] (CR) Urbanecm: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/885008 (owner: Urbanecm)
[21:41:16] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/885008 (owner: Urbanecm)
[21:41:40] Ack! thanks urbanecm !
[21:41:44] no problem
[21:42:47] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:884153|GrowthExperiments: Update campaign configuration (T321370)]] (duration: 08m 47s)
[21:42:51] T321370: Thank You Pages: custom account creation pages for sv, it, ja, fr, nl - https://phabricator.wikimedia.org/T321370
[21:42:54] tgr_: and live
[21:43:12] thanks!
[21:43:43] (Merged) jenkins-bot: Try to determine what's adding to Parsoid init times [core] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884138 (https://phabricator.wikimedia.org/T328201) (owner: Arlolra)
[21:43:53] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2030.codfw.wmnet with reason: host reimage
[21:44:12] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 72 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:44:20] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:884138|Try to determine what's adding to Parsoid init times (T328201)]], [[gerrit:885008|Update interwiki cache]]
[21:44:26] T328201: Investigate increase in slow parses - https://phabricator.wikimedia.org/T328201
[21:46:03] !log urbanecm@deploy1002 arlolra and urbanecm: Backport for [[gerrit:884138|Try to determine what's adding to Parsoid init times (T328201)]], [[gerrit:885008|Update interwiki cache]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[21:46:13] arlolra: your patch's at mwdebug1001, as promised :)
[21:46:24] alrighty
[21:47:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2030.codfw.wmnet with reason: host reimage
[21:48:13] (CR) Cwhite: [V: -1 C: -1] rsyslog: allow subject name validation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[21:50:17] SRE, Traffic, Patch-For-Review: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (Jcross) Hi @BBlack and @Vgutierrez - could you please provide an update or some guidance around your expected timeline for this? Please let us know if anything else is required on our end...
[21:50:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:50:55] urbanecm: ok, let's proceed
[21:50:59] okay, doing
[21:51:51] (PS2) Cwhite: role, profile: remove elasticsearch role and supporting profile [puppet] - https://gerrit.wikimedia.org/r/879889
[21:54:26] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:55:14] (PS1) Dreamy Jazz: Disable write old for CheckUserLog reason field for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/885041 (https://phabricator.wikimedia.org/T233004)
[21:55:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:56:44] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:884138|Try to determine what's adding to Parsoid init times (T328201)]], [[gerrit:885008|Update interwiki cache]] (duration: 12m 24s)
[21:56:49] T328201: Investigate increase in slow parses - https://phabricator.wikimedia.org/T328201
[21:56:52] arlolra: and, live
[21:56:58] thank you
[21:57:15] no problem
[21:57:21] i think we're done with the window
[22:00:05] Reedy, sbassett, Maryum, and manfredi: Your horoscope predicts another unfortunate Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T2200).
[22:08:26] (CR) RLazarus: [C: +2] httpbb: Enable --retry_on_timeout so intermittent latency doesn't alert [puppet] - https://gerrit.wikimedia.org/r/884388 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[22:11:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2030.codfw.wmnet with OS bullseye
[22:11:25] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2030.codfw.wmnet with OS bullseye completed: - cp2030 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu...
[22:13:12] RECOVERY - IPMI Sensor Status on mw2333 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[22:16:44] (CR) Cwhite: [C: +2] role, profile: remove elasticsearch role and supporting profile [puppet] - https://gerrit.wikimedia.org/r/879889 (owner: Cwhite)
[22:19:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:19:47] SRE, Traffic, Data Pipelines (Sprint 07): Document Impact of Jan 8&9 Traffic Data Loss - https://phabricator.wikimedia.org/T326658 (odimitrijevic) Pinging @KOfori @BBlack. Please see question above.
[22:20:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:22:06] PROBLEM - IPMI Sensor Status on mw2329 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[22:22:42] SRE, Traffic-Icebox: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (BCornwall) Open→Stalled @ayounsi which commit implemented this? I'm not seeing any reference anywhere
[22:25:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:32:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[22:35:30] PROBLEM - IPMI Sensor Status on mw2332 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[22:36:11] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=cdn
[22:36:12] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=ats-be
[22:36:44] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[22:37:37] volans, marostegui: do you know why rpl_semi_sync_master_wait_no_slave is 0 ?
[22:38:42] SRE, Traffic-Icebox: Make DNS operations resilient against predictable failures - https://phabricator.wikimedia.org/T239711 (BCornwall) Open→Stalled @BBlack This ticket is quite broad: Can we split any remaining actionable into sub-tickets? From what I'm understanding, new tickets could be: * Re...
[22:38:57] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp3053.esams.wmnet with OS bullseye
[22:39:03] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp3053.esams.wmnet with OS bullseye
[22:45:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:50:00] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3053.esams.wmnet with OS bullseye
[22:50:06] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp3053.esams.wmnet with OS bullseye executed with errors: - cp3053 (**FAIL**) - Downtimed on Icinga/Alertmanager -...
[22:55:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:58:58] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet
[23:00:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:04:43] SRE: Route users to closest bastion host based on IP geolocation - https://phabricator.wikimedia.org/T328361 (Dzahn) Here is a crude shell script from the past trying to solve this problem. No warranty :) https://people.wikimedia.org/~dzahn/bastion.sh.txt
[23:06:14] (PS1) Sbailey: Enable Linter write namespace, tag and template for group0 and group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/885046 (https://phabricator.wikimedia.org/T299612)
[23:07:05] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp3053.esams.wmnet with OS bullseye
[23:07:11] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp3053.esams.wmnet with OS bullseye
[23:09:04] (CR) Sbailey: "Group 0 went smoothly, onto group 1" [mediawiki-config] - https://gerrit.wikimedia.org/r/885046 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey)
[23:10:54] (PS1) Gergő Tisza: Document the '+' pattern for specifying wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/885048
[23:16:31] jouncebot: nowandnext
[23:16:31] For the next 0 hour(s) and 43 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230130T2200)
[23:16:32] In 3 hour(s) and 43 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230131T0300)
[23:16:53] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/885048 (owner: Gergő Tisza)
[23:21:44] (CR) Dzahn: [C: +2] etherpad: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884396 (https://phabricator.wikimedia.org/T327974) (owner: Dzahn)
[23:23:31] (PS1) Dzahn: etherpad: fix typo in blackbox::check class parameter name [puppet] - https://gerrit.wikimedia.org/r/885050
[23:26:07] (CR) Dzahn: [C: +2] etherpad: fix typo in blackbox::check class parameter name [puppet] - https://gerrit.wikimedia.org/r/885050 (owner: Dzahn)
[23:26:49] (PS1) Dreamy Jazz: Remove redundant definition of wgCheckUserEnableSpecialInvestigate [mediawiki-config] - https://gerrit.wikimedia.org/r/885051
[23:29:49] !log brett@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp3053.esams.wmnet with OS bullseye
[23:29:55] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp3053.esams.wmnet with OS bullseye executed with errors: - cp3053 (**FAIL**) - Removed from Puppet and PuppetDB if p...
[23:30:13] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp5027.eqsin.wmnet with OS bullseye
[23:30:21] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp5027.eqsin.wmnet with OS bullseye
[23:45:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:50:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable