[00:00:04] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:05:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:44:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:49:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:09:35] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T317804 (10wiki_willy) a:03Cmjohnson [01:10:16] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10wiki_willy) a:03Cmjohnson [01:22:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T314041)', diff saved to https://phabricator.wikimedia.org/P34972 and previous config saved to /var/cache/conftool/dbconfig/20220928-012205-ladsgroup.json [01:24:20] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 
[01:37:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P34973 and previous config saved to /var/cache/conftool/dbconfig/20220928-013711-ladsgroup.json [01:37:45] (JobUnavailable) firing: (7) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P34974 and previous config saved to /var/cache/conftool/dbconfig/20220928-015218-ladsgroup.json [02:07:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T314041)', diff saved to https://phabricator.wikimedia.org/P34975 and previous config saved to /var/cache/conftool/dbconfig/20220928-020724-ladsgroup.json [02:07:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [02:07:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T314041)', diff saved to https://phabricator.wikimedia.org/P34976 and previous config saved to /var/cache/conftool/dbconfig/20220928-020746-ladsgroup.json [02:09:37] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [02:16:17] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [02:18:22] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:26:51] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:26:51] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:26:51] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [02:26:55] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [02:28:43] here, looking [02:29:24] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:31:51] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:31:51] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:31:51] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [02:31:55] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [02:44:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:49:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:55:10] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:58:58] I'm unable to find out the cause in 
turnilo, calling it a night since it has recovered on its own. [03:22:42] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: session-c4122.scope,session-c4123.scope,session-c4124.scope,session-c4447.scope,user@113.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:29:50] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: session-c4122.scope,session-c4123.scope,session-c4124.scope,session-c4447.scope,user@113.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:04] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:31:51] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [03:31:51] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [03:33:38] (03CR) 10KartikMistry: [C: 03+1] Update Translate job names [deployment-charts] - 10https://gerrit.wikimedia.org/r/835589 (https://phabricator.wikimedia.org/T318484) (owner: 10Nikerabbit) [03:34:26] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: session-c4122.scope,session-c4123.scope,session-c4124.scope,session-c4447.scope,user@113.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:36:51] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [03:36:51] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [03:37:04] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: search-drop-query-clicks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:45:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T314041)', diff saved to https://phabricator.wikimedia.org/P34977 and previous config saved to /var/cache/conftool/dbconfig/20220928-034511-ladsgroup.json [03:45:16] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [03:46:20] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: session-c4122.scope,session-c4123.scope,session-c4124.scope,session-c4447.scope,user@113.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:50:08] (03PS2) 10KartikMistry: testwiki: Enable Section Translation for Bambara and Goan Konkani Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835606 (https://phabricator.wikimedia.org/T314557) [03:55:42] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-urbanecm-singleuser.service,session-c4122.scope,session-c4123.scope,session-c4124.scope,session-c4447.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:00:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P34978 and previous config saved to /var/cache/conftool/dbconfig/20220928-040017-ladsgroup.json [04:02:44] PROBLEM - Check 
systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-echetty-singleuser.service,jupyter-urbanecm-singleuser.service,session-c4122.scope,session-c4123.scope,session-c4124.scope,session-c4447.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:15:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P34979 and previous config saved to /var/cache/conftool/dbconfig/20220928-041524-ladsgroup.json [04:20:32] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:26:08] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-echetty-singleuser.service,jupyter-urbanecm-singleuser.service,session-c4122.scope,session-c4123.scope,session-c4124.scope,session-c4447.scope,session-c4449.scope,user@113.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T314041)', diff saved to https://phabricator.wikimedia.org/P34980 and previous config saved to /var/cache/conftool/dbconfig/20220928-043030-ladsgroup.json [04:30:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [04:30:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [04:30:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T314041)', diff saved to https://phabricator.wikimedia.org/P34981 and previous config saved to /var/cache/conftool/dbconfig/20220928-043052-ladsgroup.json [04:32:45] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [04:38:17] 
T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [04:40:08] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:54:12] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-echetty-singleuser.service,jupyter-urbanecm-singleuser.service,session-c4122.scope,session-c4123.scope,session-c4124.scope,session-c4447.scope,session-c4449.scope,session-c4450.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:08] (03CR) 10Giuseppe Lavagetto: [C: 03+2] api appserver: convert to php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/829552 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [05:10:37] <_joe_> jouncebot: now [05:10:37] No deployments scheduled for the next 1 hour(s) and 49 minute(s) [05:10:43] <_joe_> jouncebot: next [05:10:43] In 1 hour(s) and 49 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220928T0700) [05:10:56] <_joe_> ok let's do this well before BACON [05:15:16] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-echetty-singleuser.service,jupyter-urbanecm-singleuser.service,session-c4122.scope,session-c4123.scope,session-c4124.scope,session-c4447.scope,session-c4449.scope,session-c4450.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:19:54] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-echetty-singleuser.service,jupyter-urbanecm-singleuser.service,session-c4122.scope,session-c4123.scope,session-c4124.scope,session-c4447.scope,session-c4449.scope,session-c4450.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:21:44] RECOVERY - SSH on mw1307.mgmt is 
OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:33:26] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:41:51] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [05:41:52] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [05:42:10] looking [05:46:51] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [05:46:52] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [05:51:51] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [05:51:52] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [05:51:52] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org 
recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [05:51:56] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [05:52:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:57:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [06:06:46] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [06:06:46] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [06:06:46] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [06:06:51] (Primary inbound port utilisation over 80% #page) firing: Alert for device 
cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [06:07:15] (03CR) 10Giuseppe Lavagetto: [C: 03+2] appserver: convert to php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/829553 (owner: 10Giuseppe Lavagetto) [06:07:30] <_joe_> hey ho, let's go [06:08:11] o/ [06:08:24] nice way to start the day... 8m in my shift, page [06:09:14] <_joe_> the port utilization? [06:09:19] yup [06:09:24] <_joe_> arzhel and I were taking a look already [06:11:46] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [06:11:46] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [06:11:46] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [06:11:51] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [06:19:10] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.6-bullseye: Upgrade 10.6.10 [software] - 10https://gerrit.wikimedia.org/r/835508 (https://phabricator.wikimedia.org/T318128) (owner: 10Marostegui) [06:19:45] (03Merged) 10jenkins-bot: control-mariadb-10.6-bullseye: Upgrade 10.6.10 [software] - 10https://gerrit.wikimedia.org/r/835508 
(https://phabricator.wikimedia.org/T318128) (owner: 10Marostegui) [06:34:32] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:34:44] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:35:10] (03PS5) 10Giuseppe Lavagetto: C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines [puppet] - 10https://gerrit.wikimedia.org/r/832268 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [06:35:12] (03PS1) 10Giuseppe Lavagetto: cache::upload: rate-limit requests from aws for datasets [puppet] - 10https://gerrit.wikimedia.org/r/836093 [06:35:38] (03CR) 10Elukey: [C: 03+1] Use p95 instead of p99 for KubernetesAPILatency alerts [alerts] - 10https://gerrit.wikimedia.org/r/835637 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [06:37:22] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [06:38:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [06:40:06] (03CR) 10Hashar: [C: 03+1] "I do not know anything about reprepro distributions and updates configuration. 
If it is good for Moritz it is good to me :)" [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [06:40:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [06:45:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [06:51:09] (03PS1) 10Giuseppe Lavagetto: deployment_server: switch to php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/836094 (https://phabricator.wikimedia.org/T271736) [06:54:59] !incidents [06:54:59] You're not allowed to perform this action. [06:55:16] _joe_: ^ [06:55:34] <_joe_> XioNoX: uhm [07:00:04] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220928T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:10] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:24] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:36] * kart_ is here [07:01:12] Going ahead with deployment.. [07:01:25] (03CR) 10KartikMistry: [C: 03+2] testwiki: Enable Section Translation for Bambara and Goan Konkani Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835606 (https://phabricator.wikimedia.org/T314557) (owner: 10KartikMistry) [07:02:14] (03Merged) 10jenkins-bot: testwiki: Enable Section Translation for Bambara and Goan Konkani Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835606 (https://phabricator.wikimedia.org/T314557) (owner: 10KartikMistry) [07:02:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835606 (https://phabricator.wikimedia.org/T314557) (owner: 10KartikMistry) [07:03:24] !log kartik@deploy1002 Started scap: Backport for [[gerrit:835606|testwiki: Enable Section Translation for Bambara and Goan Konkani Wikipedias (T314557)]] [07:03:28] T314557: Enable Content and Section translation on wikipedias with new MT support from Google for languages once it is working - https://phabricator.wikimedia.org/T314557 [07:03:49] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:835606|testwiki: Enable Section Translation for Bambara and Goan Konkani Wikipedias (T314557)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [07:05:15] Used `scap backport` \0/ [07:07:01] (NELHigh) firing: Elevated Network Error Logging events 
(tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [07:07:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:08:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:08:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:08:42] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:835606|testwiki: Enable Section Translation for Bambara and Goan Konkani Wikipedias (T314557)]] (duration: 05m 17s) [07:08:45] T314557: Enable Content and Section translation on wikipedias with new MT support from Google for languages once it is working - https://phabricator.wikimedia.org/T314557 [07:09:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:30:24] !log disable BGP to init7 in knams [07:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:56] (03PS1) 10Elukey: admin_ng: use fqdn for ml-serve's knative serving settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/836099 (https://phabricator.wikimedia.org/T313915) [07:36:30] (03CR) 10Klausman: [C: 03+1] admin_ng: use fqdn for ml-serve's knative serving settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/836099 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [07:37:01] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [07:37:25] XioNoX: worked! 
[07:37:29] alright [07:37:36] I'll shoot them an email [07:39:06] nice :) [07:39:46] (03CR) 10Elukey: [C: 03+2] admin_ng: use fqdn for ml-serve's knative serving settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/836099 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [07:40:49] (03CR) 10Hashar: [C: 03+1] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/835667 (owner: 10PipelineBot) [07:44:55] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [07:45:07] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [07:48:16] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:50:36] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:51:30] (03CR) 10JMeybohm: [C: 04-1] "Envoy metrics are not scraped by the job k8s-pods but k8s-pods-tls (which does not have ideal naming, I do see that 😊 - feel free to add a" [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: 10Bking) [07:51:47] (03PS1) 10Elukey: Revert "admin_ng: use fqdn for ml-serve's knative serving settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/835604 [07:52:54] (03CR) 10Klausman: [C: 03+1] Revert "admin_ng: use fqdn for ml-serve's knative serving settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/835604 (owner: 10Elukey) [07:56:16] (03CR) 10Elukey: [C: 03+2] Revert "admin_ng: use fqdn for ml-serve's knative serving settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/835604 (owner: 10Elukey) [07:58:33] !log elukey@deploy1002 helmfile [ml-staging-codfw] START 
helmfile.d/admin 'sync'. [07:58:44] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:00:04] brennen and jnuche: Your horoscope predicts another unfortunate MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220928T0800). [08:00:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T314041)', diff saved to https://phabricator.wikimedia.org/P34984 and previous config saved to /var/cache/conftool/dbconfig/20220928-080015-ladsgroup.json [08:00:20] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [08:02:02] (03PS1) 10Filippo Giunchedi: Anchor /metrics to URL start [puppet] - 10https://gerrit.wikimedia.org/r/836100 (https://phabricator.wikimedia.org/T309703) [08:02:42] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:03:44] seeking a kind soul for a quick review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/836100 [08:05:09] godog: no apache expert, but LocationMatch probably would need a matching closing locationmatch ? [08:05:28] (line 12) [08:05:47] doh! 
of course, thank you jynus [08:06:11] "and this is why we review" not enough caffeine [08:06:27] (03PS2) 10Filippo Giunchedi: Anchor /metrics to URL start [puppet] - 10https://gerrit.wikimedia.org/r/836100 (https://phabricator.wikimedia.org/T309703) [08:10:04] godog: if that is metrics-only I think it should be safe to merge, if a common template to other apaches I would wait for service ops review [08:10:16] I can +1 if the first [08:10:57] yeah it is used in grafana only atm [08:11:06] (03CR) 10Jcrespo: [C: 03+1] Anchor /metrics to URL start [puppet] - 10https://gerrit.wikimedia.org/r/836100 (https://phabricator.wikimedia.org/T309703) (owner: 10Filippo Giunchedi) [08:11:07] i.e. I introduced yesterday [08:11:13] thanks! appreciate it [08:11:18] sorry, I didn't have much context [08:11:18] (03CR) 10Filippo Giunchedi: [C: 03+2] Anchor /metrics to URL start [puppet] - 10https://gerrit.wikimedia.org/r/836100 (https://phabricator.wikimedia.org/T309703) (owner: 10Filippo Giunchedi) [08:11:37] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: disable email notifications on replicas [puppet] - 10https://gerrit.wikimedia.org/r/835581 (https://phabricator.wikimedia.org/T318682) (owner: 10Jelto) [08:11:51] that's 100% okay, thanks for reviewing/helping [08:12:38] (03PS1) 10Elukey: kserve-inference: set ndots 2 in isvc configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/836102 (https://phabricator.wikimedia.org/T313915) [08:13:33] (03CR) 10CI reject: [V: 04-1] kserve-inference: set ndots 2 in isvc configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/836102 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [08:15:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P34985 and previous config saved to /var/cache/conftool/dbconfig/20220928-081522-ladsgroup.json [08:19:05] (03PS2) 10Elukey: kserve-inference: set ndots 2 in isvc configs [deployment-charts] - 
10https://gerrit.wikimedia.org/r/836102 (https://phabricator.wikimedia.org/T313915) [08:26:58] (03CR) 10Klausman: [C: 03+1] kserve-inference: set ndots 2 in isvc configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/836102 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [08:27:15] (03CR) 10Elukey: [C: 03+2] kserve-inference: set ndots 2 in isvc configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/836102 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [08:29:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [08:30:27] (03PS1) 10Filippo Giunchedi: grafana: limit /metrics ACL to grafana vhost only [puppet] - 10https://gerrit.wikimedia.org/r/836104 (https://phabricator.wikimedia.org/T309703) [08:30:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P34987 and previous config saved to /var/cache/conftool/dbconfig/20220928-083029-ladsgroup.json [08:34:40] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [08:35:23] (03PS2) 10Filippo Giunchedi: grafana: limit /metrics ACL to grafana vhost only [puppet] - 10https://gerrit.wikimedia.org/r/836104 (https://phabricator.wikimedia.org/T309703) [08:35:43] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [08:36:30] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[08:36:38] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37373/console" [puppet] - 10https://gerrit.wikimedia.org/r/836104 (https://phabricator.wikimedia.org/T309703) (owner: 10Filippo Giunchedi) [08:37:04] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [08:38:24] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [08:39:26] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [08:40:13] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [08:40:29] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] grafana: limit /metrics ACL to grafana vhost only [puppet] - 10https://gerrit.wikimedia.org/r/836104 (https://phabricator.wikimedia.org/T309703) (owner: 10Filippo Giunchedi) [08:40:30] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[08:40:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:42:08] (03CR) 10Arturo Borrero Gonzalez: ceph.bootstrap_and_add: fix _wait_for_osds (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [08:44:19] (03CR) 10David Caro: [C: 03+1] ceph.bootstrap_and_add: fix _wait_for_osds (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [08:45:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T314041)', diff saved to https://phabricator.wikimedia.org/P34988 and previous config saved to /var/cache/conftool/dbconfig/20220928-084535-ladsgroup.json [08:45:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [08:45:40] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [08:45:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [08:45:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T314041)', diff saved to https://phabricator.wikimedia.org/P34989 and previous config saved to /var/cache/conftool/dbconfig/20220928-084557-ladsgroup.json [08:49:51] !log disable puppet on cache servers to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/832268 [08:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw -
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:52:34] (03PS1) 10Volans: redfish: use the management IP instead of FQDN [software/spicerack] - 10https://gerrit.wikimedia.org/r/836127 (https://phabricator.wikimedia.org/T313979) [08:52:36] (03CR) 10Jbond: C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832268 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [08:52:41] (03CR) 10Jbond: [C: 03+2] C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines [puppet] - 10https://gerrit.wikimedia.org/r/832268 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [08:52:50] (03PS1) 10Volans: redfish-based cookbooks: adapt to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/836128 (https://phabricator.wikimedia.org/T313979) [09:02:02] (03CR) 10FNegri: [C: 03+1] "LGTM. Do you expect that after this patch is merged I will be able to run the `wmcs.toolforge.tests` cookbook successfully?" 
[puppet] - 10https://gerrit.wikimedia.org/r/835612 (https://phabricator.wikimedia.org/T275864) (owner: 10Arturo Borrero Gonzalez) [09:03:56] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:04:30] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 150, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:04:53] (03CR) 10Muehlenhoff: aptrepo: add docker packages to thirdparty/ci for bullseye (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [09:04:58] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 307, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:09:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated-tests: remove references to Debian Stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/835612 (https://phabricator.wikimedia.org/T275864) (owner: 10Arturo Borrero Gonzalez) [09:11:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 59689 [09:11:27] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 59689 [09:12:41] (03CR) 10Jbond: [C: 03+2] C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832268 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [09:18:52] (03PS1) 10Ayounsi: sre.network.peering: fix email footer indentation [cookbooks] - 10https://gerrit.wikimedia.org/r/836130 [09:19:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:21:03] (03PS2) 10Ayounsi: sre.network.peering: fix email footer indentation [cookbooks] - 10https://gerrit.wikimedia.org/r/836130 [09:21:44] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. [09:24:03] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10jbond) >>! In T252807#8238563, @Volans wrote: > Other things that should be verified/done: > * ensure that hosts that are already depooled (either with `no... [09:24:08] 10SRE, 10conftool, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009 (10Volans) For reference, `dbctl` does the latter, see for example: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/conftool/+/refs/heads/ma... [09:24:29] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/836130 (owner: 10Ayounsi) [09:24:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:25:43] (03CR) 10Ayounsi: [C: 03+2] sre.network.peering: fix email footer indentation [cookbooks] - 10https://gerrit.wikimedia.org/r/836130 (owner: 10Ayounsi) [09:25:45] (03CR) 10Clément Goubert: [C: 04-1] "Can you please revert the whitespace/formatting changes and submit them as a different change if they are needed? 
As it is, they make the " [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: 10Bking) [09:26:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [09:28:11] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10Volans) >>! In T252807#8267630, @jbond wrote: >>>! In T252807#8238563, @Volans wrote: >> Other things that should be verified/done: >> * ensure that hosts... [09:29:26] (03Merged) 10jenkins-bot: sre.network.peering: fix email footer indentation [cookbooks] - 10https://gerrit.wikimedia.org/r/836130 (owner: 10Ayounsi) [09:31:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [09:33:38] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Investigate converting LBRemoteCluster cookbooks to SRELBBatchRunnerBase - https://phabricator.wikimedia.org/T318787 (10jbond) [09:33:55] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Investigate converting LBRemoteCluster cookbooks to SRELBBatchRunnerBase - https://phabricator.wikimedia.org/T318787 (10jbond) p:05Triage→03Medium [09:36:04] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10jbond) >>! In T252807#8267637, @Volans wrote: >>>! 
In T252807#8267630, @jbond wrote: >>>>! In T252807#8238563, @Volans wrote: >>> Other things that should... [09:38:34] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:42:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph.bootstrap_and_add: fix _wait_for_osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [09:42:36] (03CR) 10Jbond: [C: 03+1] "lgtm and previous Cr now merged" [puppet] - 10https://gerrit.wikimedia.org/r/836093 (owner: 10Giuseppe Lavagetto) [09:47:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [09:47:37] (03PS1) 10Clément Goubert: pontoon: move sops-appservers stack to appservers [puppet] - 10https://gerrit.wikimedia.org/r/836132 [09:49:42] (03Abandoned) 10Clément Goubert: pontoon: move sops-appservers stack to appservers [puppet] - 10https://gerrit.wikimedia.org/r/836132 (owner: 10Clément Goubert) [09:50:00] (03PS1) 10Clément Goubert: pontoon: move sops-appservers stack to appservers [puppet] - 10https://gerrit.wikimedia.org/r/836133 [09:52:15] (03PS2) 10Clément Goubert: pontoon: move sops-appservers stack to appservers [puppet] - 10https://gerrit.wikimedia.org/r/836133 (https://phabricator.wikimedia.org/T318671) [09:52:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable 
[09:53:53] 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (10Volans) I think that the only patch left to be merged is https://gerrit.wikimedia.org/r/c/operations/dns/+/793728, pe... [09:55:05] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] haproxy: use haproxy24 component [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/832235 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [10:00:17] (03CR) 10Ladsgroup: "If Hugh is happy with it, I can deploy it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/835589 (https://phabricator.wikimedia.org/T318484) (owner: 10Nikerabbit) [10:00:41] (03PS5) 10Ladsgroup: Enable Linter write of namespace tag and template fields on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: 10Sbailey) [10:00:44] (03CR) 10Ladsgroup: [C: 03+2] Enable Linter write of namespace tag and template fields on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: 10Sbailey) [10:01:34] (03Merged) 10jenkins-bot: Enable Linter write of namespace tag and template fields on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: 10Sbailey) [10:04:18] (03CR) 10Jbond: [V: 03+1 C: 03+1] lvs: Convert ::lvs::configuration to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: 10BCornwall) [10:05:38] (03PS1) 10Jelto: docker_registry_ha: pass jwt_allowed_ips to docker_registry_ha::web [puppet] - 10https://gerrit.wikimedia.org/r/836135 (https://phabricator.wikimedia.org/T308501) [10:06:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:07:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [10:07:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:08:04] (03CR) 10Hnowlan: [C: 03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/835589 (https://phabricator.wikimedia.org/T318484) (owner: 10Nikerabbit) [10:08:16] (03CR) 10Hnowlan: [C: 03+1] Update Translate job names (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/835589 (https://phabricator.wikimedia.org/T318484) (owner: 10Nikerabbit) [10:08:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:09:25] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37374/console" [puppet] - 10https://gerrit.wikimedia.org/r/836135 (https://phabricator.wikimedia.org/T308501) (owner: 10Jelto) [10:09:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [10:09:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37375/console" [puppet] - 10https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749) (owner: 10JHathaway) [10:10:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T314041)', diff saved to https://phabricator.wikimedia.org/P34990 and previous config saved to /var/cache/conftool/dbconfig/20220928-101001-ladsgroup.json [10:10:06] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:11:10] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' 
. [10:11:14] some spam incoming, sorry :) [10:12:15] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:12:54] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (22) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1031, cloudcephosd1033, cloudcephosd1034, clouddumps1001, clouddumps1002, labstore1006, labstore1007, phab1004, releases1002, releases2002 https://wikitech.wikimed [10:12:54] iki/Puppet%23check_puppet_run_changes [10:13:03] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:13:17] (03PS2) 10Jelto: docker_registry_ha: pass jwt_allowed_ips to docker_registry_ha::web [puppet] - 10https://gerrit.wikimedia.org/r/836135 (https://phabricator.wikimedia.org/T308501) [10:13:54] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (22) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1031, cloudcephosd1033, cloudcephosd1034, clouddumps1001, clouddumps1002, labstore1006, labstore1007, phab1004, releases1002, releases2002 https://wikitech.wikimed [10:13:54] iki/Puppet%23check_puppet_run_changes [10:13:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:14:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - 
https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [10:15:07] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [10:16:42] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37376/console" [puppet] - 10https://gerrit.wikimedia.org/r/836135 (https://phabricator.wikimedia.org/T308501) (owner: 10Jelto) [10:16:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:17:02] (03CR) 10Jbond: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749) (owner: 10JHathaway) [10:17:24] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [10:18:14] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [10:18:15] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Investigate converting LBRemoteCluster cookbooks to SRELBBatchRunnerBase - https://phabricator.wikimedia.org/T318787 (10MoritzMuehlenhoff) There's some overlap with T317855 as well [10:19:45] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[10:21:09] (03CR) 10David Caro: "Looks ok, just note to use the other module instead 😊" [puppet] - 10https://gerrit.wikimedia.org/r/835596 (https://phabricator.wikimedia.org/T222349) (owner: 10Ryan Kemper) [10:22:36] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/832646 (https://phabricator.wikimedia.org/T317987) (owner: 10Vivian Rook) [10:25:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P34992 and previous config saved to /var/cache/conftool/dbconfig/20220928-102508-ladsgroup.json [10:26:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] docker_registry_ha: pass jwt_allowed_ips to docker_registry_ha::web [puppet] - 10https://gerrit.wikimedia.org/r/836135 (https://phabricator.wikimedia.org/T308501) (owner: 10Jelto) [10:26:15] (03PS1) 10Jelto: docker_registry_ha: add codfw Trusted Runners to jwt_allowed_ips [puppet] - 10https://gerrit.wikimedia.org/r/836139 (https://phabricator.wikimedia.org/T308501) [10:26:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:27:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111 db1137 db1168 db1143 db1132 db1127 es1022 for mariadb upgrade T318128', diff saved to https://phabricator.wikimedia.org/P34993 and previous config saved to /var/cache/conftool/dbconfig/20220928-102759-root.json [10:28:03] T318128: Compile and install MariaDB 10.6.10 - https://phabricator.wikimedia.org/T318128 [10:28:28] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[10:28:31] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37377/console" [puppet] - 10https://gerrit.wikimedia.org/r/836139 (https://phabricator.wikimedia.org/T308501) (owner: 10Jelto) [10:29:38] (03PS2) 10Muehlenhoff: Ship WMF-specific systemd unit parts as systemd override [puppet] - 10https://gerrit.wikimedia.org/r/832632 (https://phabricator.wikimedia.org/T317746) [10:29:45] (03CR) 10Muehlenhoff: Ship WMF-specific systemd unit parts as systemd override (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/832632 (https://phabricator.wikimedia.org/T317746) (owner: 10Muehlenhoff) [10:30:19] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [10:30:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [10:31:14] btullis: o/ we may need to tune a little these alerts --^ [10:31:34] otherwise a regular roll restart spams IRC [10:32:10] Hmm, yeah. Good point, thanks elukey. [10:32:55] btullis: not a big deal but before I saw the cookbook roll restart msg in the SAL I thought that jumbo was having issues [10:34:39] Agreed. This is the first roll-reboot since the alerts were moved to alertmanager, I believe. 
[10:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:35:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [10:38:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34994 and previous config saved to /var/cache/conftool/dbconfig/20220928-103810-root.json [10:38:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34995 and previous config saved to /var/cache/conftool/dbconfig/20220928-103815-root.json [10:38:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34996 and previous config saved to /var/cache/conftool/dbconfig/20220928-103822-root.json [10:38:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34997 and previous config saved to /var/cache/conftool/dbconfig/20220928-103827-root.json [10:38:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34998 and previous config saved to /var/cache/conftool/dbconfig/20220928-103835-root.json [10:38:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 1%: After upgrade', diff saved to 
https://phabricator.wikimedia.org/P34999 and previous config saved to /var/cache/conftool/dbconfig/20220928-103840-root.json [10:38:45] (03CR) 10David Caro: Modify maintain-dbusers.py to call the rest-api service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [10:38:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35000 and previous config saved to /var/cache/conftool/dbconfig/20220928-103847-root.json [10:39:00] (03CR) 10David Caro: "btw. We can pair on this if you want for some time to get it going" [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [10:39:19] (03CR) 10Arturo Borrero Gonzalez: "We may need to review the hiera config before merging this patch." [puppet] - 10https://gerrit.wikimedia.org/r/835657 (https://phabricator.wikimedia.org/T316284) (owner: 10Andrew Bogott) [10:39:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:40:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P35001 and previous config saved to /var/cache/conftool/dbconfig/20220928-104014-ladsgroup.json [10:41:54] (03PS1) 10Marostegui: control-mariadb-client-10.6-bullseye: Upgrade version [software] - 10https://gerrit.wikimedia.org/r/836140 (https://phabricator.wikimedia.org/T318128) [10:43:36] (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-10.6-bullseye: Upgrade version [software] - 10https://gerrit.wikimedia.org/r/836140 (https://phabricator.wikimedia.org/T318128) (owner: 
10Marostegui) [10:43:41] (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated-tests: even more cleanups [puppet] - 10https://gerrit.wikimedia.org/r/836141 [10:43:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:44:13] (03Merged) 10jenkins-bot: control-mariadb-client-10.6-bullseye: Upgrade version [software] - 10https://gerrit.wikimedia.org/r/836140 (https://phabricator.wikimedia.org/T318128) (owner: 10Marostegui) [10:48:17] (03PS2) 10FNegri: ceph.bootstrap_and_add: fix _wait_for_osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) [10:48:36] (03CR) 10FNegri: ceph.bootstrap_and_add: fix _wait_for_osds (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [10:49:35] (03CR) 10FNegri: [C: 03+1] toolforge: automated-tests: even more cleanups [puppet] - 10https://gerrit.wikimedia.org/r/836141 (owner: 10Arturo Borrero Gonzalez) [10:50:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated-tests: even more cleanups [puppet] - 10https://gerrit.wikimedia.org/r/836141 (owner: 10Arturo Borrero Gonzalez) [10:51:33] (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: fix _wait_for_osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [10:52:32] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.2.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [10:52:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [10:53:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35002 and previous config saved to /var/cache/conftool/dbconfig/20220928-105315-root.json [10:53:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35003 and previous config saved to /var/cache/conftool/dbconfig/20220928-105320-root.json [10:53:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35004 and previous config saved to /var/cache/conftool/dbconfig/20220928-105327-root.json [10:53:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35005 and previous config saved to /var/cache/conftool/dbconfig/20220928-105332-root.json [10:53:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35006 and previous config saved to /var/cache/conftool/dbconfig/20220928-105340-root.json [10:53:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35007 and previous config saved to /var/cache/conftool/dbconfig/20220928-105345-root.json 
[10:53:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35008 and previous config saved to /var/cache/conftool/dbconfig/20220928-105351-root.json [10:54:50] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [10:54:58] (03PS2) 10Giuseppe Lavagetto: deployment_server: switch to php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/836094 (https://phabricator.wikimedia.org/T271736) [10:55:00] (03PS1) 10Giuseppe Lavagetto: cloudweb: install php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/836143 (https://phabricator.wikimedia.org/T271736) [10:55:03] (03PS1) 10Giuseppe Lavagetto: wikitech: switch to php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/836144 (https://phabricator.wikimedia.org/T271736) [10:55:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T314041)', diff saved to https://phabricator.wikimedia.org/P35009 and previous config saved to /var/cache/conftool/dbconfig/20220928-105520-ladsgroup.json [10:55:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [10:55:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [10:55:25] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:55:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T314041)', diff saved to https://phabricator.wikimedia.org/P35010 and previous config saved to 
/var/cache/conftool/dbconfig/20220928-105531-ladsgroup.json [10:57:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [10:57:43] (03PS1) 10Jbond: spdx: ensure we also check for profile/role sub modules [puppet] - 10https://gerrit.wikimedia.org/r/836145 [10:57:45] (03PS1) 10Jbond: spdx: correct spelling mistake for unsigned_contibutors [puppet] - 10https://gerrit.wikimedia.org/r/836166 [11:00:27] (03CR) 10Jbond: "thanks for this and sorry that i forgot 😊. however see inline we are still missing a bit. i created https://gerrit.wikimedia.org/r/c/ope" [puppet] - 10https://gerrit.wikimedia.org/r/835636 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:04:03] (03PS2) 10Muehlenhoff: New cookbook to roll-restart/reboot Thanos frontends [cookbooks] - 10https://gerrit.wikimedia.org/r/835565 [11:05:57] (03CR) 10Muehlenhoff: New cookbook to roll-restart/reboot Thanos frontends (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/835565 (owner: 10Muehlenhoff) [11:06:20] (03CR) 10Muehlenhoff: New cookbook to roll-restart/reboot Thanos frontends (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/835565 (owner: 10Muehlenhoff) [11:08:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35011 and previous config saved to /var/cache/conftool/dbconfig/20220928-110821-root.json [11:08:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35012 and previous config saved to /var/cache/conftool/dbconfig/20220928-110825-root.json [11:08:29] (03PS3) 10FNegri: 
ceph.bootstrap_and_add: fix _wait_for_osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) [11:08:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35013 and previous config saved to /var/cache/conftool/dbconfig/20220928-110832-root.json [11:08:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35014 and previous config saved to /var/cache/conftool/dbconfig/20220928-110839-root.json [11:08:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35015 and previous config saved to /var/cache/conftool/dbconfig/20220928-110846-root.json [11:08:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35016 and previous config saved to /var/cache/conftool/dbconfig/20220928-110850-root.json [11:08:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35017 and previous config saved to /var/cache/conftool/dbconfig/20220928-110856-root.json [11:09:46] (03CR) 10FNegri: ceph.bootstrap_and_add: fix _wait_for_osds (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [11:13:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [11:18:20] !log installing expat security updates [11:18:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [11:23:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35018 and previous config saved to /var/cache/conftool/dbconfig/20220928-112326-root.json [11:23:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35019 and previous config saved to /var/cache/conftool/dbconfig/20220928-112330-root.json [11:23:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35020 and previous config saved to /var/cache/conftool/dbconfig/20220928-112337-root.json [11:23:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35021 and previous config saved to /var/cache/conftool/dbconfig/20220928-112344-root.json [11:23:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph.bootstrap_and_add: fix _wait_for_osds (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [11:23:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35022 and previous config saved to /var/cache/conftool/dbconfig/20220928-112351-root.json [11:23:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: After upgrade', diff saved to 
https://phabricator.wikimedia.org/P35023 and previous config saved to /var/cache/conftool/dbconfig/20220928-112355-root.json [11:24:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35024 and previous config saved to /var/cache/conftool/dbconfig/20220928-112401-root.json [11:24:43] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) Is there a sever we can use to test on also can you please describe exactly how you want the... [11:31:12] (03PS1) 10Muehlenhoff: Add Cumin alias for ML staging hosts [puppet] - 10https://gerrit.wikimedia.org/r/836175 [11:32:29] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/836127 (https://phabricator.wikimedia.org/T313979) (owner: 10Volans) [11:35:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [11:38:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35025 and previous config saved to /var/cache/conftool/dbconfig/20220928-113831-root.json [11:38:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35026 and previous config saved to /var/cache/conftool/dbconfig/20220928-113835-root.json [11:38:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35027 and previous 
config saved to /var/cache/conftool/dbconfig/20220928-113842-root.json [11:38:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35028 and previous config saved to /var/cache/conftool/dbconfig/20220928-113849-root.json [11:38:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35029 and previous config saved to /var/cache/conftool/dbconfig/20220928-113856-root.json [11:39:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35030 and previous config saved to /var/cache/conftool/dbconfig/20220928-113900-root.json [11:39:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:39:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35031 and previous config saved to /var/cache/conftool/dbconfig/20220928-113906-root.json [11:40:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [11:41:16] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.225 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:47:53] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): The python-build images regenerate wheels even when matching ones are already available - 
https://phabricator.wikimedia.org/T259611 (10hashar) 05Declined→03Open [11:48:25] (03Restored) 10Hashar: Add basic doc for python-build* images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605649 (owner: 10Hashar) [11:48:33] (03PS5) 10Hashar: Add basic doc for python-build* images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605649 [11:49:22] (03CR) 10Hashar: "Done later on by ca26af20e555d2e56b6effe5742ff00072291da5" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619759 (owner: 10Hashar) [11:53:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35032 and previous config saved to /var/cache/conftool/dbconfig/20220928-115336-root.json [11:53:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35033 and previous config saved to /var/cache/conftool/dbconfig/20220928-115340-root.json [11:53:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35034 and previous config saved to /var/cache/conftool/dbconfig/20220928-115347-root.json [11:53:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35035 and previous config saved to /var/cache/conftool/dbconfig/20220928-115354-root.json [11:54:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35036 and previous config saved to /var/cache/conftool/dbconfig/20220928-115401-root.json [11:54:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35037 and previous config saved to 
/var/cache/conftool/dbconfig/20220928-115404-root.json [11:54:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35038 and previous config saved to /var/cache/conftool/dbconfig/20220928-115411-root.json [11:56:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [11:58:54] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wdqs-all [11:59:32] (03PS7) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [12:00:25] (03CR) 10CI reject: [V: 04-1] thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:01:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [12:07:11] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10Traffic: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (10CDanis) [12:08:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wdqs-all [12:08:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35039 and previous config saved to 
/var/cache/conftool/dbconfig/20220928-120841-root.json [12:08:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35040 and previous config saved to /var/cache/conftool/dbconfig/20220928-120845-root.json [12:08:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35041 and previous config saved to /var/cache/conftool/dbconfig/20220928-120852-root.json [12:08:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35042 and previous config saved to /var/cache/conftool/dbconfig/20220928-120858-root.json [12:09:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35043 and previous config saved to /var/cache/conftool/dbconfig/20220928-120906-root.json [12:09:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35044 and previous config saved to /var/cache/conftool/dbconfig/20220928-120909-root.json [12:09:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35045 and previous config saved to /var/cache/conftool/dbconfig/20220928-120916-root.json [12:09:42] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wcqs-public [12:11:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wcqs-public [12:15:35] (03PS1) 10Muehlenhoff: New cookbook to roll-restart (or roll-reboot) the eventschemas cluster(s) [cookbooks] - 10https://gerrit.wikimedia.org/r/836181 [12:15:46] (03PS2) 10Muehlenhoff: New cookbook to roll-restart (or roll-reboot) the 
eventschemas cluster(s) [cookbooks] - 10https://gerrit.wikimedia.org/r/836181 [12:18:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [12:18:41] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/835565 (owner: 10Muehlenhoff) [12:18:55] (03Restored) 10Hashar: python-build: reuse previously built wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [12:19:03] (03CR) 10Hashar: python-build: reuse previously built wheels (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [12:19:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2180 db2146 db2122 es2022 for mariadb upgrade T318128', diff saved to https://phabricator.wikimedia.org/P35046 and previous config saved to /var/cache/conftool/dbconfig/20220928-121912-root.json [12:19:17] T318128: Compile and install MariaDB 10.6.10 - https://phabricator.wikimedia.org/T318128 [12:21:08] !log copying wmf-elasticsearh-search-plugins from bullseye to buster (`reprepro -C elastic710 buster-wikimedia bullseye-wikimedia wmf-elasticsearch-search-plugins`) [12:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:44] !log re-enable Init7 in knams [12:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:05] !log above reprepro copy failed, elastic710 component does not exist yet [12:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool 
db1132', diff saved to https://phabricator.wikimedia.org/P35047 and previous config saved to /var/cache/conftool/dbconfig/20220928-122321-root.json [12:23:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [12:23:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35048 and previous config saved to /var/cache/conftool/dbconfig/20220928-122346-root.json [12:23:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35049 and previous config saved to /var/cache/conftool/dbconfig/20220928-122350-root.json [12:23:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35050 and previous config saved to /var/cache/conftool/dbconfig/20220928-122356-root.json [12:24:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35051 and previous config saved to /var/cache/conftool/dbconfig/20220928-122403-root.json [12:24:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35052 and previous config saved to /var/cache/conftool/dbconfig/20220928-122411-root.json [12:24:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35053 and previous config saved to /var/cache/conftool/dbconfig/20220928-122414-root.json [12:24:16] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35054 and previous config saved to /var/cache/conftool/dbconfig/20220928-122415-root.json [12:24:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35055 and previous config saved to /var/cache/conftool/dbconfig/20220928-122421-root.json [12:24:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35056 and previous config saved to /var/cache/conftool/dbconfig/20220928-122422-root.json [12:24:23] !log copying wmf-elasticsearh-search-plugins from bullseye to buster (`reprepro -C thirdparty/elastic710 copy buster-wikimedia bullseye-wikimedia wmf-elasticsearch-search-plugins`) [12:24:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35057 and previous config saved to /var/cache/conftool/dbconfig/20220928-122427-root.json [12:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35058 and previous config saved to /var/cache/conftool/dbconfig/20220928-122432-root.json [12:26:30] (03CR) 10Ayounsi: [C: 03+1] "Noted, thanks for the explanation." 
[puppet] - 10https://gerrit.wikimedia.org/r/832632 (https://phabricator.wikimedia.org/T317746) (owner: 10Muehlenhoff) [12:28:59] btullis: I was looking at the kafka broker unavailable alerts passing by, interesting because it is quite close to the alert duration (vs the cookbook rate of restarts) [12:29:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:29:03] btullis: https://thanos.wikimedia.org/graph?g0.expr=min(kafka_server_KafkaServer_BrokerState%7Bkafka_cluster!~%22test.*%22%7D)%20by%20(kafka_cluster)&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D [12:29:58] i.e ~3m vs 2m of the alert [12:34:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:34:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[12:35:48] (03CR) 10ArielGlenn: dumpcirrussearch.sh: Replace gzip with lbzip2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/835705 (owner: 10Ebernhardson) [12:39:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35060 and previous config saved to /var/cache/conftool/dbconfig/20220928-123920-root.json [12:39:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35061 and previous config saved to /var/cache/conftool/dbconfig/20220928-123927-root.json [12:39:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35062 and previous config saved to /var/cache/conftool/dbconfig/20220928-123932-root.json [12:39:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35063 and previous config saved to /var/cache/conftool/dbconfig/20220928-123937-root.json [12:44:45] (03PS6) 10Hashar: python-build: reuse previously built wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) [12:46:00] (03Abandoned) 10Clément Goubert: pontoon: move sops-appservers stack to appservers [puppet] - 10https://gerrit.wikimedia.org/r/836133 (https://phabricator.wikimedia.org/T318671) (owner: 10Clément Goubert) [12:46:36] (03CR) 10Hashar: "I have to test it a bit more. 
Will probably add a another change to replace "python setup.py bdist_wheel" and make it a lot more stricter " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [12:51:05] (03CR) 10Muehlenhoff: spdx::convert: Fix two bugs in detecting contributors for roles/profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/835636 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:52:17] (03CR) 10Bking: k8s: Limit envoy metrics scraped from k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: 10Bking) [12:54:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35064 and previous config saved to /var/cache/conftool/dbconfig/20220928-125425-root.json [12:54:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35065 and previous config saved to /var/cache/conftool/dbconfig/20220928-125432-root.json [12:54:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35066 and previous config saved to /var/cache/conftool/dbconfig/20220928-125436-root.json [12:54:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35067 and previous config saved to /var/cache/conftool/dbconfig/20220928-125442-root.json [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220928T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. 
[13:01:07] indeed, nothing to do [13:01:09] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:02:26] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:03:18] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:03:59] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks! I've also tested it with one profile (docker) and one role (docker::ci) which lack contributor signoff and it correctl" [puppet] - 10https://gerrit.wikimedia.org/r/836145 (owner: 10Jbond) [13:04:11] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:04:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/836166 (owner: 10Jbond) [13:04:20] PROBLEM - SSH on ms-be1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:04:47] (03PS3) 10Muehlenhoff: New cookbook to roll-restart/reboot Thanos frontends [cookbooks] - 10https://gerrit.wikimedia.org/r/835565 [13:04:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:05:30] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[13:05:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:06:39] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:06:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:09:22] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:09:24] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:09:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35068 and previous config saved to /var/cache/conftool/dbconfig/20220928-130930-root.json [13:09:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35069 and previous config saved to /var/cache/conftool/dbconfig/20220928-130937-root.json [13:09:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35070 and previous config saved to /var/cache/conftool/dbconfig/20220928-130941-root.json [13:09:47] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35071 and previous config saved to /var/cache/conftool/dbconfig/20220928-130947-root.json [13:11:09] (03CR) 10Jbond: [C: 03+2] spdx: ensure we also check for profile/role sub modules [puppet] - 10https://gerrit.wikimedia.org/r/836145 (owner: 10Jbond) [13:11:12] (03CR) 10Jbond: [C: 03+2] spdx: correct spelling mistake for unsigned_contibutors [puppet] - 10https://gerrit.wikimedia.org/r/836166 (owner: 10Jbond) [13:11:19] (03PS13) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [13:11:40] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:12:19] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/835585 (https://phabricator.wikimedia.org/T318697) (owner: 10Clément Goubert) [13:14:59] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [13:15:25] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[13:16:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:17:25] (03CR) 10Muehlenhoff: [C: 03+2] New cookbook to roll-restart/reboot Thanos frontends [cookbooks] - 10https://gerrit.wikimedia.org/r/835565 (owner: 10Muehlenhoff) [13:17:52] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:19:06] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:20:57] (03Merged) 10jenkins-bot: New cookbook to roll-restart/reboot Thanos frontends [cookbooks] - 10https://gerrit.wikimedia.org/r/835565 (owner: 10Muehlenhoff) [13:23:02] (03Abandoned) 10Muehlenhoff: spdx::convert: Fix two bugs in detecting contributors for roles/profiles [puppet] - 10https://gerrit.wikimedia.org/r/835636 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:23:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:24:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35072 and previous config saved to /var/cache/conftool/dbconfig/20220928-132435-root.json [13:24:36] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops-radar: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (10Jdforrester-WMF) [13:24:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling 
@ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35073 and previous config saved to /var/cache/conftool/dbconfig/20220928-132442-root.json [13:24:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35074 and previous config saved to /var/cache/conftool/dbconfig/20220928-132446-root.json [13:24:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35075 and previous config saved to /var/cache/conftool/dbconfig/20220928-132452-root.json [13:24:58] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops-radar: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (10Jdforrester-WMF) [13:28:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:29:46] (03CR) 10Klausman: [C: 03+1] Add Cumin alias for ML staging hosts [puppet] - 10https://gerrit.wikimedia.org/r/836175 (owner: 10Muehlenhoff) [13:30:27] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 42 [13:30:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:31:07] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons. 
[13:31:33] (03PS1) 10Clément Goubert: pontoon: switch wmcs project for sops-appservers [puppet] - 10https://gerrit.wikimedia.org/r/836213 [13:31:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 42 [13:31:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:32:36] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 577 [13:32:45] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-thanos-fe rolling restart_daemons on A:thanos-fe-codfw [13:32:51] (03CR) 10Clément Goubert: [C: 03+2] pontoon: switch wmcs project for sops-appservers [puppet] - 10https://gerrit.wikimedia.org/r/836213 (owner: 10Clément Goubert) [13:33:10] (03PS2) 10Muehlenhoff: Add Cumin alias for ML staging hosts [puppet] - 10https://gerrit.wikimedia.org/r/836175 [13:33:13] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 577 [13:33:54] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[13:34:20] !log jmm@cumin2002 END (FAIL) - Cookbook sre.o11y.roll-restart-reboot-thanos-fe (exit_code=1) rolling restart_daemons on A:thanos-fe-codfw [13:34:32] (03CR) 10Volans: "replies inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [13:34:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Slst2020) [13:35:40] (03PS1) 10Muehlenhoff: sre.o11y.roll-restart-reboot-thanos-fe: Fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/836214 [13:36:10] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:36:51] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Slst2020) [13:37:52] (03PS1) 10Clément Goubert: pontoon: fix sops-appservers server names [puppet] - 10https://gerrit.wikimedia.org/r/836215 [13:38:53] (03CR) 10Clément Goubert: [C: 03+2] pontoon: fix sops-appservers server names [puppet] - 10https://gerrit.wikimedia.org/r/836215 (owner: 10Clément Goubert) [13:39:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35076 and previous config saved to /var/cache/conftool/dbconfig/20220928-133940-root.json [13:39:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35077 and previous config saved to /var/cache/conftool/dbconfig/20220928-133946-root.json [13:39:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 
(re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35078 and previous config saved to /var/cache/conftool/dbconfig/20220928-133951-root.json [13:39:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35079 and previous config saved to /var/cache/conftool/dbconfig/20220928-133957-root.json [13:40:13] (03CR) 10Jbond: customscripts: export 'mgmt' entries from hiera_export (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [13:41:43] (03CR) 10Volans: [C: 03+2] redfish: use the management IP instead of FQDN [software/spicerack] - 10https://gerrit.wikimedia.org/r/836127 (https://phabricator.wikimedia.org/T313979) (owner: 10Volans) [13:42:52] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 714 [13:44:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 714 [13:44:41] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 812 [13:45:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 812 [13:45:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 1273 [13:45:28] (03PS3) 10Muehlenhoff: docker: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/832252 (https://phabricator.wikimedia.org/T308013) [13:45:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 1273 [13:46:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2603 [13:46:38] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2603 [13:46:39] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with 
action 'email' for AS: 2635 [13:47:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2635 [13:47:37] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2647 [13:48:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2647 [13:48:18] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2906 [13:49:10] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1003.wikimedia.org [13:49:49] (03Merged) 10jenkins-bot: redfish: use the management IP instead of FQDN [software/spicerack] - 10https://gerrit.wikimedia.org/r/836127 (https://phabricator.wikimedia.org/T313979) (owner: 10Volans) [13:50:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2906 [13:50:20] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3292 [13:50:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons. [13:50:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3292 [13:50:49] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3300 [13:51:06] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [13:51:39] (03PS14) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [13:52:14] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons. 
[13:53:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:53:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3300 [13:53:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3856 [13:54:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35080 and previous config saved to /var/cache/conftool/dbconfig/20220928-135445-root.json [13:54:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35081 and previous config saved to /var/cache/conftool/dbconfig/20220928-135451-root.json [13:54:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35082 and previous config saved to /var/cache/conftool/dbconfig/20220928-135456-root.json [13:55:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35083 and previous config saved to /var/cache/conftool/dbconfig/20220928-135502-root.json [13:55:08] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [13:55:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3856 [13:55:36] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4181 [13:55:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4181 [13:55:59] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4230 [13:56:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) 
with action 'email' for AS: 4230 [13:56:30] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4637 [13:57:31] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4637 [13:57:32] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4775 [13:59:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4775 [13:59:06] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4826 [13:59:13] (03CR) 10Muehlenhoff: [C: 03+2] sre.o11y.roll-restart-reboot-thanos-fe: Fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/836214 (owner: 10Muehlenhoff) [13:59:31] (03PS2) 10Bking: k8s: Limit envoy metrics scraped from k8s [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) [13:59:57] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4826 [13:59:58] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4922 [14:00:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4922 [14:00:12] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 5400 [14:00:24] (03CR) 10CI reject: [V: 04-1] k8s: Limit envoy metrics scraped from k8s [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: 10Bking) [14:00:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 5400 [14:00:35] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 5650 [14:01:13] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 5650 [14:01:14] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with 
action 'email' for AS: 6079 [14:01:44] hello! I and one other ChromeOS user are experiencing very slow pageloads ( >1 min.) on one particular page. This might just be a regular issue to report on Phab, but since it does involve pageloading, swinging by here first to see if it's SRE territory. Page is https://en.wiktionary.org/wiki/Reconstruction:Proto-Iranian/H%C3%A1cwah [14:01:44] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. [14:02:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6079 [14:02:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6128 [14:02:07] When the page does load, formatting is broken https://usercontent.irccloud-cdn.com/file/Enlco0lc/image.png [14:02:37] Specifically on ChromeOS, that is. Someone on another browser reports https://usercontent.irccloud-cdn.com/file/YkhRIgS8/image.png [14:02:41] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6128 [14:02:41] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-thanos-fe rolling restart_daemons on A:thanos-fe-codfw [14:02:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6614 [14:03:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6614 [14:03:04] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6762 [14:03:22] So I'm assuming some mix of a wiki-side issue and a server one? 
[14:03:44] (03PS3) 10Bking: k8s: Limit envoy metrics scraped from k8s [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) [14:03:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host graphite1005.mgmt.eqiad.wmnet with reboot policy FORCED [14:04:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6762 [14:04:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 7195 [14:04:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-thanos-fe (exit_code=0) rolling restart_daemons on A:thanos-fe-codfw [14:04:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7195 [14:04:38] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 7713 [14:05:38] RECOVERY - SSH on ms-be1040.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:05:48] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7713 [14:05:49] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 7784 [14:06:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7784 [14:06:04] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 7795 [14:06:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7795 [14:06:33] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 7843 [14:06:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7843 [14:06:55] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8075 [14:07:07] 
Tamzin: does it sound similar to https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(miscellaneous)#logged-in_editing_is_suddenly_messed_up ? [14:07:17] Tamzin: SRE tracks things in Phabricator too, btw [14:08:03] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-thanos-fe rolling restart_daemons on A:thanos-fe-eqiad [14:08:07] I'll note safemode does fix this [14:08:16] (03PS15) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [14:08:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8075 [14:08:24] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8359 [14:08:32] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudrabbit1003.wikimedia.org [14:08:32] TheresNoTime: but no, that doesn't sound very similar [14:08:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8359 [14:08:50] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8674 [14:09:12] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.wikimedia.org with OS bullseye [14:09:17] And not logged-in-exclusive. Currently trying to load in Incognito... 
[14:09:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-thanos-fe (exit_code=0) rolling restart_daemons on A:thanos-fe-eqiad [14:09:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35084 and previous config saved to /var/cache/conftool/dbconfig/20220928-140950-root.json [14:09:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8674 [14:09:55] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8781 [14:09:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35085 and previous config saved to /var/cache/conftool/dbconfig/20220928-140956-root.json [14:10:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35086 and previous config saved to /var/cache/conftool/dbconfig/20220928-141001-root.json [14:10:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35087 and previous config saved to /var/cache/conftool/dbconfig/20220928-141007-root.json [14:10:23] finally loaded [14:10:38] so 65 seconds, plus 10-15 before I posted the first message [14:11:06] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8781 [14:11:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8966 [14:11:37] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons. 
[14:11:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8966 [14:12:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 10310 [14:12:12] primefac suggested https://en.wiktionary.org/w/index.php?title=Template%3Atop3&type=revision&diff=69389736&oldid=53305615 as the culprit, although the bridge from there to an 80-second pageload time, I'm not sure [14:12:26] !log added python3-gjson v0.0.5 to apt.w.o (bullseye only) [14:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:43] (03CR) 10Bking: k8s: Limit envoy metrics scraped from k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: 10Bking) [14:13:20] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [14:14:13] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 10310 [14:14:14] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 11039 [14:14:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 11039 [14:14:30] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 11164 [14:14:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 11164 [14:15:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 12041 [14:15:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12041 [14:15:24] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 12200 [14:15:42] (03CR) 10JHathaway: Fix config template for OTRS or VRTS aliases (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749) (owner: 10JHathaway) [14:15:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12200 [14:15:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 13335 [14:16:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host graphite1005.mgmt.eqiad.wmnet with reboot policy FORCED [14:16:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T314041)', diff saved to https://phabricator.wikimedia.org/P35088 and previous config saved to /var/cache/conftool/dbconfig/20220928-141638-ladsgroup.json [14:16:43] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [14:17:36] (03CR) 10Muehlenhoff: [C: 03+2] Ship WMF-specific systemd unit parts as systemd override [puppet] - 10https://gerrit.wikimedia.org/r/832632 (https://phabricator.wikimedia.org/T317746) (owner: 10Muehlenhoff) [14:18:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13335 [14:18:04] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 13489 [14:18:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13489 [14:18:30] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 13760 [14:18:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13760 [14:18:55] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 14361 [14:19:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 14361 [14:19:08] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 14630 [14:19:55] 
(03PS1) 10Cmjohnson: Adding graphite1005 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/836221 (https://phabricator.wikimedia.org/T313853) [14:20:13] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 14630 [14:20:14] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15133 [14:21:27] (03CR) 10Cmjohnson: [C: 03+2] Adding graphite1005 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/836221 (https://phabricator.wikimedia.org/T313853) (owner: 10Cmjohnson) [14:21:33] (03PS16) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [14:21:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:21:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15133 [14:21:59] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15695 [14:22:48] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15695 [14:22:49] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16276 [14:24:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16276 [14:24:04] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16735 [14:24:20] (03CR) 10Muehlenhoff: [C: 03+2] docker: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/832252 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:24:28] !log ayounsi@cumin1001 END (PASS) - Cookbook 
sre.network.peering (exit_code=0) with action 'email' for AS: 16735 [14:24:29] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 18106 [14:24:45] (JobUnavailable) firing: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:25:07] (03CR) 10Bking: [C: 04-1] "Need to use "metrics_relabel_config" as in ops.pp" [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: 10Bking) [14:25:15] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [14:25:33] (03PS2) 10Muehlenhoff: query_service: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/832251 (https://phabricator.wikimedia.org/T308013) [14:26:35] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for ML staging hosts [puppet] - 10https://gerrit.wikimedia.org/r/836175 (owner: 10Muehlenhoff) [14:26:43] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 18106 [14:26:44] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 19108 [14:26:49] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1003.wikimedia.org with reason: host reimage [14:26:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:27:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 19108 
[14:27:31] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 19151 [14:27:39] 10SRE, 10Wikidata, 10Wikidata-Termbox, 10serviceops, 10wdwb-tech: Plan to scale up termbox service to be able to render the termbox for desktop pageviews - https://phabricator.wikimedia.org/T261486 (10jijiki) [14:27:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 19151 [14:27:57] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 19653 [14:28:12] Tamzin: if it only happens with a page, it will most likely be related to content and/or code (e.g. a software bug), rather than a server or network issue; for those it will be better to file a phabricator bug, as usually only sysadmins and deployers are tracking real time reports [14:28:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 19653 [14:28:22] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 20115 [14:28:43] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 20115 [14:28:44] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 21928 [14:28:46] jynus: Think I just figured it out. Ticket inbound [14:29:00] great! 
[14:29:06] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 21928 [14:29:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 21949 [14:29:18] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host graphite1005.eqiad.wmnet with OS bullseye [14:29:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, and 2 others: Q1:rack/setup/install graphite1005 - https://phabricator.wikimedia.org/T313853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host graphite1005.eqiad.wmnet with OS bullseye [14:29:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1003.wikimedia.org with reason: host reimage [14:29:31] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 21949 [14:29:32] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 22616 [14:29:45] (JobUnavailable) firing: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:54] 10SRE, 10Wikidata, 10Wikidata-Termbox, 10serviceops, 10wdwb-tech: Plan to scale up termbox service to be able to render the termbox for desktop pageviews - https://phabricator.wikimedia.org/T261486 (10Addshore) 05Open→03Invalid Didnt happen in 2020 / since, so closing this now [14:30:28] (03PS3) 10BBlack: cache node disk layout p11n for F4 config [puppet] - 10https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244) [14:30:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 22616 [14:30:55] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 
'email' for AS: 22773 [14:30:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 22773 [14:30:57] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 22987 [14:31:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P35089 and previous config saved to /var/cache/conftool/dbconfig/20220928-143145-ladsgroup.json [14:32:00] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 [14:32:12] (03PS1) 10Lucas Werkmeister (WMDE): Remove wmgEntityUsageModifierLimitsStatement on cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836227 (https://phabricator.wikimedia.org/T296384) [14:32:44] (03CR) 10Lucas Werkmeister (WMDE): "Seeking DBA approval :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836227 (https://phabricator.wikimedia.org/T296384) (owner: 10Lucas Werkmeister (WMDE)) [14:33:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 22987 [14:33:04] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 25885 [14:34:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 25885 [14:34:06] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 26744 [14:34:45] (JobUnavailable) firing: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:35:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 26744 [14:35:16] !log ayounsi@cumin1001 START - Cookbook 
sre.network.peering with action 'email' for AS: 29791 [14:35:41] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 (owner: 10Jbond) [14:36:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 29791 [14:36:15] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 32098 [14:38:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 32098 [14:38:12] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 32787 [14:39:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 32787 [14:39:11] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 32934 [14:39:45] (JobUnavailable) resolved: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:41:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 32934 [14:41:02] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 35280 [14:41:18] (03PS27) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [14:41:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
graphite1005.eqiad.wmnet with reason: host reimage [14:41:20] (03PS17) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [14:42:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 35280 [14:42:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 36351 [14:43:16] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 36351 [14:43:17] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 36692 [14:43:51] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence, and 2 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (10Papaul) @Marostegui i have a 32G DIMM. if the server is in service can you please depool and poweroff? Thanks [14:44:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2036.codfw.wmnet with OS buster [14:44:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 36692 [14:44:38] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 40217 [14:44:41] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host logstash2036.codfw.wmnet with OS buster [14:44:55] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [14:44:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:45:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 40217 [14:45:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 46450 [14:45:04] (03PS18) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [14:45:06] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 [14:45:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on graphite1005.eqiad.wmnet with reason: host reimage [14:45:20] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1003.wikimedia.org with OS bullseye [14:45:27] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [14:45:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46450 [14:45:41] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 52320 [14:46:08] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52320 [14:46:09] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 53334 [14:46:16] (03PS4) 10Bking: k8s: Limit envoy metrics scraped from k8s [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) [14:46:18] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence, and 2 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (10jcrespo) @Papaul The host can be poweroff uncleanly (just plug off). Service was migrated temporarily already. 
[14:46:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P35090 and previous config saved to /var/cache/conftool/dbconfig/20220928-144651-ladsgroup.json [14:46:52] 10SRE, 10Discovery-Search: Provide compatible elasticsearch-oss (7.x) and wmf-elasticsearch-search-plugins for buster on WMF APT repo - https://phabricator.wikimedia.org/T318820 (10matthiasmullie) [14:47:09] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence, and 2 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (10Papaul) thanks [14:47:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 53334 [14:47:33] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 57695 [14:47:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 57695 [14:47:35] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 62955 [14:48:13] (03CR) 10Muehlenhoff: [C: 03+2] query_service: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/832251 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:48:36] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [14:48:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 62955 [14:48:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 65517 [14:48:51] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 65517 [14:48:52] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 (owner: 10Jbond) [14:48:52] !log ayounsi@cumin1001 START - 
Cookbook sre.network.peering with action 'email' for AS: 199524 [14:48:57] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1003.wikimedia.org [14:50:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 [14:50:16] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 209453 [14:50:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 209453 [14:50:31] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 262589 [14:51:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262589 [14:51:45] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 393950 [14:52:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 393950 [14:52:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 394354 [14:52:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394354 [14:52:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] C:memcached Fix memcached bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/835585 (https://phabricator.wikimedia.org/T318697) (owner: 10Clément Goubert) [14:53:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. 
[14:55:25] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1003.wikimedia.org [14:59:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host graphite1005.eqiad.wmnet with OS bullseye [15:00:04] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install graphite1005 - https://phabricator.wikimedia.org/T313853 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host graphite1005.eqiad.wmnet with OS bullseye completed: - gra... [15:00:22] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@aa7984f]: (no justification provided) [15:00:37] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@aa7984f]: (no justification provided) (duration: 00m 14s) [15:00:55] !log deploying Airflow for hdfsarchiver operator fix [15:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:23] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons.
[15:01:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T314041)', diff saved to https://phabricator.wikimedia.org/P35091 and previous config saved to /var/cache/conftool/dbconfig/20220928-150158-ladsgroup.json [15:02:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [15:02:02] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [15:02:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [15:02:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T314041)', diff saved to https://phabricator.wikimedia.org/P35092 and previous config saved to /var/cache/conftool/dbconfig/20220928-150230-ladsgroup.json [15:03:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install graphite1005 - https://phabricator.wikimedia.org/T313853 (10Cmjohnson) [15:03:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install graphite1005 - https://phabricator.wikimedia.org/T313853 (10Cmjohnson) 05Open→03Resolved [15:06:25] (03PS1) 10Arturo Borrero Gonzalez: cloudnet: codfw1dev: switch to a single-NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/836240 (https://phabricator.wikimedia.org/T318824) [15:07:14] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence, and 2 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (10Papaul) 05Open→03Resolved All good replaced DIMM 7 [15:07:43] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8674 [15:07:49] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:09:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8674 [15:09:52] !log installing twisted security updates [15:09:53] !log 
pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2036.codfw.wmnet with reason: host reimage [15:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:11:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 19108 [15:11:13] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:11:18] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) [15:11:42] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 19108 [15:12:19] (03CR) 10Jbond: [C: 03+1] New cookbook to roll-restart (or roll-reboot) the eventschemas cluster(s) [cookbooks] - 10https://gerrit.wikimedia.org/r/836181 (owner: 10Muehlenhoff) [15:12:26] (03PS3) 10Filippo Giunchedi: customscripts: export 'mgmt' entries from hiera_export [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) [15:12:28] (03CR) 10Filippo Giunchedi: customscripts: export 'mgmt' entries from hiera_export (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [15:12:43] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 [15:12:50] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 714 [15:13:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2036.codfw.wmnet with reason: host reimage [15:13:24] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence, and 2 others: db2098 crashed - 
https://phabricator.wikimedia.org/T318062 (10jcrespo) Thank you, I can confirm the recovery of the previous amount of memory: https://grafana.wikimedia.org/goto/ZVDBm7V4z?orgId=1 I will work to restore the original data from bac... [15:13:38] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:13:53] (03CR) 10CI reject: [V: 04-1] customscripts: export 'mgmt' entries from hiera_export [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [15:15:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 714 [15:15:04] (03PS4) 10Filippo Giunchedi: customscripts: export 'mgmt' entries from hiera_export [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) [15:16:03] (03CR) 10CI reject: [V: 04-1] customscripts: export 'mgmt' entries from hiera_export [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [15:16:25] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 (owner: 10Jbond) [15:18:24] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 4922 [15:18:31] (03CR) 10Btullis: [C: 03+1] "Looks good to me. Thanks." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/836181 (owner: 10Muehlenhoff) [15:18:57] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37378/console" [puppet] - 10https://gerrit.wikimedia.org/r/836094 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [15:19:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 4922 [15:19:41] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] deployment_server: switch to php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/836094 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [15:20:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:20:36] (03CR) 10Filippo Giunchedi: "Looks like CI failure is unrelated" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [15:22:13] (03CR) 10JHathaway: [C: 03+2] Fix config template for OTRS or VRTS aliases [puppet] - 10https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749) (owner: 10JHathaway) [15:24:58] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:25:02] 10SRE, 10Infrastructure-Foundations, 10netops: Add peering sessions on cr1-eqiad Equinix port - https://phabricator.wikimedia.org/T294948 (10ayounsi) 05Open→03Resolved a:03ayounsi I used the new peering cookbook to mass email the ~76 AS that were on cr2 but not cr1. About ~10 replied quickly and some n... 
[15:25:55] (PS8) Hnowlan: thumbor: new service chart [deployment-charts] - https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [15:26:36] !log installing libgoogle-gson-java security updates on bullseye [15:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2036.codfw.wmnet with OS buster [15:28:24] SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host logstash2036.codfw.wmnet with OS buster completed: - logstash2036 (**PASS**) -... [15:31:25] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37379/console" [puppet] - https://gerrit.wikimedia.org/r/836143 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto) [15:32:16] SRE, ops-eqiad, Infrastructure-Foundations, netops, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (ayounsi) @Jclark-ctr @wiki_willy do you have any update about this task? This is set to high priority as it is to prevent an...
[15:38:10] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cloudweb: install php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/836143 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [15:41:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wikitech: switch to php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/836144 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [15:44:40] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Replace nutcracker with mcrouter - https://phabricator.wikimedia.org/T318695 (10jijiki) [15:44:46] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Replace nutcracker with mcrouter on thumbor* - https://phabricator.wikimedia.org/T221081 (10jijiki) [15:45:49] (03PS1) 10Eigyan: [config]: Deploy GDI survey Wave 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836244 [15:46:14] jouncebot now [15:46:14] No deployments scheduled for the next 2 hour(s) and 13 minute(s) [15:46:30] (03PS6) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) [15:46:32] (03PS7) 10Dduvall: P:ci::docker: Install upstream docker packages for all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) [15:46:34] (03PS7) 10Dduvall: P:ci::docker: Upgrade docker to 20.10.18 on all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) [15:47:05] (03PS9) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [15:47:14] (03PS1) 10Ssingh: test_tls: remove deprecated option OP_NO_TLSv1_3 [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/836247 [15:47:26] (03CR) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [15:47:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [15:47:52] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid test cluster: Roll restart of Druid jvm daemons. [15:48:30] (03PS1) 10Volans: CHANGELOG: add changelogs for release v4.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/836248 [15:48:57] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v4.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/836248 (owner: 10Volans) [15:51:23] (03CR) 10Ssingh: [C: 03+2] test_tls: remove deprecated option OP_NO_TLSv1_3 [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/836247 (owner: 10Ssingh) [15:51:35] !log nokafor@deploy1002 Started deploy [airflow-dags/analytics@0646be1]: (no justification provided) [15:51:45] !log nokafor@deploy1002 Finished deploy [airflow-dags/analytics@0646be1]: (no justification provided) (duration: 00m 10s) [15:53:21] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 36351 [15:54:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, I'll merge and import these tomorrow." 
[puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [15:54:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 36351 [15:55:08] (03CR) 10Arturo Borrero Gonzalez: "PCC looks correct:" [puppet] - 10https://gerrit.wikimedia.org/r/836240 (https://phabricator.wikimedia.org/T318824) (owner: 10Arturo Borrero Gonzalez) [15:55:38] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v4.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/836248 (owner: 10Volans) [15:56:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 40217 [15:57:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 40217 [15:57:17] !log dancy@deploy1002 Installing scap version "4.24.0" for 561 hosts [15:57:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid test cluster: Roll restart of Druid jvm daemons. 
[15:57:36] !log dancy@deploy1002 Installation of scap version "4.24.0" completed for 561 hosts [15:58:48] (PS2) Eigyan: [config]: Deploy GDI survey Wave 3 [mediawiki-config] - https://gerrit.wikimedia.org/r/836244 (https://phabricator.wikimedia.org/T318156) [15:59:09] (PS1) Volans: Upstream release v4.0.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/836253 [15:59:20] (CR) Volans: [C: +2] Upstream release v4.0.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/836253 (owner: Volans) [16:03:55] (CR) Dduvall: [C: +1] "Thanks so much for finding/fixing this" [puppet] - https://gerrit.wikimedia.org/r/836135 (https://phabricator.wikimedia.org/T308501) (owner: Jelto) [16:04:00] (CR) Dduvall: [C: +1] docker_registry_ha: add codfw Trusted Runners to jwt_allowed_ips [puppet] - https://gerrit.wikimedia.org/r/836139 (https://phabricator.wikimedia.org/T308501) (owner: Jelto) [16:04:59] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: Bking) [16:06:21] (CR) EllenR: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/836244 (https://phabricator.wikimedia.org/T318156) (owner: Eigyan) [16:07:49] (Merged) jenkins-bot: Upstream release v4.0.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/836253 (owner: Volans) [16:09:02] (PS10) BCornwall: Unlink certificate renewal and OCSP handling [software/acme-chief] - https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) [16:09:24] (CR) BCornwall: Unlink certificate renewal and OCSP handling (1 comment) [software/acme-chief] - https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: BCornwall) [16:15:08] !log uploaded spicerack_4.0.0 to apt.wikimedia.org bullseye-wikimedia [16:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:07] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:16:45] (CR) CI reject: [V: -1] Unlink certificate renewal and OCSP handling [software/acme-chief] - https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: BCornwall) [16:20:57] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 2635 [16:21:16] SRE, serviceops: Update conf1* servers - https://phabricator.wikimedia.org/T310062 (akosiaris) Open→Resolved Yup, resolving. Thanks! [16:21:21] PHP 7.4 question: is there any expected timeline or tracking task for removing the patch that makes our PHP 7.4 emit the old serialization format (for compatibility with 7.2)? [16:21:33] SRE, SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (nskaggs) I approve.
[16:22:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 2635 [16:24:13] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:25:18] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 4775 [16:26:31] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 4775 [16:26:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:27:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:27:42] (03PS1) 10Majavah: O:toolforge: block local crontabs on accessible hosts [puppet] - 10https://gerrit.wikimedia.org/r/836258 [16:28:11] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 10310 [16:28:52] (03CR) 10Cathal Mooney: [C: 03+1] "Looks good, and yeah seems relatively straightforward given it's tagging on eno2 already. Also quite Neutron-specific so probably not som" [puppet] - 10https://gerrit.wikimedia.org/r/836240 (https://phabricator.wikimedia.org/T318824) (owner: 10Arturo Borrero Gonzalez) [16:29:36] (03CR) 10Cathal Mooney: [C: 03+1] "Should also say if/when you want to merge this let me know, we'll need to change the switches at the same time to trunk those Vlans to the" [puppet] - 10https://gerrit.wikimedia.org/r/836240 (https://phabricator.wikimedia.org/T318824) (owner: 10Arturo Borrero Gonzalez) [16:31:06] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:31:08] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 10310 [16:33:01] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:33:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T314041)', diff saved to https://phabricator.wikimedia.org/P35093 and previous config saved to 
/var/cache/conftool/dbconfig/20220928-163329-ladsgroup.json [16:33:33] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [16:34:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:34:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:34:44] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 13335 [16:35:49] SRE, Platform Engineering, serviceops, Patch-For-Review, Performance-Team (Radar): Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 (jijiki) a:jijiki [16:36:04] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1024.mgmt.eqiad.wmnet with reboot policy FORCED [16:36:08] !log nokafor@deploy1002 Started deploy [airflow-dags/analytics@f89d689]: (no justification provided) [16:36:14] SRE, serviceops, Performance-Team (Radar): Phase out nutcracker for connecting to redis - https://phabricator.wikimedia.org/T277183 (jijiki) a:Joe→jijiki [16:36:20] !log nokafor@deploy1002 Finished deploy [airflow-dags/analytics@f89d689]: (no justification provided) (duration: 00m 12s) [16:38:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 13335 [16:39:13] (PS1) JHathaway: mail::mx Remove LDAP support [puppet] - https://gerrit.wikimedia.org/r/836259 (https://phabricator.wikimedia.org/T244792) [16:39:43] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/836259 (https://phabricator.wikimedia.org/T244792) (owner: JHathaway) [16:39:46] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/836259 (https://phabricator.wikimedia.org/T244792) (owner: JHathaway) [16:40:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:40:49]
SRE, serviceops, Performance-Team (Radar): Phase out nutcracker from mediawiki servers - https://phabricator.wikimedia.org/T277183 (jijiki) [16:42:29] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (Cmjohnson) a:Cmjohnson→Jclark-ctr @jclark-ctr can you verify the port for kubernetes1023, looks like something is already in c6/port 36 [16:43:55] SRE, serviceops, Performance-Team (Radar): Phase out nutcracker from mediawiki servers - https://phabricator.wikimedia.org/T277183 (jijiki) [16:44:17] SRE, Platform Engineering, serviceops, Patch-For-Review, Performance-Team (Radar): Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 (jijiki) [16:48:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P35095 and previous config saved to /var/cache/conftool/dbconfig/20220928-164835-ladsgroup.json [16:54:05] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 10310 [16:58:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1024.mgmt.eqiad.wmnet with reboot policy FORCED [16:59:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 10310 [17:03:03] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/836259 (https://phabricator.wikimedia.org/T244792) (owner: JHathaway) [17:03:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P35096 and previous config saved to /var/cache/conftool/dbconfig/20220928-170342-ladsgroup.json [17:04:38] (CR) JHathaway: [C: +2] mail::mx Remove LDAP support [puppet] - https://gerrit.wikimedia.org/r/836259
(https://phabricator.wikimedia.org/T244792) (owner: JHathaway) [17:12:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1024.eqiad.wmnet with OS bullseye [17:12:16] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye [17:16:21] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kubernetes1024.eqiad.wmnet with OS bullseye [17:16:25] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1... [17:17:06] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (Cmjohnson) Also, @jclark-ctr please check the network cables are in the correct port.
1024 is giving me a cable failure PXE-E61: Media test failure, check cable [17:17:29] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (Cmjohnson) [17:18:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T314041)', diff saved to https://phabricator.wikimedia.org/P35097 and previous config saved to /var/cache/conftool/dbconfig/20220928-171848-ladsgroup.json [17:18:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1133.eqiad.wmnet with reason: Maintenance [17:18:57] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [17:19:03] (CR) Jbond: "LGTM: see minor nit inline" [cookbooks] - https://gerrit.wikimedia.org/r/836128 (https://phabricator.wikimedia.org/T313979) (owner: Volans) [17:19:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1133.eqiad.wmnet with reason: Maintenance [17:19:56] SRE, ops-eqiad, DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (Cmjohnson) a:Jclark-ctr [17:23:10] (CR) BCornwall: [C: +1] P:lvs::configueration: move classification to hiera and add error checks [puppet] - https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) (owner: Jbond) [17:23:31] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:23:52] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 4181 [17:24:20] (PS2) Dduvall: P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) [17:24:38] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 4181 [17:26:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:27:02] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 32098 [17:27:26] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 32098 [17:29:42] (PS2) Ayounsi: Inital FHRP support [software/homer/deploy] - https://gerrit.wikimedia.org/r/826559 (https://phabricator.wikimedia.org/T311218) [17:30:36] (CR) Ayounsi: "Ready for reviews!" [software/homer/deploy] - https://gerrit.wikimedia.org/r/826559 (https://phabricator.wikimedia.org/T311218) (owner: Ayounsi) [17:33:04] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host logstash1036.mgmt.eqiad.wmnet with reboot policy FORCED [17:33:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host logstash1037.mgmt.eqiad.wmnet with reboot policy FORCED [17:34:40] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logstash1036.mgmt.eqiad.wmnet with reboot policy FORCED [17:35:46] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 19653 [17:36:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 19653 [17:36:46] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logstash1037.mgmt.eqiad.wmnet with reboot policy FORCED [17:38:47] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (Cmjohnson) [17:47:31] (PS11) BCornwall: Unlink certificate renewal and OCSP handling [software/acme-chief] - https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) [17:50:39] (CR) CI reject: [V: -1] Unlink certificate renewal and OCSP handling [software/acme-chief] - https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: BCornwall) [17:53:52] (PS5)
Vlad.shapik: Update the logic to run test coverage [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) [17:57:10] (PS2) Jdlrobson: Web team config cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/835246 (https://phabricator.wikimedia.org/T316568) [18:00:05] brennen and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220928T1800). [18:00:05] brennen and jnuche: (Dis)respected human, time to deploy MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220928T1800). Please do the needful. [18:00:20] o/ [18:00:32] (CR) Vlad.shapik: Update the logic to run test coverage (5 comments) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: Vlad.shapik) [18:01:17] (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/836263 (https://phabricator.wikimedia.org/T314192) [18:01:19] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/836263 (https://phabricator.wikimedia.org/T314192) (owner: TrainBranchBot) [18:02:02] (Merged) jenkins-bot: group1 wikis to 1.40.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/836263 (https://phabricator.wikimedia.org/T314192) (owner: TrainBranchBot) [18:06:15] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host logstash1037.mgmt.eqiad.wmnet with reboot policy FORCED [18:06:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:06:31] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.3 refs T314192 [18:06:35] T314192: 1.40.0-wmf.3 deployment blockers -
https://phabricator.wikimedia.org/T314192 [18:07:03] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logstash1037.mgmt.eqiad.wmnet with reboot policy FORCED [18:09:04] SRE, Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (lbowmaker) [18:09:15] SRE, Data Engineering Planning, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (lbowmaker) [18:10:10] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.3 refs T314192 (duration: 03m 38s) [18:13:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:13:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:17:19] (PS2) Volans: redfish-based cookbooks: adapt to new API [cookbooks] - https://gerrit.wikimedia.org/r/836128 (https://phabricator.wikimedia.org/T313979) [18:17:28] (CR) Volans: "thanks for the review!"
[cookbooks] - https://gerrit.wikimedia.org/r/836128 (https://phabricator.wikimedia.org/T313979) (owner: Volans) [18:20:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:21:56] (CR) Volans: [C: +2] redfish-based cookbooks: adapt to new API [cookbooks] - https://gerrit.wikimedia.org/r/836128 (https://phabricator.wikimedia.org/T313979) (owner: Volans) [18:22:03] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@3f23a1b]: (no justification provided) [18:22:15] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@3f23a1b]: (no justification provided) (duration: 00m 11s) [18:22:56] !log installed spicerack 4.0.0-1+deb11u1 on cumin2002 [18:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:48] (Merged) jenkins-bot: redfish-based cookbooks: adapt to new API [cookbooks] - https://gerrit.wikimedia.org/r/836128 (https://phabricator.wikimedia.org/T313979) (owner: Volans) [18:25:59] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:30:01] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:36:27] (PS1) Volans: sre.hosts.provision: fix call to spicerack.redfish [cookbooks] - https://gerrit.wikimedia.org/r/836265 [18:37:47] (CR) Volans: [C: +2] "trivial fix, self-merging" [cookbooks] - https://gerrit.wikimedia.org/r/836265 (owner: Volans) [18:41:25] (Merged) jenkins-bot: sre.hosts.provision: fix call to spicerack.redfish [cookbooks] - https://gerrit.wikimedia.org/r/836265 (owner: Volans) [18:44:45] SRE, ops-eqiad, DC-Ops, Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Ottomata) @Cmjohnson just checking in on these. Status update? Not a huge hurry, but we might want to start working on these in late October /...
[19:08:15] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host logstash2037.mgmt.codfw.wmnet with reboot policy FORCED [19:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:25:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [19:39:44] (PS4) DDesouza: Deploy Research Incentive survey on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331) [19:39:49] (PS4) DDesouza: Deploy Research Incentive survey on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834042 (https://phabricator.wikimedia.org/T318328) [19:39:53] (PS3) DDesouza: Deploy Research Incentive survey on enwiki (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/834050 (https://phabricator.wikimedia.org/T318333) [19:59:08] Greetings everyone [19:59:16] 0// [19:59:33] o/ [20:00:04] RoanKattouw, Urbanecm, cjming, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220928T2000). [20:00:04] danisztls and eigyan: A patch you scheduled for UTC late backport window is about to be deployed.
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:37] hihi, I can deploy [20:00:56] ty! [20:00:57] :) [20:01:06] thank you TheresNoTime [20:01:21] :D gimme 1 sec to set everything up [20:02:27] danisztls: going to start with yours :) [20:02:45] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/834042 (https://phabricator.wikimedia.org/T318328) (owner: DDesouza) [20:03:08] (PS1) Majavah: P:toolforge::shell_environ: explicitely install emacs-nox [puppet] - https://gerrit.wikimedia.org/r/836279 (https://phabricator.wikimedia.org/T318858) [20:04:04] (Merged) jenkins-bot: Deploy Research Incentive survey on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834042 (https://phabricator.wikimedia.org/T318328) (owner: DDesouza) [20:04:33] !log samtar@deploy1002 Started scap: Backport for [[gerrit:834042|Deploy Research Incentive survey on arwiki (T318328)]] [20:04:37] T318328: Deploy Research Incentive Survey on Arabic Wikipedia - https://phabricator.wikimedia.org/T318328 [20:04:57] !log samtar@deploy1002 samtar and dani: Backport for [[gerrit:834042|Deploy Research Incentive survey on arwiki (T318328)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:05:07] danisztls: that's 834042 live on mwdebug, can you test?
:) [20:05:16] will test [20:05:24] (CR) BryanDavis: [C: +1] P:toolforge::shell_environ: explicitely install emacs-nox [puppet] - https://gerrit.wikimedia.org/r/836279 (https://phabricator.wikimedia.org/T318858) (owner: Majavah) [20:06:08] (PS5) Samtar: Deploy Research Incentive survey on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331) (owner: DDesouza) [20:08:46] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logstash2037.mgmt.codfw.wmnet with reboot policy FORCED [20:10:01] how's it looking danisztls? :) [20:10:42] TheresNoTime: Some messages are missing translations [20:10:48] hmmm [20:11:04] not sure if they will be live if train deployment [20:11:09] danisztls: do we need to revert that one and try another? [20:11:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:11:29] I'm thinking on reverting it and deployment next week [20:11:40] *deploying [20:11:42] sure, going to revert [20:11:46] !log samtar@deploy1002 Sync cancelled. [20:11:57] We can skip the eswiki patch (same issue) [20:12:20] okay, 834050 worth trying? [20:12:30] (enwiki beta) [20:12:38] yes [20:12:48] please merge the beta change [20:12:49] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/834050 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza) [20:14:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:14:24] danisztls: guessing this can just be sync'd once merged?
[20:14:31] yes [20:14:37] :) [20:14:54] (Merged) jenkins-bot: Deploy Research Incentive survey on enwiki (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/834050 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza) [20:15:28] TheresNoTime: thanks [20:15:54] sync'd :) doing yours now eigyan [20:16:05] (PS3) Samtar: [config]: Deploy GDI survey Wave 3 [mediawiki-config] - https://gerrit.wikimedia.org/r/836244 (https://phabricator.wikimedia.org/T318156) (owner: Eigyan) [20:16:14] TheresNoTime thank you! [20:17:12] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/836244 (https://phabricator.wikimedia.org/T318156) (owner: Eigyan) [20:18:03] (Merged) jenkins-bot: [config]: Deploy GDI survey Wave 3 [mediawiki-config] - https://gerrit.wikimedia.org/r/836244 (https://phabricator.wikimedia.org/T318156) (owner: Eigyan) [20:18:29] !log samtar@deploy1002 Started scap: Backport for [[gerrit:836244|[config]: Deploy GDI survey Wave 3 (T318156)]] [20:18:33] T318156: Deploy GDI Safety Survey Wave 3 on EN, ES, FR, PT wikis - week of Sept. 26, 2022 - https://phabricator.wikimedia.org/T318156 [20:18:53] !log samtar@deploy1002 samtar and essexigyan: Backport for [[gerrit:836244|[config]: Deploy GDI survey Wave 3 (T318156)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:19:01] eigyan: that's live on mwdebug, can you test please?
:) [20:19:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:19:11] will do TheresNoTime [20:19:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:19:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:20:24] all is 💯 TheresNoTime [20:20:35] \o/ syncing [20:20:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:20:50] :) [20:21:03] thank you for all you do TheresNoTime [20:21:31] Hear, hear. TheresNoTime is awesome. [20:21:38] when things go well, this is an easy thing to do :D [20:21:52] (but thank you ^^) [20:22:15] Agreed, TheresNoTime is most excellent [20:22:28] sssh :p [20:23:03] the people who worked on https://wikitech.wikimedia.org/wiki/Scap#scap_backport are the really awesome people! Such a great tool :) [20:24:49] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:836244|[config]: Deploy GDI survey Wave 3 (T318156)]] (duration: 06m 19s) [20:24:53] T318156: Deploy GDI Safety Survey Wave 3 on EN, ES, FR, PT wikis - week of Sept. 26, 2022 - https://phabricator.wikimedia.org/T318156 [20:24:56] (CR) Urbanecm: [C: -2] "WMF Legal got back to me. They stated that "account's 2FA status would constitute non-public personal information".
This means access to i" [mediawiki-config] - https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726) (owner: Samtar) [20:25:10] eigyan: all sync'd - worth another quick check in production if you don't mind :) [20:25:32] will do TheresNoTime [20:25:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:26:05] (Abandoned) Samtar: InitialiseSettings.php: Add oathauth-verify-user to default bureaucrat [mediawiki-config] - https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726) (owner: Samtar) [20:26:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:26:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:27:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:28:02] * TheresNoTime will be around for a little while, are there any other patches? [20:29:35] all is well with my patch, thanks again TheresNoTime [20:29:45] eigyan: you're very welcome :) [20:39:28] !log closing UTC late backport window [20:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:12] (PS1) Andrew Bogott: Add nova-fullstack service to codfw1dev [puppet] - https://gerrit.wikimedia.org/r/836292 [20:50:06] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 12200 [20:50:33] (CR) CI reject: [V: -1] Add nova-fullstack service to codfw1dev [puppet] - https://gerrit.wikimedia.org/r/836292 (owner: Andrew Bogott) [20:50:55] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 12200 [20:51:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T314041)', diff saved to https://phabricator.wikimedia.org/P35098 and previous config saved to
/var/cache/conftool/dbconfig/20220928-205117-ladsgroup.json [20:51:23] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [20:52:30] (PS2) Andrew Bogott: Add nova-fullstack service to codfw1dev [puppet] - https://gerrit.wikimedia.org/r/836292 [20:54:43] (CR) Andrew Bogott: [C: +2] P:toolforge::shell_environ: explicitely install emacs-nox [puppet] - https://gerrit.wikimedia.org/r/836279 (https://phabricator.wikimedia.org/T318858) (owner: Majavah) [20:55:14] (CR) Andrew Bogott: [C: +2] Add nova-fullstack service to codfw1dev [puppet] - https://gerrit.wikimedia.org/r/836292 (owner: Andrew Bogott) [20:57:50] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:59:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:01:02] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Cmjohnson) a:Cmjohnson→Jclark-ctr @Jclark-ctr can you verify that mgmt cables are connected to these servers please?
[21:01:46] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Cmjohnson) [21:06:16] !log installed spicerack 4.0.0-1+deb11u1 on cumin1001 [21:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P35099 and previous config saved to /var/cache/conftool/dbconfig/20220928-210624-ladsgroup.json [21:21:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P35100 and previous config saved to /var/cache/conftool/dbconfig/20220928-212131-ladsgroup.json [21:22:03] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:30:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:35:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:36:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T314041)', diff saved to https://phabricator.wikimedia.org/P35101 and previous config saved to /var/cache/conftool/dbconfig/20220928-213640-ladsgroup.json [21:36:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [21:36:44] T314041: Drop old
templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [21:36:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [21:37:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T314041)', diff saved to https://phabricator.wikimedia.org/P35102 and previous config saved to /var/cache/conftool/dbconfig/20220928-213701-ladsgroup.json [21:52:39] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (greg) [21:56:09] (PS3) Jdlrobson: Web team config cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/835246 (https://phabricator.wikimedia.org/T316568) [21:58:41] (PS1) Ebernhardson: wmgCirrusSearchShardCount: Override prod settings for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/836301 (https://phabricator.wikimedia.org/T316711) [21:58:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:05:29] (CR) Dduvall: "I went ahead and cherry picked this on gitlab-runners-puppetmaster-01.gitlab-runners.eqiad1.wikimedia.cloud and it's working as expected."
[puppet] - https://gerrit.wikimedia.org/r/835162 (https://phabricator.wikimedia.org/T307810) (owner: Jelto) [22:08:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:14:59] (PS1) DLynch: Stop mobile visual enhancements from rolling out to jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/836304 (https://phabricator.wikimedia.org/T318871) [22:15:38] (PS2) DLynch: Stop mobile visual enhancements from rolling out to jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/836304 (https://phabricator.wikimedia.org/T318871) [22:40:19] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:40:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:45:19] (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:50:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:58:44] (PS1) Andrew Bogott: nova-fullstack: set --deployment [puppet] - https://gerrit.wikimedia.org/r/836306 [22:59:34] (CR) CI reject: [V: -1] nova-fullstack: set --deployment [puppet] - https://gerrit.wikimedia.org/r/836306 (owner: Andrew Bogott) [23:02:33] (PS2) Andrew Bogott: nova-fullstack: set --deployment [puppet] - https://gerrit.wikimedia.org/r/836306 [23:05:32] (CR) CI reject: [V: -1] nova-fullstack: set --deployment [puppet] - https://gerrit.wikimedia.org/r/836306 (owner: Andrew Bogott) [23:06:38] (PS3) Andrew Bogott: nova-fullstack: set --deployment [puppet] - https://gerrit.wikimedia.org/r/836306 [23:10:30] (CR) Andrew Bogott: [C: +2] nova-fullstack: set --deployment [puppet] - https://gerrit.wikimedia.org/r/836306 (owner: Andrew Bogott) [23:17:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [23:17:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [23:17:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T314041)', diff saved to https://phabricator.wikimedia.org/P35103 and previous config saved to /var/cache/conftool/dbconfig/20220928-231719-ladsgroup.json [23:17:23] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [23:23:34] (PS1) Raymond Ndibe: prometheus: Add new scrape target [puppet] - https://gerrit.wikimedia.org/r/836310 [23:25:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:48:44] SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install
logstash203[67] - https://phabricator.wikimedia.org/T313848 (Papaul) [23:50:26] SRE, SRE-swift-storage, ops-codfw, DC-Ops, decommission-hardware: decommission ms-be20[28-39].codfw.wmnet - https://phabricator.wikimedia.org/T318689 (Papaul) [23:51:04] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2037'] [23:52:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2037'] [23:53:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2037.codfw.wmnet with OS buster [23:53:09] SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host logstash2037.codfw.wmnet with OS buster