[00:00:05] brennen and mutante: Your horoscope predicts another unfortunate Phabricator Deployment deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220623T0000). [00:00:42] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1001.eqiad.wmnet with reason: maintenance [00:00:45] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1001.eqiad.wmnet with reason: maintenance [00:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:02] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phabricator.wikimedia.org with reason: maintenance [00:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:04] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phabricator.wikimedia.org with reason: maintenance [00:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:21] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab2001.codfw.wmnet with reason: maintenance [00:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:24] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2001.codfw.wmnet with reason: maintenance [00:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:20] !log phabricator deploy finished (T311175) [00:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:24] T311175: Deploy Phabricator release/2022-06-22/1 - https://phabricator.wikimedia.org/T311175 [00:14:11] (CR) Dzahn: [C: +2] phabricator: get envoy to listen on ipv6 [puppet] - https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: 
Filippo Giunchedi) [00:14:53] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:21:48] (CR) DDesouza: "Rebased to solve merge conflicts." [mediawiki-config] - https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza) [00:27:00] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) The UDF in [[ https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/807340 | I3ff40d5b2 ]] can be used to identify web reque... [00:31:11] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:55] !log end of phabricator maintenance window [00:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:04] (CR) Dzahn: [C: +2] "envoy is listening on 0 :::443 now" [puppet] - https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi) [00:42:29] (CR) Juan90264: [C: +1] "Looks good to me, now to deploy this change, schedule it at https://wikitech.wikimedia.org/wiki/Deployments, and make sure it's available " [mediawiki-config] - https://gerrit.wikimedia.org/r/807247 (https://phabricator.wikimedia.org/T311104) (owner: Labdajiwa) [00:47:47] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:50:33] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:54:53] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:59:27] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:13:59] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:16:05] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:51:49] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:58:36] (CR) Legoktm: [C: +1] Remove mailman-admins [puppet] - https://gerrit.wikimedia.org/r/807078 (owner: Muehlenhoff) [02:09:18] (CR) Dzahn: [C: +1] "yea, not in use and no special sudo priv lines anyways" [puppet] - https://gerrit.wikimedia.org/r/807078 (owner: Muehlenhoff) [02:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:10:31] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:15:09] RECOVERY - SSH on wtp1025.mgmt is OK: 
SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:22:23] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:28] SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (RobH) a:RobH→ayounsi > Dear Customer, > > We have place a loop from the far end 1 hop before Telia towards your side in SG3. > > Kindly check on your end and advise us if we can normalize the loop... [03:02:01] SRE, API Platform, Traffic, VisualEditor, and 2 others: Find out if Varnish is messing with ETags, and what to do about it. - https://phabricator.wikimedia.org/T310904 (ssastry) [04:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:55:02] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:56:01] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:04:17] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:05:28] 503 [05:05:37] No server is available to handle this request. 
[05:06:05] Wikidata; someone just reported the same on viwiki [05:06:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [05:06:38] And back, at least for me [05:07:19] (ProbeDown) firing: (16) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:07:19] (ProbeDown) firing: (10) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:08:14] thanks for the report, looking [05:09:01] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - No response from remote host 91.198.174.244 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:09:03] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:09:31] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:09:45] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Varnish [05:09:45] PROBLEM - PyBal backends health check on lvs3007 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3060.esams.wmnet, cp3054.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet, cp3064.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb_443: Servers cp3054.esams.wmnet, cp3058.esams.wmnet, cp3062.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: testlb6_443: Servers cp3060.esams.wmnet, cp3054.esams.wmne [05:09:45] 4.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3060.esams.wmnet, cp3064.esams.wmnet, cp3058.esams.wmnet, cp3056.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:09:53] PROBLEM - NTP peers on dns3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [05:09:57] PROBLEM - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are ma [05:09:57] n but pooled: testlb6_443: Servers cp5009.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:09:59] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5007.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5010.eqsin.wmnet are 
marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: testlb6_443: Serve [05:09:59] 9.eqsin.wmnet, cp5007.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:10:07] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:37] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:47] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:53] (JobUnavailable) firing: (4) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:11:05] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:27] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:11:27] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 2.66 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [05:11:33] 👀 [05:11:53] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:03] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:12:07] RECOVERY - NTP peers on dns3001 is OK: NTP OK: Offset 3.3e-05 secs https://wikitech.wikimedia.org/wiki/NTP [05:12:07] RECOVERY - PyBal backends health check on lvs3007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:12:18] (ProbeDown) firing: (18) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:12:19] (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:12:29] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the 
threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [05:13:09] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:13:45] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:14:35] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5005 is CRITICAL: 5.802e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005 [05:14:37] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5009 is CRITICAL: 6.213e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [05:14:51] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5011 is CRITICAL: 6.125e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [05:15:03] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5001 is CRITICAL: 6.479e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5001 [05:15:07] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5003 is CRITICAL: 5.988e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003 [05:15:07] PROBLEM - Time elapsed since the last kafka 
event processed by purged on cp5012 is CRITICAL: 6.417e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [05:15:29] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5014 is CRITICAL: 6.208e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014 [05:15:35] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5008 is CRITICAL: 6.264e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [05:15:35] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5006 is CRITICAL: 6.741e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [05:15:45] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5010 is CRITICAL: 6.533e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010 [05:15:45] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5016 is CRITICAL: 6.512e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5016 [05:15:47] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5004 is CRITICAL: 6.596e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5004 [05:16:01] (JobUnavailable) firing: (4) Reduced availability for job pdu_sentry4 in ops@eqsin - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:16:05] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5007 is CRITICAL: 6.711e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [05:16:29] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5015 is CRITICAL: 6.789e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015 [05:16:37] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5002 is CRITICAL: 7.087e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002 [05:16:37] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5013 is CRITICAL: 7.022e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5013 [05:17:47] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_search:platform.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:15] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:18:21] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.488 second response time https://wikitech.wikimedia.org/wiki/Varnish 
[05:18:51] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [05:18:51] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [05:19:08] Ah. [05:19:28] checking those [05:19:34] we're discussing in another channel, don't worry :) [05:20:21] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.510 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:21:23] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:22:19] (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:23:07] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5016 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.496 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:23:51] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [05:23:51] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [05:24:11] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5012 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 8.010 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:24:43] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.534 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:25:08] Timed nicely today with when my phone goes off of DND :p [05:26:43] RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:26:47] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:27:19] (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:28:09] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqsin.wikimedia.org, port=443): Read timed out. 
(read timeout=15)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [05:28:09] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5011 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.543 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:28:11] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 78.19 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [05:29:05] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:29:05] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.478 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:29:15] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [05:29:27] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5012 is OK: (C)5000 gt (W)3000 gt 244.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [05:29:51] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5014 is OK: (C)5000 gt (W)3000 gt 356.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014 [05:29:55] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5004 is OK: (C)5000 gt (W)3000 gt 308 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5004 [05:29:55] RECOVERY - Time elapsed since the last kafka event processed by 
purged on cp5008 is OK: (C)5000 gt (W)3000 gt 355 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [05:29:55] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5006 is OK: (C)5000 gt (W)3000 gt 387.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [05:29:55] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5007 is OK: (C)5000 gt (W)3000 gt 484.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [05:29:55] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5009 is OK: (C)5000 gt (W)3000 gt 306.6 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [05:29:55] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5001 is OK: (C)5000 gt (W)3000 gt 305.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5001 [05:29:55] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5005 is OK: (C)5000 gt (W)3000 gt 403.6 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005 [05:29:57] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5003 is OK: (C)5000 gt (W)3000 gt 373.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003 [05:29:57] RECOVERY - Time elapsed since the last kafka event processed by purged on 
cp5002 is OK: (C)5000 gt (W)3000 gt 371.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002 [05:29:57] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5010 is OK: (C)5000 gt (W)3000 gt 326.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010 [05:30:07] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5016 is OK: (C)5000 gt (W)3000 gt 491.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5016 [05:30:45] (JobUnavailable) resolved: (2) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:30:51] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5015 is OK: (C)5000 gt (W)3000 gt 523.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015 [05:30:57] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5013 is OK: (C)5000 gt (W)3000 gt 178.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5013 [05:31:35] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 
[05:31:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [05:31:37] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5011 is OK: (C)5000 gt (W)3000 gt 304.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [05:32:18] (ProbeDown) resolved: (6) Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:32:19] (ProbeDown) resolved: (6) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:38:57] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 57.19 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [05:41:19] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [05:43:05] (03PS1) 10Tim Starling: mcrouter: include "add" command in mw-stats route [puppet] - 10https://gerrit.wikimedia.org/r/807665 (https://phabricator.wikimedia.org/T310662) [05:45:55] (03PS12) 10Tim Starling: Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) [05:46:38] (03CR) 10Tim Starling: Implement MediaWiki multi-DC traffic component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling) [05:57:19] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:00:04] kormat, marostegui, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220623T0600). [06:04:45] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [06:06:35] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:08:24] 10SRE, 10Traffic, 10Wikimedia-Incident: 503 Service Unavailable (June 23 2022) - https://phabricator.wikimedia.org/T311197 (10TheresNoTime) Many thanks for the report @AlexisJazz — looks like it's recovered now. Think the only publicly actionable item here is going to be adding an incident to https://www.wik... 
[06:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:11:53] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [06:15:01] 10SRE, 10Traffic, 10Wikimedia-Incident: 503 Service Unavailable (June 23 2022) - https://phabricator.wikimedia.org/T311197 (10Marostegui) This should be, indeed, fixed by now. [06:15:05] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:18:59] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [06:23:37] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [06:40:51] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:51:56] apergos: just a courtesy note, I'm not going to be around for today's training (and I note you have someone else to train with a patch regardless) [06:52:12] thanks for the heads up [06:57:27] (03CR) 10Slyngshede: [C: 03+2] profile::aptrepo::wikimedia test public apt repo on Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506 (owner: 10Slyngshede) [07:00:04] Amir1 and apergos: (Dis)respected human, time to deploy UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220623T0700). Please do the needful. [07:00:04] kostajh: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:14] morning! [07:01:50] kostajh: you here? [07:02:41] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:02:52] * TheresNoTime might be able to sit in on the training after all [07:03:40] * urbanecm waves into the channel [07:03:41] well let's see where our patch owner is [07:03:46] hey urbanecm what's up [07:04:09] if you sit in, TheresNoTime, you'll be doing the deployment :-) [07:04:26] I'm just wandering around :) [07:04:40] might as well get your browser tabs and terminal sessions set up just in case, TNT [07:04:46] apergos: doing! :D [07:04:54] good good! 
[07:05:17] I'll get in the gmeet link then [07:05:17] I pinged Kosta [07:05:21] ty [07:06:13] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-06-22 06:50:48 (3163 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [07:06:17] I can also babysit the patch in case that's needed, but I prefer if someone else does it :) [07:06:40] nah [07:06:49] we have a trainee who will deploy today [07:06:57] we have the patch owner ... we hope [07:07:07] By babysit I meant test etc, not deploy :)) [07:07:10] I'll be reminding our trainee of the steps [07:07:12] and that's it [07:07:25] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:07:31] oh instead of the patch owner? yeah well I dunno, that's for you two to work out :-D [07:08:14] Just got back [07:09:13] ok! we have a trainee today who will be doing your deployment [07:09:22] sorry to be late. the new-ish time doesn't work so well with school drop-offs for me :\ [07:09:24] ok cool! [07:09:24] so we'll just walk through everything step by step [07:09:28] (PS1) Slyngshede: class profile::aptrepo::wikimedia enable mod_macro [puppet] - https://gerrit.wikimedia.org/r/807881 [07:10:17] (PS1) Muehlenhoff: Remove LDAP access for mshaver [puppet] - https://gerrit.wikimedia.org/r/807882 [07:10:44] let me see if I have +2 over there, our trainee does not [07:10:45] hi kostajh, you're in my hands today :P [07:10:50] I'm reasonably certain my patch won't break things, so it's a good one to train with :) [07:10:51] and we'll get this party started [07:11:03] hm merge conflict I see though [07:11:11] wanna poke that and fix it up? [07:11:14] apergos: just hit rebase [07:11:19] it's needed in config repo :/ [07:11:26] ah ok! 
[07:11:47] (PS2) ArielGlenn: GrowthExperiments: Enable link recommendations frontend, round 4 [mediawiki-config] - https://gerrit.wikimedia.org/r/806365 (https://phabricator.wikimedia.org/T304548) (owner: Kosta Harlan) [07:11:59] doo deed doo dee doo [07:12:59] also if TheresNoTime has deployment access already, we can give them +2 on the repo now, it's automatically approved for those in deployment shell group. [07:13:08] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36012/console" [puppet] - https://gerrit.wikimedia.org/r/807881 (owner: Slyngshede) [07:13:25] please do, for the next time :-) [07:13:31] (CR) ArielGlenn: [C: +2] GrowthExperiments: Enable link recommendations frontend, round 4 [mediawiki-config] - https://gerrit.wikimedia.org/r/806365 (https://phabricator.wikimedia.org/T304548) (owner: Kosta Harlan) [07:14:09] that was my brief cameo, the rest will all be TheresNoTime [07:14:37] (CR) Muehlenhoff: [C: +2] Remove LDAP access for mshaver [puppet] - https://gerrit.wikimedia.org/r/807882 (owner: Muehlenhoff) [07:14:38] {{done}}, TheresNoTime should now have the +2 [07:14:39] (Merged) jenkins-bot: GrowthExperiments: Enable link recommendations frontend, round 4 [mediawiki-config] - https://gerrit.wikimedia.org/r/806365 (https://phabricator.wikimedia.org/T304548) (owner: Kosta Harlan) [07:15:01] great, thank you! 
[07:15:04] (CR) Slyngshede: [V: +1 C: +2] class profile::aptrepo::wikimedia enable mod_macro [puppet] - https://gerrit.wikimedia.org/r/807881 (owner: Slyngshede) [07:15:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 25 hosts with reason: Reboots [07:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 25 hosts with reason: Reboots [07:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 22 hosts with reason: Reboots [07:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 22 hosts with reason: Reboots [07:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: Reboots [07:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: Reboots [07:16:09] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:17] kostajh: that should be live on mwdebug1002, can you test? 
[07:17:28] TheresNoTime: yes, having a look [07:17:32] :) [07:18:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:19:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:16] TheresNoTime: looks good to me! [07:21:25] cool, thank you :) [07:22:53] PROBLEM - Check systemd state on apt2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:23:35] (synchronising now) [07:23:51] (PS1) Muehlenhoff: Remove LDAP access for schang and ldoan [puppet] - https://gerrit.wikimedia.org/r/807906 [07:24:25] (PS1) Slyngshede: class profile::aptrepo::wikimedia enable modules headers and ssl. 
[puppet] - https://gerrit.wikimedia.org/r/807907 [07:25:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:21] !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:806365|GrowthExperiments: Enable link recommendations frontend, round 4 (T304548)]] (duration: 03m 37s) [07:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:26] T304548: Deploy "add a link" to 4th round of wikis - https://phabricator.wikimedia.org/T304548 [07:25:42] kostajh: that should be live now :) [07:25:58] could you test again just to be sure? [07:26:00] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36013/console" [puppet] - https://gerrit.wikimedia.org/r/807907 (owner: Slyngshede) [07:26:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:26:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:22] I'm not going to jinx it but it's looking okay from this end.. [07:29:50] (CR) Muehlenhoff: [C: +2] Remove LDAP access for schang and ldoan [puppet] - https://gerrit.wikimedia.org/r/807906 (owner: Muehlenhoff) [07:30:02] (PS2) Muehlenhoff: Remove mailman-admins [puppet] - https://gerrit.wikimedia.org/r/807078 [07:30:41] (CR) Slyngshede: [V: +1 C: +2] class profile::aptrepo::wikimedia enable modules headers and ssl. 
[puppet] - https://gerrit.wikimedia.org/r/807907 (owner: Slyngshede) [07:32:30] TheresNoTime: yep lgtm [07:32:32] (CR) Muehlenhoff: [C: +2] Remove mailman-admins [puppet] - https://gerrit.wikimedia.org/r/807078 (owner: Muehlenhoff) [07:32:33] thank you! [07:32:38] kostajh: you're welcome! :) [07:33:07] SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (ayounsi) a:ayounsi→RobH Thanks, no errors there. Please remove the loop and follow up with Telia. [07:33:10] urbanecm: do you think we should backport the refreshLinkRecommendations.php script? There's not a strict need to, but, TheresNoTime, if you would like more deployment training... [07:33:39] I'm talking about this patch specifically https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/807496 [07:34:03] kostajh: no strong opinion. It's possible to run scripts w/o backporting [07:34:45] RECOVERY - Check systemd state on apt2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:34:47] I guess let's just leave it alone then [07:34:52] this is a two file patch I see with no guarantee about the order the files land [07:34:56] (for a deploy) [07:35:39] Okay, so nothing more to do in this window? 
:) [07:36:30] !log UTC morning deploys done [07:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:33] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:14] * TheresNoTime chalks up one "didn't break everything" deploy \o/ [07:39:16] !log installing firejail security updates [07:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:07] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:43:15] (03PS1) 10Slyngshede: class profile::aptrepo::wikimedia Listen on temporary http ports [puppet] - 10https://gerrit.wikimedia.org/r/807908 [07:43:41] (03Abandoned) 10Slyngshede: class profile::aptrepo::wikimedia Listen on temporary http ports [puppet] - 10https://gerrit.wikimedia.org/r/807908 (owner: 10Slyngshede) [07:44:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 7 hosts with reason: Reboots [07:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 7 hosts with reason: Reboots [07:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:39] (03PS1) 10Slyngshede: class profile::aptrepo::wikimedia Listen on temporary http ports [puppet] - 10https://gerrit.wikimedia.org/r/807909 [07:45:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 9 hosts with reason: Reboots [07:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 9 hosts with reason: Reboots [07:45:23] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 14 hosts with reason: Reboots [07:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 14 hosts with reason: Reboots [07:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:41] PROBLEM - MariaDB Replica Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1784.74 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:46:43] PROBLEM - MariaDB Replica Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1786.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:46:45] PROBLEM - MariaDB Replica Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1788.60 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:47:25] (03CR) 10Slyngshede: [C: 03+2] class profile::aptrepo::wikimedia Listen on temporary http ports [puppet] - 10https://gerrit.wikimedia.org/r/807909 (owner: 10Slyngshede) [07:49:05] RECOVERY - MariaDB Replica Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:49:07] RECOVERY - MariaDB Replica Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:49:09] RECOVERY - MariaDB Replica Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 0.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:53:50] ^ due to reboots [07:56:14] 10SRE, 10Infrastructure-Foundations, 10netbox: Grant cn=nda some sort of read only access 
to Netbox - https://phabricator.wikimedia.org/T302870 (10ayounsi) I agree that having more levels of NDA wouldn't be a good path forward. My previous comment was towards auditing if there was anything that shouldn't be... [08:00:04] hashar and brennen: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220623T0800). [08:00:25] (03CR) 10JMeybohm: [C: 03+2] mediawiki: disable revalidation for api,app,parsoid clusters [puppet] - 10https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [08:06:57] (03CR) 10Ayounsi: [C: 03+1] "That's great!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney) [08:08:36] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [08:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:11] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:09:59] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [08:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:25] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:13:19] (03PS6) 10Ayounsi: Netbox: add monitoring to dns.git endpoint [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) [08:14:01] (03PS1) 10JMeybohm: helm-state-metrics fix containerPort protocol [deployment-charts] - 10https://gerrit.wikimedia.org/r/807913 (https://phabricator.wikimedia.org/T310714) [08:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:16:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2078,2132].codfw.wmnet with reason: Reboots [08:16:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2078,2132].codfw.wmnet with reason: Reboots [08:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2078,2133].codfw.wmnet with reason: Reboots [08:16:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2078,2133].codfw.wmnet with reason: Reboots [08:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2078,2134].codfw.wmnet with reason: Reboots [08:16:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2078,2134].codfw.wmnet with reason: Reboots [08:16:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2078,2135].codfw.wmnet with reason: Reboots [08:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2078,2135].codfw.wmnet with reason: Reboots [08:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:46] Expect codfw haproxy irc alerts [08:17:52] I am rebooting m* codfw masters [08:19:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 13 hosts with reason: Reboots [08:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 13 hosts with reason: Reboots [08:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:51] (03PS1) 10Slyngshede: class profile::aptrepo::wikimedia allow access to autoinstall and more. 
[puppet] - 10https://gerrit.wikimedia.org/r/807914 [08:22:55] PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [08:23:31] PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [08:23:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 13 hosts with reason: Reboots [08:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:45] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [08:23:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 13 hosts with reason: Reboots [08:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:11] RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:25:43] RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:25:57] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:30:01] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10kostajh) @lbowmaker @hnowlan does this service have a page on Wikitech? 
[08:30:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: Reboots [08:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: Reboots [08:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:39] (03CR) 10JMeybohm: [C: 03+2] helm-state-metrics fix containerPort protocol [deployment-charts] - 10https://gerrit.wikimedia.org/r/807913 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [08:33:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 13 hosts with reason: Reboots [08:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 13 hosts with reason: Reboots [08:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:02] (03Merged) 10jenkins-bot: helm-state-metrics fix containerPort protocol [deployment-charts] - 10https://gerrit.wikimedia.org/r/807913 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [08:39:52] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [08:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:12] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[08:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:41] (PS3) Muehlenhoff: bsection: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805190 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [08:40:57] (PS1) Jelto: gitlab_runner/hiera: change gitlab_url for devtools project [puppet] - https://gerrit.wikimedia.org/r/807915 [08:42:46] (CR) CI reject: [V: -1] bsection: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805190 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [08:43:45] (CR) Jelto: [C: +2] gitlab_runner/hiera: change gitlab_url for devtools project [puppet] - https://gerrit.wikimedia.org/r/807915 (owner: Jelto) [08:44:26] (CR) Slyngshede: [C: +2] class profile::aptrepo::wikimedia allow access to autoinstall and more. [puppet] - https://gerrit.wikimedia.org/r/807914 (owner: Slyngshede) [08:45:33] (PS1) Marostegui: db2093: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/807917 [08:47:47] PROBLEM - Check systemd state on db2134 is CRITICAL: CRITICAL - degraded: The following units failed: pt-heartbeat-wikimedia.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:59] (CR) Ayounsi: [C: +2] Netbox: add monitoring to dns.git endpoint (3 comments) [puppet] - https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: Ayounsi) [08:48:28] (CR) Marostegui: [C: +2] db2093: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/807917 (owner: Marostegui) [08:50:01] RECOVERY - Check systemd state on db2134 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:52:23] !log joal@deploy1002 Started deploy [airflow-dags/analytics@b3fe77c]: Small fixes to 2 jobs [08:52:25] (PS1) KartikMistry: Update cxserver to 2022-06-23-052732-production [deployment-charts] - 
https://gerrit.wikimedia.org/r/807919 (https://phabricator.wikimedia.org/T311196) [08:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:31] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@b3fe77c]: Small fixes to 2 jobs (duration: 00m 08s) [08:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:01] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:55:02] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:57:20] (PS1) Marostegui: Revert "db2093: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/807891 [09:00:58] urbanecm: oops, I remembered a patch I wanted to deploy (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805480). But I can just +2 this as it is beta labs only, right? 
[09:01:01] PROBLEM - MariaDB Replica Lag: x1 on db2101 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1791.80 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:01:11] kostajh: yes, and git pull at deployment host [09:01:17] (CR) Kosta Harlan: [C: +2] Structured task: enable free text for "other" rejection reason in betalabs [mediawiki-config] - https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) (owner: MewOphaswongse) [09:01:19] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:41] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2096.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:01:41] PROBLEM - MariaDB Replica IO: x1 on db2096 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:02:01] PROBLEM - mysqld processes on db2096 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [09:02:11] PROBLEM - MariaDB Replica Lag: x1 on db2096 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:02:33] SRE, DSE-Kubernetes-Cluster, Infrastructure-Foundations, vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (BTullis) Thanks @Dzahn - I'm seeking to relax the permissions on the document, but I've added you speci... 
[09:02:36] (03Merged) 10jenkins-bot: Structured task: enable free text for "other" rejection reason in betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) (owner: 10MewOphaswongse) [09:02:38] PROBLEM - MariaDB read only x1 #page on db2096 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [09:02:39] PROBLEM - MariaDB Replica SQL: x1 on db2096 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:02:45] 10SRE: SRE - https://phabricator.wikimedia.org/T311208 (10M.shaffar19) [09:02:51] urbanecm: in /srv/mediawiki-staging ? [09:02:56] marostegui: ^maintenance? [09:02:58] me [09:02:59] yeah [09:03:02] ok [09:03:04] I wonder why it paged [09:03:04] kostajh: yes [09:03:05] fixing it [09:03:07] PROBLEM - Check systemd state on db2096 is CRITICAL: CRITICAL - degraded: The following units failed: pt-heartbeat-wikimedia.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:07] good thing we have a DBA oncall [09:03:22] hello [09:03:36] maybe defaults to read only true? 
[09:03:40] no idea [09:03:59] RECOVERY - MariaDB Replica IO: x1 on db2096 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:04:17] RECOVERY - mysqld processes on db2096 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [09:04:23] nah, it is cause it was a master I reckon [09:04:25] all fixed anyways [09:04:27] RECOVERY - MariaDB Replica Lag: x1 on db2096 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:04:28] sorry for the noise [09:04:37] urbanecm: ok done [09:04:48] great, np :) [09:04:54] RECOVERY - MariaDB read only x1 #page on db2096 is OK: Version 10.4.25-MariaDB-log, Uptime 92s, read_only: True, event_scheduler: True, 78.80 QPS, connection latency: 0.005104s, query latency: 0.000607s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [09:04:55] RECOVERY - MariaDB Replica SQL: x1 on db2096 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:05:12] (03CR) 10Muehlenhoff: [C: 03+2] squid/url downloaders: Drop Gopher in ACLs, not used anywhere [puppet] - 10https://gerrit.wikimedia.org/r/807094 (owner: 10Muehlenhoff) [09:05:18] then I go back to being ooo :P [09:05:21] RECOVERY - Check systemd state on db2096 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:27] Amir1: yes!! 
[09:05:31] RECOVERY - MariaDB Replica Lag: x1 on db2101 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:06:11] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:06:11] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:08:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1178 db1179 db1180 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P29967 and previous config saved to /var/cache/conftool/dbconfig/20220623-090842-root.json [09:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:09:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:40] (03CR) 10Marostegui: [C: 03+2] Revert "db2093: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/807891 (owner: 10Marostegui) [09:10:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:21] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-06-22 06:50:48 (3184 GiB, 
+1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:13:26] (03PS1) 10Ayounsi: Revert "Netbox: add monitoring to dns.git endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/807892 [09:15:13] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:15:52] Global rename seems slow, is that any database issue? [09:18:35] MdsShakil: not necessarily, you got any error? [09:19:15] No [09:19:27] what user? [09:22:12] (03PS20) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [09:22:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29968 and previous config saved to /var/cache/conftool/dbconfig/20220623-092256-root.json [09:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29969 and previous config saved to /var/cache/conftool/dbconfig/20220623-092303-root.json [09:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29970 and previous config saved to /var/cache/conftool/dbconfig/20220623-092310-root.json [09:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:39] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:25:10] (03PS1) 10Ayounsi: Netbox internal dns.git endpoint monitoring fix [puppet] - 10https://gerrit.wikimedia.org/r/807924 (https://phabricator.wikimedia.org/T310831) [09:30:40] (03CR) 10Ayounsi: [C: 03+2] Netbox internal dns.git endpoint monitoring fix [puppet] - 10https://gerrit.wikimedia.org/r/807924 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi) [09:31:59] (03CR) 10Muehlenhoff: admin: add bmansurov to analytics-research-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761477 (https://phabricator.wikimedia.org/T301215) (owner: 10AOkoth) [09:33:33] (03CR) 10Vgutierrez: "CR looks good, let's test basic functionality in a cloud environment instance" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/807554 (owner: 10Ssingh) [09:34:08] (03PS1) 10Btullis: Increase the java heap for the Hadoop namenodes again [puppet] - 10https://gerrit.wikimedia.org/r/807925 (https://phabricator.wikimedia.org/T310293) [09:36:23] (03Abandoned) 10Ayounsi: Revert "Netbox: add monitoring to dns.git endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/807892 (owner: 10Ayounsi) [09:36:29] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:38:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29971 and previous config saved to /var/cache/conftool/dbconfig/20220623-093800-root.json [09:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29972 and previous config saved to 
/var/cache/conftool/dbconfig/20220623-093807-root.json [09:38:10] (03CR) 10Faidon Liambotis: Network check MTU report: improve log messages (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807556 (owner: 10Ayounsi) [09:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29973 and previous config saved to /var/cache/conftool/dbconfig/20220623-093814-root.json [09:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:45] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-06-22 08:15:10 (3163 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:45:15] 10SRE, 10Infrastructure-Foundations, 10netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10akosiaris) >>! In T302870#8021449, @Dzahn wrote: > Before we talk about technical implementation and putting this on ice. I am wondering..has anyone even had spec... 
[09:45:59] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:49:43] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:08] (03PS1) 10Jelto: gitlab_runner: fix whitespace in register and unregister check [puppet] - 10https://gerrit.wikimedia.org/r/807927 [09:53:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29975 and previous config saved to /var/cache/conftool/dbconfig/20220623-095304-root.json [09:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29976 and previous config saved to /var/cache/conftool/dbconfig/20220623-095311-root.json [09:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29977 and previous config saved to /var/cache/conftool/dbconfig/20220623-095318-root.json [09:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:17] (03CR) 10Jelto: [C: 03+2] gitlab_runner: fix whitespace in register and unregister check [puppet] - 10https://gerrit.wikimedia.org/r/807927 (owner: 10Jelto) [09:55:27] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:00:05] mvolz: #bothumor I ♥ Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220623T1000). [10:02:39] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:04:55] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:05:15] (03PS2) 10Ayounsi: Network check MTU report: improve log messages [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807556 [10:06:47] (03CR) 10Ayounsi: "Thanks for the review :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807556 (owner: 10Ayounsi) [10:08:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29978 and previous config saved to /var/cache/conftool/dbconfig/20220623-100808-root.json [10:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29979 and previous config saved to /var/cache/conftool/dbconfig/20220623-100815-root.json [10:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29980 and previous config saved to /var/cache/conftool/dbconfig/20220623-100822-root.json [10:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:02] (CirrusSearchHighOldGCFrequency) 
firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:13:10] (03PS1) 10Jcrespo: restore_media_file: Split restore-media-file cli into it and a library [software/mediabackups] - 10https://gerrit.wikimedia.org/r/807931 (https://phabricator.wikimedia.org/T311215) [10:13:49] (03CR) 10CI reject: [V: 04-1] restore_media_file: Split restore-media-file cli into it and a library [software/mediabackups] - 10https://gerrit.wikimedia.org/r/807931 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo) [10:14:25] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:14:57] (03CR) 10Ayounsi: [C: 03+2] Network check MTU report: improve log messages [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807556 (owner: 10Ayounsi) [10:15:47] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:54] (03PS2) 10Jcrespo: restore_media_file: Split restore-media-file cli into it and a library [software/mediabackups] - 10https://gerrit.wikimedia.org/r/807931 (https://phabricator.wikimedia.org/T311215) [10:19:09] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:19:33] (03CR) 10Muehlenhoff: "Some comments from a first pass" [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/804276 (owner: 10Slyngshede) [10:21:01] 
!log fix eqiad lvs switch port MTU [10:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:18] (03CR) 10Jcrespo: [C: 03+2] "I am going to merge to HEAD these commits even if not fully finished, given the heavy refactoring done here and on later commits, consider" [software/mediabackups] - 10https://gerrit.wikimedia.org/r/775354 (owner: 10Jcrespo) [10:22:08] (03Merged) 10jenkins-bot: Add functionality to "archiving" older status of a file [software/mediabackups] - 10https://gerrit.wikimedia.org/r/775354 (owner: 10Jcrespo) [10:22:50] (03CR) 10Jcrespo: [C: 03+2] "Another commit that is not yet finished but that is more helpful merged than on review- to complete later." [software/mediabackups] - 10https://gerrit.wikimedia.org/r/802787 (owner: 10Jcrespo) [10:23:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29981 and previous config saved to /var/cache/conftool/dbconfig/20220623-102312-root.json [10:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29982 and previous config saved to /var/cache/conftool/dbconfig/20220623-102318-root.json [10:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29983 and previous config saved to /var/cache/conftool/dbconfig/20220623-102325-root.json [10:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:29] (03Merged) 10jenkins-bot: MySQLMedia: Add unit testing and small refactorings [software/mediabackups] - 10https://gerrit.wikimedia.org/r/802787 (owner: 10Jcrespo) [10:23:39] (03CR) 
10Jcrespo: [C: 03+2] restore_media_file: Split restore-media-file cli into it and a library [software/mediabackups] - 10https://gerrit.wikimedia.org/r/807931 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo) [10:24:03] jouncebot: next [10:24:03] In 2 hour(s) and 35 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220623T1300) [10:24:04] In 2 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220623T1300) [10:25:15] (03PS7) 10Jbond: getstats: Delete old versions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) [10:25:19] !log running restart-php7.2-fpm A:parsoid or A:mw or A:mw-api to disable opcache revalidation - T266055 [10:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:24] T266055: Update Scap to perform rolling restart for all MW deploy - https://phabricator.wikimedia.org/T266055 [10:26:15] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:32:11] RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-06-22 08:15:10 (3184 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:33:03] (03PS35) 10Jbond: sre.hardware.dell: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [10:37:54] (03CR) 10Volans: [C: 03+1] "LGTM test it on netbox-next if possible." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [10:38:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29984 and previous config saved to /var/cache/conftool/dbconfig/20220623-103816-root.json [10:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29985 and previous config saved to /var/cache/conftool/dbconfig/20220623-103822-root.json [10:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29986 and previous config saved to /var/cache/conftool/dbconfig/20220623-103829-root.json [10:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:00] (03PS8) 10Jbond: getstats: Delete old versions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) [10:40:31] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:41:57] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 50.19 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [10:43:11] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 33.96 le 60 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [10:44:19] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 93.19 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [10:45:15] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:45:35] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [10:53:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29987 and previous config saved to /var/cache/conftool/dbconfig/20220623-105320-root.json [10:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29988 and previous config saved to /var/cache/conftool/dbconfig/20220623-105326-root.json [10:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29989 and previous config saved to /var/cache/conftool/dbconfig/20220623-105333-root.json [10:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:45] PROBLEM - k8s 
API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:55:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "It’s a bit of a code smell that status is now sometimes a str and sometimes a PoolStatus, but with the current code that should be fine: t" [puppet] - 10https://gerrit.wikimedia.org/r/807624 (https://phabricator.wikimedia.org/T311182) (owner: 10Ahmon Dancy) [11:07:51] (03PS9) 10Jbond: getstats: Delete old versions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) [11:07:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [11:08:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1021 es1024 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P29990 and previous config saved to /var/cache/conftool/dbconfig/20220623-110804-root.json [11:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:19] (03CR) 10Kosta Harlan: [C: 03+1] MentorDashboard: enable the Vue version of the dashboard in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [11:09:22] (03CR) 10Jbond: [C: 03+2] "thanks , merging" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [11:10:08] (03Merged) 10jenkins-bot: getstats: Delete old versions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) 
(owner: 10Jbond) [11:12:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [11:13:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "Blocked at least until I3d2a5ee32a is resubmitted (it had to be reverted)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806930 (https://phabricator.wikimedia.org/T307869) (owner: 10Lucas Werkmeister (WMDE)) [11:15:08] (03PS1) 10Ayounsi: Revert "Prometheus: temporarily disable the Netbox job" [puppet] - 10https://gerrit.wikimedia.org/r/807893 [11:15:59] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:17:37] Heads-up: please note that at 13:30 UTC today, Arzhel and I will be deploying the bird2 upgrade that will affect the following services (anything that uses anycast): Internal recursors, syslog, Wikidough, durum. For more information, see https://phabricator.wikimedia.org/T310574. [11:18:00] Puppet will be disabled and then enabled incrementally: durum, Wikidough, syslog, Internal recursors. [11:18:19] If there are any pages, Arzhel is on-call (and so am I in a way). Thank you. 
[11:19:28] (03PS1) 10JMeybohm: helm-state-metrics: Enable on all wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/807945 (https://phabricator.wikimedia.org/T310714) [11:19:43] (03CR) 10Jbond: [C: 03+2] deployment-prep: add keyholder agent for scap [puppet] - 10https://gerrit.wikimedia.org/r/804568 (https://phabricator.wikimedia.org/T310354) (owner: 10Hashar) [11:20:26] (03PS2) 10Jbond: Revert "Prometheus: temporarily disable the Netbox job" [puppet] - 10https://gerrit.wikimedia.org/r/807893 (owner: 10Ayounsi) [11:20:38] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807893 (owner: 10Ayounsi) [11:22:53] (03PS1) 10Ssingh: Add an IPv6 test marker and associated tests [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/807949 [11:23:02] * kart_ updating cxserver [11:23:19] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-06-23-052732-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/807919 (https://phabricator.wikimedia.org/T311196) (owner: 10KartikMistry) [11:24:48] (03CR) 10Ssingh: [C: 03+2] Add an IPv6 test marker and associated tests [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/807949 (owner: 10Ssingh) [11:25:05] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:25:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29991 and previous config saved to /var/cache/conftool/dbconfig/20220623-112524-root.json [11:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29992 and previous config saved to 
/var/cache/conftool/dbconfig/20220623-112529-root.json [11:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:34] (03Merged) 10jenkins-bot: Update cxserver to 2022-06-23-052732-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/807919 (https://phabricator.wikimedia.org/T311196) (owner: 10KartikMistry) [11:27:33] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:00] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:32] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:37] (03PS2) 10Ssingh: trafficserver: 9.x upgrade: rename max_connections_active_in [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) [11:30:13] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:40] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36014/console" [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [11:31:03] 10SRE, 10Platform Team Workboards (Green): Install wrk, siege and lua-cjson packages on deploy1001 - https://phabricator.wikimedia.org/T230178 (10hashar) Looks like we still have the packages installed, if the benchmarking is no more needed maybe they can be removed? ` lang=puppet,name=modules/profile/manifest... 
[11:31:10] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:50] (03CR) 10Ssingh: [V: 03+1] "The current patchset addresses the backward compatibility for 8.x. If the current approach is acceptable, I will rebase all other patches " [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [11:31:54] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:09] !log Updated cxserver to 2022-06-23-052732-production (T311196) [11:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:13] T311196: cxserver swagger spec shows example spec - https://phabricator.wikimedia.org/T311196 [11:33:33] 10SRE, 10Commons, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, and 5 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10ItamarWMDE) yes, but if we want it to be picked up next sprint, it should go to the tech backlog column... 
[11:38:46] (PS2) Tim Starling: mcrouter mw-stats: make other write commands also async [puppet] - https://gerrit.wikimedia.org/r/807665 (https://phabricator.wikimedia.org/T310662) [11:40:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29993 and previous config saved to /var/cache/conftool/dbconfig/20220623-114028-root.json [11:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29994 and previous config saved to /var/cache/conftool/dbconfig/20220623-114033-root.json [11:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:43] (PS6) Slyngshede: Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/804276 [11:41:52] (CR) Slyngshede: Ganeti Prometheus exporter, initial checkin (11 comments) [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/804276 (owner: Slyngshede) [11:41:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1128 db1129 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P29995 and previous config saved to /var/cache/conftool/dbconfig/20220623-114159-root.json [11:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:42] (CR) JMeybohm: [C: +2] helm-state-metrics: Enable on all wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/807945 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm) [11:51:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29996 and previous config saved to /var/cache/conftool/dbconfig/20220623-115104-root.json [11:51:09] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29997 and previous config saved to /var/cache/conftool/dbconfig/20220623-115110-root.json [11:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:43] (Merged) jenkins-bot: helm-state-metrics: Enable on all wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/807945 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm) [11:52:50] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:26] (PS1) Muehlenhoff: Extend access for simone-this-dot [puppet] - https://gerrit.wikimedia.org/r/807958 [11:55:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29998 and previous config saved to /var/cache/conftool/dbconfig/20220623-115532-root.json [11:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29999 and previous config saved to /var/cache/conftool/dbconfig/20220623-115537-root.json [11:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:04] (CR) Muehlenhoff: [C: +2] Extend access for simone-this-dot [puppet] - https://gerrit.wikimedia.org/r/807958 (owner: Muehlenhoff) [11:57:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:58:08] !log jayme@deploy1002 helmfile [codfw] DONE 
helmfile.d/admin 'apply'. [11:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:08] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on idp-test1002.wikimedia.org with reason: webauthn tests [11:59:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on idp-test1002.wikimedia.org with reason: webauthn tests [11:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:13] BGP was me restarting calico [11:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:18] (PS2) Jgiannelos: tegola: Point codfw to a new swift container [deployment-charts] - https://gerrit.wikimedia.org/r/807567 [12:06:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30000 and previous config saved to /var/cache/conftool/dbconfig/20220623-120608-root.json [12:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30001 and previous config saved to /var/cache/conftool/dbconfig/20220623-120614-root.json [12:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:19] (PS1) Jcrespo: mediabackups: Create querying-only cli utility [software/mediabackups] - https://gerrit.wikimedia.org/r/807961 (https://phabricator.wikimedia.org/T311215) [12:07:25] (PS3) Jgiannelos: tegola: Point codfw to a new swift container [deployment-charts] - https://gerrit.wikimedia.org/r/807567 [12:08:13] (PS4) Jgiannelos: tegola: Point codfw to a new swift container [deployment-charts] - https://gerrit.wikimedia.org/r/807567 [12:09:15] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: 
instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:09:27] (CR) Jcrespo: [C: +2] mediabackups: Create querying-only cli utility [software/mediabackups] - https://gerrit.wikimedia.org/r/807961 (https://phabricator.wikimedia.org/T311215) (owner: Jcrespo) [12:10:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30002 and previous config saved to /var/cache/conftool/dbconfig/20220623-121035-root.json [12:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30003 and previous config saved to /var/cache/conftool/dbconfig/20220623-121041-root.json [12:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:03] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:57] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:14:57] (PS1) Jelto: gitlab_runner/hiera: remove registration_token dummy from hiera [puppet] - https://gerrit.wikimedia.org/r/807966 [12:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:17:07] (CR) Jelto: [C: +2] gitlab_runner/hiera: remove registration_token dummy from hiera [puppet] - 
https://gerrit.wikimedia.org/r/807966 (owner: Jelto) [12:18:07] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:20] (CR) Jgiannelos: "Currently codfw is not in use after T306424 (all traffic is served from eqiad). This patch is required in order to bring codfw up to speed" [deployment-charts] - https://gerrit.wikimedia.org/r/807561 (https://phabricator.wikimedia.org/T305845) (owner: Jgiannelos) [12:19:07] (PS5) DDesouza: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) [12:21:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30004 and previous config saved to /var/cache/conftool/dbconfig/20220623-122112-root.json [12:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30005 and previous config saved to /var/cache/conftool/dbconfig/20220623-122118-root.json [12:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:28] (PS5) Jgiannelos: tegola: Point codfw to a new swift container [deployment-charts] - https://gerrit.wikimedia.org/r/807567 (https://phabricator.wikimedia.org/T307184) [12:25:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30006 and previous config saved to /var/cache/conftool/dbconfig/20220623-122539-root.json [12:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:46] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30007 and previous config saved to /var/cache/conftool/dbconfig/20220623-122545-root.json [12:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:57] !log installing waitress security updates [12:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:22] (PS1) Jbond: redfish: add a fqdn getter property and __str__ method [software/spicerack] - https://gerrit.wikimedia.org/r/807968 [12:30:24] (PS1) Jbond: redfish: add a generation property [software/spicerack] - https://gerrit.wikimedia.org/r/807969 [12:30:26] (PS1) Jbond: redfish: Add property for the HttpPushURI [software/spicerack] - https://gerrit.wikimedia.org/r/807970 [12:30:28] (PS1) Jbond: redfish: add wait for reboot function [software/spicerack] - https://gerrit.wikimedia.org/r/807971 [12:30:33] (PS1) Hashar: Boilerplate for automatic MediaWiki deployment [puppet] - https://gerrit.wikimedia.org/r/807972 (https://phabricator.wikimedia.org/T310395) [12:31:30] (CR) Jgiannelos: "This patch is part of the followups after the swift/tegola incident T306424. 
Moving forward codfw/eqiad:" [deployment-charts] - https://gerrit.wikimedia.org/r/807567 (https://phabricator.wikimedia.org/T307184) (owner: Jgiannelos) [12:33:02] (CR) CI reject: [V: -1] Boilerplate for automatic MediaWiki deployment [puppet] - https://gerrit.wikimedia.org/r/807972 (https://phabricator.wikimedia.org/T310395) (owner: Hashar) [12:33:25] (PS6) Jgiannelos: tegola: Point codfw to a new swift container [deployment-charts] - https://gerrit.wikimedia.org/r/807567 (https://phabricator.wikimedia.org/T307184) [12:33:57] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:36:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30008 and previous config saved to /var/cache/conftool/dbconfig/20220623-123616-root.json [12:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30009 and previous config saved to /var/cache/conftool/dbconfig/20220623-123621-root.json [12:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:28] (CR) CI reject: [V: -1] redfish: add a generation property [software/spicerack] - https://gerrit.wikimedia.org/r/807969 (owner: Jbond) [12:40:39] (CR) CI reject: [V: -1] redfish: Add property for the HttpPushURI [software/spicerack] - https://gerrit.wikimedia.org/r/807970 (owner: Jbond) [12:40:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30010 and previous config saved to 
/var/cache/conftool/dbconfig/20220623-124043-root.json [12:40:47] (CR) CI reject: [V: -1] redfish: add a fqdn getter property and __str__ method [software/spicerack] - https://gerrit.wikimedia.org/r/807968 (owner: Jbond) [12:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30011 and previous config saved to /var/cache/conftool/dbconfig/20220623-124049-root.json [12:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:55] (CR) CI reject: [V: -1] redfish: add wait for reboot function [software/spicerack] - https://gerrit.wikimedia.org/r/807971 (owner: Jbond) [12:45:24] (PS4) Labdajiwa: Add wordmark and tagline for jvwiki, jvwikt, and jvws [mediawiki-config] - https://gerrit.wikimedia.org/r/807247 (https://phabricator.wikimedia.org/T311104) [12:45:47] (PS2) Jbond: redfish: add a fqdn getter property and __str__ method [software/spicerack] - https://gerrit.wikimedia.org/r/807968 [12:45:49] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:51:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30012 and previous config saved to /var/cache/conftool/dbconfig/20220623-125120-root.json [12:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30013 and previous config saved to /var/cache/conftool/dbconfig/20220623-125125-root.json [12:51:29] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:21] (PS2) Hashar: Boilerplate for automatic MediaWiki deployment [puppet] - https://gerrit.wikimedia.org/r/807972 (https://phabricator.wikimedia.org/T310395) [12:54:34] (CR) Hashar: Boilerplate for automatic MediaWiki deployment (8 comments) [puppet] - https://gerrit.wikimedia.org/r/807972 (https://phabricator.wikimedia.org/T310395) (owner: Hashar) [12:55:02] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:55:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30014 and previous config saved to /var/cache/conftool/dbconfig/20220623-125547-root.json [12:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30015 and previous config saved to /var/cache/conftool/dbconfig/20220623-125553-root.json [12:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:39] (CR) Alexandros Kosiaris: [C: +1] tegola: Re-enable tile pregeneration on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/807561 (https://phabricator.wikimedia.org/T305845) (owner: Jgiannelos) [12:58:20] (CR) Alexandros Kosiaris: [C: +1] "LGTM. @Yannis, would you like help to deploy this or do you already know how?" 
[deployment-charts] - https://gerrit.wikimedia.org/r/807561 (https://phabricator.wikimedia.org/T305845) (owner: Jgiannelos) [12:58:51] (CR) Alexandros Kosiaris: [C: +1] tegola: Point codfw to a new swift container [deployment-charts] - https://gerrit.wikimedia.org/r/807567 (https://phabricator.wikimedia.org/T307184) (owner: Jgiannelos) [12:59:23] (CR) Jgiannelos: tegola: Re-enable tile pregeneration on codfw (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/807561 (https://phabricator.wikimedia.org/T305845) (owner: Jgiannelos) [13:00:03] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:00:04] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220623T1300) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220623T1300). [13:00:05] matthiasmullie, danisztls, and kuncung: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:14] (CR) Btullis: [C: +2] Increase the java heap for the Hadoop namenodes again [puppet] - https://gerrit.wikimedia.org/r/807925 (https://phabricator.wikimedia.org/T310293) (owner: Btullis) [13:00:17] o/ [13:00:31] I'm here [13:00:39] \o [13:01:20] (PS5) Ssingh: Release 9.1.2-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/807554 [13:01:53] (CR) Ssingh: Release 9.1.2-1wm1 (1 comment) [debs/trafficserver] - https://gerrit.wikimedia.org/r/807554 (owner: Ssingh) [13:02:03] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30016 and previous config saved to /var/cache/conftool/dbconfig/20220623-130624-root.json [13:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30017 and previous config saved to /var/cache/conftool/dbconfig/20220623-130629-root.json [13:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:41] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:07:55] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:58] I can go ahead and deploy my own patch [13:08:10] I can’t deploy today, sorry [13:08:20] (PS3) Matthias Mullie: [ImageSuggestions] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/807050 
(https://phabricator.wikimedia.org/T302711) [13:08:32] (CR) Matthias Mullie: [C: +2] [ImageSuggestions] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/807050 (https://phabricator.wikimedia.org/T302711) (owner: Matthias Mullie) [13:09:31] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:09:47] (Merged) jenkins-bot: [ImageSuggestions] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/807050 (https://phabricator.wikimedia.org/T302711) (owner: Matthias Mullie) [13:10:41] I can deploy in like 10 mins if no-one else is around [13:10:55] (CR) Ayounsi: [C: +2] Revert "Prometheus: temporarily disable the Netbox job" [puppet] - https://gerrit.wikimedia.org/r/807893 (owner: Ayounsi) [13:11:28] taavi: Yes please [13:13:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:13] matthiasmullie: ping me when you're done with your patch, please? 
[13:13:25] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:13:27] will do; syncing, almost done [13:14:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:14:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:01] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:09] !log mlitn@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:807050|[ImageSuggestions] Enable extension on ptwiki, ruwiki & idwiki (T302711)]] (duration: 03m 44s) [13:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:13] T302711: [M] Deploy ImageSuggestions - https://phabricator.wikimedia.org/T302711 [13:15:29] taavi: done, the floor is yours [13:15:47] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:17:10] (PS1) Slyngshede: class role::apt_repo switch apt-repo to Apache2, from nginx. 
[puppet] - https://gerrit.wikimedia.org/r/807983 [13:17:24] cool, thanks [13:17:46] (CR) CI reject: [V: -1] class role::apt_repo switch apt-repo to Apache2, from nginx. [puppet] - https://gerrit.wikimedia.org/r/807983 (owner: Slyngshede) [13:17:59] (CR) Jgiannelos: [C: +2] tegola: Point codfw to a new swift container [deployment-charts] - https://gerrit.wikimedia.org/r/807567 (https://phabricator.wikimedia.org/T307184) (owner: Jgiannelos) [13:18:05] (PS6) Majavah: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza) [13:18:40] (CR) Jgiannelos: [C: +2] tegola: Re-enable tile pregeneration on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/807561 (https://phabricator.wikimedia.org/T305845) (owner: Jgiannelos) [13:18:54] (PS2) Slyngshede: class role::apt_repo switch apt-repo to Apache2, from nginx. [puppet] - https://gerrit.wikimedia.org/r/807983 [13:19:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:12] (CR) Jbond: sre.hardware.dell: create new cookbook for updating idrac and bios (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/763215 (owner: Jbond) [13:20:25] (PS2) Jbond: redfish: add a generation property [software/spicerack] - https://gerrit.wikimedia.org/r/807969 [13:20:42] (CR) Majavah: [C: +2] QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza) [13:20:43] danisztls: your patch seems to be for the beta cluster only, so I'm going to just merge it, it'll take something like 15-30 mins for it to be automatically deployed (ping me if it does not) [13:20:58] taavi: thanks! 
[13:20:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:21:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30018 and previous config saved to /var/cache/conftool/dbconfig/20220623-132128-root.json [13:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30019 and previous config saved to /var/cache/conftool/dbconfig/20220623-132133-root.json [13:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:37] (Merged) jenkins-bot: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza) [13:22:10] (PS5) Majavah: Add wordmark and tagline for jvwiki, jvwikt, and jvws [mediawiki-config] - https://gerrit.wikimedia.org/r/807247 (https://phabricator.wikimedia.org/T311104) (owner: Labdajiwa) [13:22:40] (CR) Majavah: [C: +2] Add wordmark and tagline for jvwiki, jvwikt, and jvws [mediawiki-config] - https://gerrit.wikimedia.org/r/807247 (https://phabricator.wikimedia.org/T311104) (owner: Labdajiwa) [13:22:43] (Merged) jenkins-bot: tegola: Re-enable tile pregeneration on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/807561 (https://phabricator.wikimedia.org/T305845) (owner: Jgiannelos) [13:22:45] (Merged) jenkins-bot: tegola: Point codfw to a new swift container 
[deployment-charts] - https://gerrit.wikimedia.org/r/807567 (https://phabricator.wikimedia.org/T307184) (owner: Jgiannelos) [13:23:34] (Merged) jenkins-bot: Add wordmark and tagline for jvwiki, jvwikt, and jvws [mediawiki-config] - https://gerrit.wikimedia.org/r/807247 (https://phabricator.wikimedia.org/T311104) (owner: Labdajiwa) [13:23:36] kuncung: do you have the x-wikimedia-debug browser extension installed? [13:24:06] (CR) ArielGlenn: [C: -1] "Thanks for the work and the fixups, there's one fundamental change here that Hannah and I noticed, which is that in the current version of" [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [13:24:16] taavi: Yes, I have [13:24:29] ok, please test your change on mwdebug1001 [13:24:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:06] Okay. 
A moment please [13:25:13] sure [13:27:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1177 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30020 and previous config saved to /var/cache/conftool/dbconfig/20220623-132729-root.json [13:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:54] !log disable puppet on A:durum or A:wikidough or A:centrallog or A:dns-rec: deploying T310574 [13:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:58] T310574: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 [13:28:53] sukhe: [13:28:59] sukhe: was about to suggest: [13:29:02] https://www.irccloud.com/pastebin/FLGLdPp1/ [13:29:23] oh yeah sure, that's a better regex [13:29:29] I will add the authdns ones too [13:29:37] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1182 db1184 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30021 and previous config saved to /var/cache/conftool/dbconfig/20220623-132951-root.json [13:29:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:59] sukhe: I'll let you take care of disabling puppet [13:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:03] XioNoX: all done [13:30:08] cool [13:30:09] on the 40 hosts above [13:30:11] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:30:15] taavi: Checked all of them and they are 
looking good! :) [13:30:15] going to merge it now. OK? [13:30:22] kuncung: great! syncing them now [13:30:24] sukhe: godspeed! [13:30:43] * sukhe deep breath [13:30:46] (CR) Ssingh: [V: +1 C: +2] bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: Ssingh) [13:30:47] sukhe: which host will you deploy it on first? [13:30:51] durum1001 [13:30:53] eqiad [13:30:55] ok [13:30:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:30:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:02] taavi: Tysm [13:31:10] happy to help [13:31:37] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:31:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:17] Jun 23 13:32:56 durum1001 birdc[21007]: BIRD 2.0.7 ready. 
[13:33:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30022 and previous config saved to /var/cache/conftool/dbconfig/20220623-133358-root.json [13:33:59] we will need to remove bird6.conf manually since we didn't absent it but that's OK and not for now [13:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:06] yep [13:34:09] !log taavi@deploy1002 Synchronized static/images/mobile/copyright/: Config: [[gerrit:807247|Add wordmark and tagline for jvwiki, jvwikt, and jvws (T311104)]] (1/2) (duration: 03m 37s) [13:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:14] T311104: Add wordmark and tagline for Javanese Wikipedia, Wiktionary, and Wikisource - https://phabricator.wikimedia.org/T311104 [13:34:27] sukhe: both v4 and v6 is established on cr1, let me check more [13:34:47] XioNoX: thank you, looking good here as well. running knead-wikidough durum tests [13:35:06] sukhe: it's not advertising any prefixes [13:35:15] checking [13:35:18] SRE-tools, Discovery, Infrastructure-Foundations, Discovery-Search (Current work), IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (bking) Open→Resolved [13:35:24] (PS2) Jbond: redfish: Add property for the HttpPushURI [software/spicerack] - https://gerrit.wikimedia.org/r/807970 [13:35:41] yep, something's up. 
I am hitting codfw now [13:35:44] checking [13:36:01] Active prefixes: 0 [13:36:01] Received prefixes: 4 [13:36:01] Accepted prefixes: 4 [13:36:27] (03PS2) 10Volans: Rename cluster to ganeti_cluster [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807545 [13:36:29] (03PS1) 10Volans: reports.network: improve IPv6 AAAA records checks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807986 [13:36:38] Loaded: not-found (Reason: Unit anycast-healthchecker.service not found.) [13:36:41] ha [13:36:53] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:38:13] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:807247|Add wordmark and tagline for jvwiki, jvwikt, and jvws (T311104)]] (2/2) (duration: 03m 26s) [13:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:19] kuncung: ok, all done! [13:38:25] anyone have anything else to deploy? [13:38:30] sukhe: how is that possible? [13:38:45] like how did it get removed?
[13:38:55] yeah it doesn't make sense, the PCC also doesn't reflect that [13:38:59] but clearly, at least that's the issue [13:39:01] checking [13:39:23] PROBLEM - Host check.wikimedia-dns.org is DOWN: PING CRITICAL - Packet loss = 100% [13:39:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30023 and previous config saved to /var/cache/conftool/dbconfig/20220623-133928-root.json [13:39:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30024 and previous config saved to /var/cache/conftool/dbconfig/20220623-133931-root.json [13:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:35] sukhe: maybe the package is tied to it? [13:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:47] removing "bird" removed anycast-healthchecker [13:39:49] PROBLEM - Wikidough durum Check -IPv6- on durum2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Durum [13:40:01] PROBLEM - Wikidough durum Check -IPv6- on durum1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Durum [13:40:11] PROBLEM - Wikidough durum Check -IPv6- on durum4001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Durum [13:40:11] PROBLEM - Wikidough durum Check -IPv6- on durum5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Durum [13:40:19] PROBLEM - Wikidough durum Check -IPv6- on durum2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Durum [13:40:19] woh why [13:40:23] PROBLEM - Wikidough durum Check -IPv6- on durum1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Durum 
[13:40:24] we disabled puppet [13:40:27] PROBLEM - Wikidough durum Check -IPv6- on durum6001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Durum [13:40:35] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:41:16] anycast-healthchecker : Depends: bird but it is not going to be installed [13:41:21] PROBLEM - Wikidough durum Check -IPv6- on durum4002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Durum [13:41:21] PROBLEM - Wikidough durum Check -IPv6- on durum5002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Durum [13:41:21] PROBLEM - Wikidough durum Check -IPv6- on durum6002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Durum [13:41:31] PROBLEM - Wikidough durum Check -IPv6- on durum3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Durum [13:41:37] PROBLEM - Wikidough durum Check -IPv6- on durum3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Durum [13:41:43] sukhe: I confirm puppet is disabled [13:42:10] ok so [13:42:15] do you know what the check is checking exactly? [13:42:20] we will need to remove the anycast-hc dependency on bird [13:42:22] and update it to bird2 [13:42:48] sukhe: yep [13:42:59] XioNoX: the errors above? 
it's just a simple connectivity check [13:43:41] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:44:39] (03CR) 10Volans: [C: 03+2] Rename cluster to ganeti_cluster [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807545 (owner: 10Volans) [13:44:42] Hey taavi are you still running backports? [13:44:45] (03PS2) 10Volans: netbox::host: rename cluster to ganeti_cluster [puppet] - 10https://gerrit.wikimedia.org/r/807546 [13:45:03] (no worries if not) [13:45:11] Jdlrobson: you have a patch? [13:45:14] (03PS1) 10Jdlrobson: Skin: Change viewport based on feedback [core] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/807900 (https://phabricator.wikimedia.org/T311119) [13:45:19] yeh im up early and this is a train blocker^ [13:45:20] sukhe: the package stuff is not something we can fix right away, right? so we should rollback? [13:45:24] so I figured I'd try my luck. [13:45:24] yep [13:45:27] (03Merged) 10jenkins-bot: Rename cluster to ganeti_cluster [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807545 (owner: 10Volans) [13:45:31] jouncebot: next [13:45:31] In 2 hour(s) and 14 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220623T1600) [13:45:33] given that this seems to be the only issue so far [13:45:43] I think we should revert, fix this and then come back. we can't proceed anyway :) [13:45:51] yeah I'm trying to have a look at the icinga alert [13:45:59] we might go a bit overtime, but that should be fine since there's nothing afterwards [13:46:06] ok great! 
I'll stick it in the calendar [13:46:13] (03CR) 10Majavah: [C: 03+2] "deploying" [core] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/807900 (https://phabricator.wikimedia.org/T311119) (owner: 10Jdlrobson) [13:46:15] XioNoX: I am going to revert it for now and get to fixing it. agreed? [13:46:16] thanks! [13:46:43] sukhe: yep [13:46:51] cool! [13:47:17] (03PS1) 10Ssingh: Revert "bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations)" [puppet] - 10https://gerrit.wikimedia.org/r/807901 [13:47:27] ok so the check does check_tcp_ssl!2001:67c:930::2!443 [13:47:45] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:25] (03PS1) 10Jelto: gitlab_runner/hiera: make gitlab-runner-1003 a Trusted Runner [puppet] - 10https://gerrit.wikimedia.org/r/807987 [13:48:33] 10SRE, 10Wikimedia-Mailing-lists: WikiSound Audio Speaks Campaign - https://phabricator.wikimedia.org/T311230 (10Aklapper) > This is a new international campaign in meta Hi, could you please provide a link to that campaign? Are initiatives like https://meta.wikimedia.org/wiki/Wikispeech or https://meta.wikimed... 
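(Editor's sketch, not part of the channel log.) The check definition quoted above, check_tcp_ssl!2001:67c:930::2!443, reads as "command, host, port". A rough Python sketch of what such a TCP plus TLS connectivity probe does is given here. The function names parse_check and tcp_tls_check are hypothetical, the 10 second timeout is taken from the alert text, and skipping certificate validation is an assumption made only to keep the sketch about reachability; the real Icinga plugin may behave differently.

```python
import socket
import ssl
from typing import Tuple


def parse_check(definition: str) -> Tuple[str, str, int]:
    """Split an Icinga-style 'command!host!port' string into its parts."""
    command, host, port = definition.split("!")
    return command, host, int(port)


def tcp_tls_check(host: str, port: int, timeout: float = 10.0) -> str:
    """Open a TCP connection, complete a TLS handshake, and report the result."""
    # Assumption: certificate validation is skipped, since this sketch
    # only tests whether the anycast address answers at all.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw) as tls:
                return f"TCP OK - {tls.version()} on {host} port {port}"
    except socket.timeout:
        return f"CRITICAL - Socket timeout after {int(timeout)} seconds"
    except OSError as exc:
        return f"CRITICAL - {exc}"


if __name__ == "__main__":
    _, host, port = parse_check("check_tcp_ssl!2001:67c:930::2!443")
    print(tcp_tls_check(host, port))
```

Because the probed address is anycast, a timeout here says nothing about one specific host; it only says that no host is currently announcing the prefix to the prober, which matches the "not advertising any prefixes" observation earlier in the log.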
[13:48:39] (03CR) 10Ssingh: [C: 03+2] Revert "bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations)" [puppet] - 10https://gerrit.wikimedia.org/r/807901 (owner: 10Ssingh) [13:49:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30025 and previous config saved to /var/cache/conftool/dbconfig/20220623-134902-root.json [13:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:29] ACKNOWLEDGEMENT - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service Btullis T310293 restarted the systemd unit https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:53] 10SRE, 10Wikimedia-Mailing-lists: WikiSound Audio Speaks Campaign - https://phabricator.wikimedia.org/T311230 (10Aklapper) a:05Anasskoko→03None Removing assignee as I assume that you will not create the mailing list yourself. 
[13:49:59] RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:55] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:52:04] ^ expected will be fixed shortly [13:54:11] ok so we build the anycast-healthchecker Deb package [13:54:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30026 and previous config saved to /var/cache/conftool/dbconfig/20220623-135432-root.json [13:54:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30027 and previous config saved to /var/cache/conftool/dbconfig/20220623-135435-root.json [13:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:37] (03CR) 10Jelto: [C: 03+2] gitlab_runner/hiera: make gitlab-runner-1003 a Trusted Runner [puppet] - 10https://gerrit.wikimedia.org/r/807987 (owner: 10Jelto) [13:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:57] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:55:25] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:56:15] (03PS1) 10DDesouza: QuickSurveys: Enable extension on 'jawiki' on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807989 [13:56:28] sukhe: ok, I found the 
issue [13:56:40] sukhe: bgp session to durum1002 is down [13:56:49] PROBLEM - Bird Internet Routing Daemon on durum1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:56:57] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:08] XioNoX: yep, reverting the change led to some issues as expected, since we had bird2 and now reverting to bird [13:57:15] PROBLEM - Bird Internet Routing Daemon on durum1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:57:16] the issue you mean with anycast-hc or durum1002? [13:57:21] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:57:23] taavi: can you review/merge this one? extension wasn't enabled on the target wiki (beta) [13:57:28] sukhe: I mean did you push the change to durum1002 ? 
[13:57:29] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:57:33] yep [13:57:35] ah ok [13:57:36] I was trying to check [13:57:39] sorry for the confusion [13:58:01] re: the anycast-hc, the fix is pretty clear unless I am mistaken, we add Depends on bird2 instead of bird [13:58:04] nevermind then, there is another follow up we can discuss, related to icinga, but once everything is back to normal [13:58:04] and rebuild [13:58:06] (03PS2) 10Majavah: QuickSurveys: Enable extension on 'jawiki' on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807989 (owner: 10DDesouza) [13:58:07] !log import jenkins 2.346.1 to thirdparty/ci T311174 [13:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:12] T311174: Upgrade Jenkins to latest LTS 2.346.1 - https://phabricator.wikimedia.org/T311174 [13:58:16] sukhe: yep [13:58:31] (03CR) 10Majavah: [C: 03+2] QuickSurveys: Enable extension on 'jawiki' on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807989 (owner: 10DDesouza) [13:58:35] danisztls: sure [13:59:12] (03CR) 10Volans: [C: 03+2] "addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/807546 (owner: 10Volans) [13:59:14] (03Merged) 10jenkins-bot: QuickSurveys: Enable extension on 'jawiki' on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807989 (owner: 10DDesouza) [13:59:22] taavi: thanks! 
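(Editor's sketch, not part of the channel log.) The fix proposed above is to make anycast-healthchecker depend on bird2 instead of bird and rebuild the package. A minimal Python sketch of that rewrite on a Debian Depends value follows. The helper name swap_dependency and the sample Depends string are hypothetical and are not copied from the real debian/control file; the point is only that the package name has to be matched exactly, so that "bird" is replaced while an existing "bird2" entry is left alone.

```python
import re

# A Debian package name: lowercase alphanumerics plus "+", "-" and "."
_NAME = re.compile(r"([a-z0-9][a-z0-9+.\-]*)(.*)", re.S)


def swap_dependency(depends: str, old: str, new: str) -> str:
    """Replace one package name in a Depends value, keeping version constraints."""
    entries = []
    for entry in depends.split(","):
        alternatives = []
        # Each comma-separated entry may list alternatives separated by "|"
        for alt in entry.split("|"):
            alt = alt.strip()
            match = _NAME.match(alt)
            # Substitution variables such as ${misc:Depends} do not match and pass through
            if match and match.group(1) == old:
                alt = new + match.group(2)
            alternatives.append(alt)
        entries.append(" | ".join(alternatives))
    return ", ".join(entries)


if __name__ == "__main__":
    # Hypothetical Depends line, for illustration only
    print(swap_dependency("${misc:Depends}, python3, bird (>= 1.6)", "bird", "bird2"))
```

After such a change the package still has to be rebuilt and uploaded before apt can install it alongside bird2, which is why the revert was done first.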
[14:00:22] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update locations - volans@cumin1001" [14:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:28] !log volans@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Update locations - volans@cumin1001" [14:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:21] (03Merged) 10jenkins-bot: Skin: Change viewport based on feedback [core] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/807900 (https://phabricator.wikimedia.org/T311119) (owner: 10Jdlrobson) [14:02:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:18] Jdlrobson: can you test on mwdebug1001 please? [14:02:29] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update locations - volans@cumin1001" [14:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:35] !log volans@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Update locations - volans@cumin1001" [14:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:03:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:18] XioNoX: I am going to re-enable Puppet on the other hosts for now [14:03:30] no issues with reverting, unless the change was applied [14:03:38] and then we can go and fix anycast-hc. sounds fine?
[14:03:42] (03CR) 10Muehlenhoff: [C: 03+2] vrts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:03:52] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:03:53] sukhe: sounds good, do you want to run it manually on a host just in case? [14:03:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:05] re-enabled puppet manually I mean [14:04:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30028 and previous config saved to /var/cache/conftool/dbconfig/20220623-140406-root.json [14:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:12] Jdlrobson: {{ping}} [14:04:17] XioNoX: yep, I tried with doh2001 too [14:04:23] let me try one more, just to be extra sure [14:04:42] great [14:05:18] https://github.com/unixsurfer/anycast_healthchecker/blob/master/debian/control#L33 [14:05:37] yeah ha [14:05:45] quite the oversight :) [14:05:59] taavi: all good here [14:06:01] sukhe: https://github.com/unixsurfer/anycast_healthchecker/issues/29#issue-1049577488 [14:06:07] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:06:24] great, syncing [14:06:31] thank you!
[14:06:32] :] [14:07:47] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:08:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30029 and previous config saved to /var/cache/conftool/dbconfig/20220623-140936-root.json [14:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30030 and previous config saved to /var/cache/conftool/dbconfig/20220623-140939-root.json [14:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:09:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:56] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update locations - volans@cumin1001" [14:09:56] !log volans@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Update locations - volans@cumin1001" [14:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:01] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.17/includes/skins/Skin.php: Backport: [[gerrit:807900|Skin: Change viewport based on feedback (T311119)]] (duration: 03m 29s) 
[14:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:07] T311119: All Wikimedia projects other than Wikipedia, Test Wikipedia, and Wikinews extremely zoomed out while using Vector Skin on iPad. - https://phabricator.wikimedia.org/T311119 [14:10:10] done! [14:11:17] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:21] thanks taavi for fitting me in! [14:11:31] no worries [14:12:47] 10SRE, 10Wikimedia-Mailing-lists: WikiSound Audio Speaks Campaign - https://phabricator.wikimedia.org/T311230 (10Anasskoko) Hello Aklapper, Thank you for responding so quickly. The answer is No, this campaign is entirely not The same with WikiSpeech or IPA Audio render. Below is the link for the campaign. I... 
[14:12:55] (03PS1) 10Jelto: gitlab_runner/hiera: change docker volume size in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/807991 [14:13:07] PROBLEM - DPKG on durum1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:13:13] ^ fixing [14:13:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:03] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:15:41] (03CR) 10Jelto: [C: 03+2] gitlab_runner/hiera: change docker volume size in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/807991 (owner: 10Jelto) [14:17:51] (03CR) 10Volans: [C: 03+1] "LGTM, nits inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 (owner: 10Jbond) [14:17:59] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_mlserve:prod.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30031 and previous config saved to /var/cache/conftool/dbconfig/20220623-141910-root.json [14:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:27] RECOVERY - Wikidough durum Check -IPv6- on durum5002 is OK: TCP OK - 0.005 second response time on 2001:67c:930::2 port 443 https://wikitech.wikimedia.org/wiki/Durum [14:19:27] 
RECOVERY - Wikidough durum Check -IPv6- on durum4002 is OK: TCP OK - 0.005 second response time on 2001:67c:930::2 port 443 https://wikitech.wikimedia.org/wiki/Durum [14:19:27] RECOVERY - Wikidough durum Check -IPv6- on durum6002 is OK: TCP OK - 0.006 second response time on 2001:67c:930::2 port 443 https://wikitech.wikimedia.org/wiki/Durum [14:19:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:19:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:37] RECOVERY - Wikidough durum Check -IPv6- on durum3001 is OK: TCP OK - 0.006 second response time on 2001:67c:930::2 port 443 https://wikitech.wikimedia.org/wiki/Durum [14:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:45] RECOVERY - Wikidough durum Check -IPv6- on durum3002 is OK: TCP OK - 0.008 second response time on 2001:67c:930::2 port 443 https://wikitech.wikimedia.org/wiki/Durum [14:20:15] RECOVERY - Wikidough durum Check -IPv6- on durum2002 is OK: TCP OK - 0.007 second response time on 2001:67c:930::2 port 443 https://wikitech.wikimedia.org/wiki/Durum [14:20:23] RECOVERY - Wikidough durum Check -IPv6- on durum1002 is OK: TCP OK - 0.014 second response time on 2001:67c:930::2 port 443 https://wikitech.wikimedia.org/wiki/Durum [14:20:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:31] RECOVERY - Wikidough durum Check -IPv6- on durum4001 is OK: TCP OK - 0.005 second response time on 2001:67c:930::2 port 443 https://wikitech.wikimedia.org/wiki/Durum [14:20:31] RECOVERY - Wikidough durum Check -IPv6- on durum5001 is OK: TCP OK - 0.005 second response time on 2001:67c:930::2 port 443 https://wikitech.wikimedia.org/wiki/Durum [14:20:37] RECOVERY - Wikidough durum 
Check -IPv6- on durum2001 is OK: TCP OK - 0.007 second response time on 2001:67c:930::2 port 443 https://wikitech.wikimedia.org/wiki/Durum [14:20:43] RECOVERY - Wikidough durum Check -IPv6- on durum1001 is OK: TCP OK - 0.008 second response time on 2001:67c:930::2 port 443 https://wikitech.wikimedia.org/wiki/Durum [14:20:49] RECOVERY - Wikidough durum Check -IPv6- on durum6001 is OK: TCP OK - 0.005 second response time on 2001:67c:930::2 port 443 https://wikitech.wikimedia.org/wiki/Durum [14:21:45] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:23:48] (03PS2) 10Jbond: redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 [14:24:01] RECOVERY - Bird Internet Routing Daemon on durum1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:24:11] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:24:13] RECOVERY - Host check.wikimedia-dns.org is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [14:24:19] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:24:35] (03PS36) 10Jbond: sre.hardware.dell: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [14:24:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30032 and previous config saved to /var/cache/conftool/dbconfig/20220623-142440-root.json [14:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:43] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'db1184 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30033 and previous config saved to /var/cache/conftool/dbconfig/20220623-142443-root.json [14:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:49] RECOVERY - Bird Internet Routing Daemon on durum1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:25:57] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:26:13] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:27:10] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Upgrade IDPs to CAS 6.5/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518 (10MoritzMuehlenhoff) [14:27:12] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (10MoritzMuehlenhoff) [14:28:02] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Update CAS to 6.5 - https://phabricator.wikimedia.org/T311235 (10MoritzMuehlenhoff) [14:29:10] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Update CAS to 6.5 - https://phabricator.wikimedia.org/T311235 (10MoritzMuehlenhoff) cas 6.5.5 has been built and uploaded to apt.wikimedia.org. It's currently installed on idp-test.wikimedia.org and functionality is working fine. The WMF-specific theming needs... 
[14:29:56] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Enable webauthn in CAS to replace U2F - https://phabricator.wikimedia.org/T311236 (10MoritzMuehlenhoff) [14:30:22] (03PS1) 10Muehlenhoff: modify-mfa: Also allow mfa-webauthn [puppet] - 10https://gerrit.wikimedia.org/r/807994 (https://phabricator.wikimedia.org/T311236) [14:30:29] (03PS1) 10Ottomata: eventlogging - fix eventlogging_schemas_disabled list in plugins.py [puppet] - 10https://gerrit.wikimedia.org/r/807995 [14:30:40] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update locations - volans@cumin1001" [14:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:20] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update locations - volans@cumin1001" [14:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Cmjohnson) @RobH this may be a controller issue, the servers were able to go through the installation without any issue, after the install, th... 
[14:32:48] (03CR) 10Ottomata: [C: 03+2] eventlogging - fix eventlogging_schemas_disabled list in plugins.py [puppet] - 10https://gerrit.wikimedia.org/r/807995 (owner: 10Ottomata) [14:34:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30034 and previous config saved to /var/cache/conftool/dbconfig/20220623-143414-root.json [14:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:33] puppet broken on VMs it's me, fixing, sorry about that [14:34:49] ahhh okay was about to ask [14:34:50] (03PS1) 10Jbond: netbox: fix type definitions [puppet] - 10https://gerrit.wikimedia.org/r/807996 [14:34:51] ty [14:34:56] !log on going PDU maintenance in rack A3 codfw [14:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:32] (03PS1) 10Volans: Netbox location: fix naming [puppet] - 10https://gerrit.wikimedia.org/r/807997 [14:35:35] (03CR) 10Volans: [C: 03+2] netbox: fix type definitions [puppet] - 10https://gerrit.wikimedia.org/r/807996 (owner: 10Jbond) [14:35:56] (03CR) 10CI reject: [V: 04-1] Netbox location: fix naming [puppet] - 10https://gerrit.wikimedia.org/r/807997 (owner: 10Volans) [14:36:39] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:37:23] (03PS1) 10Ssingh: package_builder: add python3-pbr (anycast-healthchecker build) [puppet] - 10https://gerrit.wikimedia.org/r/807998 [14:37:31] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:38:25] (03PS1) 10Volans: Netbox type: fix data type [puppet] - 10https://gerrit.wikimedia.org/r/807999 [14:38:27] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.02327 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:38:44] (03CR) 10Volans: [C: 03+2] Netbox type: fix data type [puppet] - 10https://gerrit.wikimedia.org/r/807999 (owner: 10Volans) [14:39:01] widespread failures is also me... fix coming [14:39:35] (03CR) 10Muehlenhoff: [C: 03+2] modify-mfa: Also allow mfa-webauthn [puppet] - 10https://gerrit.wikimedia.org/r/807994 (https://phabricator.wikimedia.org/T311236) (owner: 10Muehlenhoff) [14:39:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30035 and previous config saved to /var/cache/conftool/dbconfig/20220623-143944-root.json [14:39:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30036 and previous config saved to /var/cache/conftool/dbconfig/20220623-143946-root.json [14:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:18] (03CR) 10Ssingh: "This was on build2001.codfw.wmnet FWIW if it matters!" 
[puppet] - 10https://gerrit.wikimedia.org/r/807998 (owner: 10Ssingh) [14:41:36] puppet is fixed, I'll fix the motd later on [14:41:41] to be more precise [14:43:53] ty [14:44:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10Cmjohnson) @BTullis I don't have any real guidance for you other than all disks are controlled by the raid controller. Partman recipes are not a specialty of mine. pinging @robh... [14:44:15] (03CR) 10Majavah: [C: 03+2] Remove stretch support [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/807184 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [14:44:29] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:45:28] (03Merged) 10jenkins-bot: Remove stretch support [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/807184 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [14:47:26] (03PS1) 10Lucas Werkmeister (WMDE): Do not re-use "wikibase_config" for registering the language selector... 
[extensions/WikibaseCirrusSearch] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/807902 (https://phabricator.wikimedia.org/T307869) [14:47:45] jouncebot: now [14:47:45] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [14:47:53] (Memory over 85%) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Memory over 85% - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25 [14:48:07] I’ll deploy that backport ^ now, it shouldn’t have any effect yet (fyi dcausse) [14:48:14] then we can retry the corresponding config change next week [14:48:46] o/ [14:49:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30037 and previous config saved to /var/cache/conftool/dbconfig/20220623-144918-root.json [14:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:27] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Do not re-use "wikibase_config" for registering the language selector... [extensions/WikibaseCirrusSearch] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/807902 (https://phabricator.wikimedia.org/T307869) (owner: 10Lucas Werkmeister (WMDE)) [14:53:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [14:53:47] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list for WikiSound Audio Speaks Campaign - https://phabricator.wikimedia.org/T311230 (10Aklapper) [14:54:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30038 and previous config saved to /var/cache/conftool/dbconfig/20220623-145448-root.json [14:54:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30039 and previous config saved to /var/cache/conftool/dbconfig/20220623-145450-root.json [14:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:17] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:55:45] (03PS1) 10Majavah: d/changelog: add changelog entry for 0.87 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/808005 [14:56:09] (03CR) 10Majavah: [C: 03+2] d/changelog: add changelog entry for 0.87 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/808005 (owner: 10Majavah) [14:57:24] (03Merged) 10jenkins-bot: d/changelog: add changelog entry for 0.87 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/808005 (owner: 10Majavah) [14:58:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [14:59:23] PROBLEM - IPMI Sensor Status on mw2292 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:59:23] PROBLEM - IPMI Sensor Status on mw2396 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:01:40] (03CR) 10Joal: eventlogging - fix eventlogging_schemas_disabled list in plugins.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807995 (owner: 10Ottomata) [15:02:53] (Memory over 85%) resolved: Device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet recovered from Memory over 85% - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25 [15:03:04] jouncebot: now [15:03:04] No deployments scheduled for the next 0 hour(s) and 56 minute(s) [15:04:21] PROBLEM - IPMI Sensor Status on mw2393 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:04:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30040 and previous config saved to /var/cache/conftool/dbconfig/20220623-150422-root.json [15:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:34] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Enable webauthn in CAS to replace U2F - https://phabricator.wikimedia.org/T311236 
(10MoritzMuehlenhoff) Status update: With a hacked-up config on idp-test.w.o and when configuring a user to pass mfa-webauthn to the Groovy script I'm gettin... [15:04:43] PROBLEM - IPMI Sensor Status on mw2298 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:05:46] (03Merged) 10jenkins-bot: Do not re-use "wikibase_config" for registering the language selector... [extensions/WikibaseCirrusSearch] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/807902 (https://phabricator.wikimedia.org/T307869) (owner: 10Lucas Werkmeister (WMDE)) [15:06:33] WikibaseCirrusSearch change is on mwdebug1001, testing it a bit [15:06:55] I will restart the CI Jenkins once you are done ;) [15:07:39] PROBLEM - IPMI Sensor Status on mw2297 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:08:07] PROBLEM - puppet last run on labweb1002 is CRITICAL: CRITICAL: Puppet has been disabled for 605111 seconds, message: Andrew keeping labtestwikitech switched off until we can safely restart it., last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:08:11] PROBLEM - IPMI Sensor Status on mw2397 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:08:43] PROBLEM - IPMI Sensor Status on mw2296 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures 
[15:09:29] PROBLEM - IPMI Sensor Status on es2020 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:09:52] papaul: ^ anything being done? [15:09:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30041 and previous config saved to /var/cache/conftool/dbconfig/20220623-150951-root.json [15:09:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30042 and previous config saved to /var/cache/conftool/dbconfig/20220623-150954-root.json [15:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:22] (03PS1) 10Muehlenhoff: Extend access for tandic [puppet] - 10https://gerrit.wikimedia.org/r/808010 [15:10:50] papaul: they all belong to A3 [15:10:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (10Jclark-ctr) conf1007 A1 U17 port17 cableid2907 conf1008 B1 U22 port32 cableid2013339101789 conf1009 D3 U40 port45 cableid23000027 [15:10:57] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:11:09] PROBLEM - IPMI Sensor Status on db2089 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:11:16] !log mwdebug-deploy@deploy1002 helmfile 
[eqiad] START helmfile.d/services/mwdebug: apply [15:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:26] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.17/extensions/WikibaseCirrusSearch/src/Hooks.php: Backport: [[gerrit:807902|Do not re-use "wikibase_config" for registering the language selector... (T307869)]] (duration: 03m 22s) [15:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:31] T307869: Request for new search profile for Wikidata that boosts Items for languages - https://phabricator.wikimedia.org/T307869 [15:11:46] hashar: go ahead (as far as I’m concerned, at least) [15:12:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:12:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/807998 (owner: 10Ssingh) [15:13:32] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for tandic [puppet] - 10https://gerrit.wikimedia.org/r/808010 (owner: 10Muehlenhoff) [15:13:41] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:15:10] Lucas_WMDE: thanks ) [15:15:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:53] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0005058 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:17:27] PROBLEM - IPMI Sensor Status on mw2291 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:17:28] !log Upgrading CI Jenkins # T311174 [15:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:33] T311174: Upgrade Jenkins to latest LTS 2.346.1 - https://phabricator.wikimedia.org/T311174 [15:17:57] PROBLEM - IPMI Sensor Status on mw2294 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:18:08] (03PS1) 10Lucas Werkmeister (WMDE): [cirrus] Add a custom profile for the wikibase language selector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808011 (https://phabricator.wikimedia.org/T307869) [15:19:21] 10ops-codfw: codfw: A3 hosts reporting Power_Supply Status: Critical 
[Status = Critical, PS Redundancy = Critical] - https://phabricator.wikimedia.org/T311245 (10Marostegui) [15:20:04] 10ops-codfw: codfw: A3 hosts reporting Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] - https://phabricator.wikimedia.org/T311245 (10Marostegui) 05Open→03Invalid This was due to a planned maintenance https://phabricator.wikimedia.org/T309957 [15:20:30] (03CR) 10Jbond: [C: 03+1] reports.network: improve IPv6 AAAA records checks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807986 (owner: 10Volans) [15:20:51] (03CR) 10Lucas Werkmeister (WMDE): "Scheduled for Monday. Should not be deployed before the wmf.17 train is fully rolled out, because I didn’t bother backporting the fix to w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808011 (https://phabricator.wikimedia.org/T307869) (owner: 10Lucas Werkmeister (WMDE)) [15:21:02] (03PS3) 10Jbond: redfish: add a fqdn getter property and __str__ method [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 [15:21:08] (03CR) 10Majavah: [C: 03+1] "I've been testing this for a few days in the toolsbeta cloud vps project since it makes https://phabricator.wikimedia.org/T284767 much eas" [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [15:21:27] (03CR) 10Ssingh: [C: 03+2] package_builder: add python3-pbr (anycast-healthchecker build) [puppet] - 10https://gerrit.wikimedia.org/r/807998 (owner: 10Ssingh) [15:22:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (10Jclark-ctr) [15:22:40] PROBLEM - IPMI Sensor Status on mw2299 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:22:48] PROBLEM - IPMI Sensor Status on mw2295 is CRITICAL: Sensor Type(s) Temperature, Power_Supply 
Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:23:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [15:23:30] PROBLEM - Juniper alarms on asw-a-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:24:46] (03CR) 10Jbond: redfish: add a fqdn getter property and __str__ method (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 (owner: 10Jbond) [15:25:49] (03PS1) 10JMeybohm: kubernetes::master: Double the apiserver latency thresholds [puppet] - 10https://gerrit.wikimedia.org/r/808012 (https://phabricator.wikimedia.org/T310714) [15:25:56] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:26:23] (03PS1) 10Jcrespo: Add new script delete-media-file to delete backed up files [software/mediabackups] - 10https://gerrit.wikimedia.org/r/808013 (https://phabricator.wikimedia.org/T311215) [15:26:58] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10Eevans) >>! In T304891#8022090, @kostajh wrote: > @lbowmaker @hnowlan does this service have a page on Wikitech... 
[15:27:10] PROBLEM - IPMI Sensor Status on mw2399 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:28:02] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10Jclark-ctr) [15:28:12] RECOVERY - Juniper alarms on asw-a-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:28:26] 10SRE, 10Platform Team Workboards (Green): Install wrk, siege and lua-cjson packages on deploy1001 - https://phabricator.wikimedia.org/T230178 (10Eevans) >>! In T230178#8022665, @hashar wrote: > Looks like we still have the packages installed, if the benchmarking is no more needed maybe they can be removed? >... [15:28:34] PROBLEM - IPMI Sensor Status on mw2400 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:29:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] kubernetes::master: Double the apiserver latency thresholds [puppet] - 10https://gerrit.wikimedia.org/r/808012 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [15:29:07] (03PS2) 10Jcrespo: Add new script delete-media-file to delete backed up files [software/mediabackups] - 10https://gerrit.wikimedia.org/r/808013 (https://phabricator.wikimedia.org/T311215) [15:31:04] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:31:07] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Enable webauthn in CAS to replace U2F - 
https://phabricator.wikimedia.org/T311236 (10jbond) >but that bails out with a bean error related to the fasterxml parser, Wonder if this is related to the [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/sof... [15:32:04] (03CR) 10JMeybohm: [C: 03+2] kubernetes::master: Double the apiserver latency thresholds [puppet] - 10https://gerrit.wikimedia.org/r/808012 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [15:34:04] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:35:35] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Enable webauthn in CAS to replace U2F - https://phabricator.wikimedia.org/T311236 (10MoritzMuehlenhoff) >>! In T311236#8023420, @jbond wrote: >>but that bails out with a bean error related to the fasterxml parser, > Wonder if this is related to the [[ https://... [15:39:10] PROBLEM - IPMI Sensor Status on mw2293 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:40:14] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:40:32] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) Updated SG3 to remove the loopback and return the circuit to service, sent email reply to Arelion support thread to request next steps since the cross connection tested fine. 
[15:41:10] RECOVERY - IPMI Sensor Status on db2089 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:42:48] (Memory over 85%) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Memory over 85% - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25 [15:44:30] PDU swap complete in rack A3 codfw. If you see servers still having issues, please ping me, thanks [15:47:32] (03PS1) 10JMeybohm: helm-state-metrics: Resource headroom for bigger clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/808019 (https://phabricator.wikimedia.org/T310714) [15:47:58] RECOVERY - IPMI Sensor Status on mw2291 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:48:17] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] helm-state-metrics: Resource headroom for bigger clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/808019 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [15:48:24] RECOVERY - IPMI Sensor Status on mw2294 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:51:24] (03Merged) 10jenkins-bot: helm-state-metrics: Resource headroom for bigger clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/808019 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [15:52:21] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:52:59] RECOVERY - IPMI Sensor Status on mw2299 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:53:01] RECOVERY - IPMI Sensor Status on mw2397 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:53:01] RECOVERY - IPMI Sensor Status on mw2295 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:54:12] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:28] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:21] RECOVERY - IPMI Sensor Status on mw2399 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:57:46] (03CR) 10Ottomata: [C: 03+2] eventlogging - fix eventlogging_schemas_disabled list in plugins.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807995 (owner: 10Ottomata) [15:58:49] RECOVERY - IPMI Sensor Status on mw2400 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:59:07] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[15:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:23] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:32] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:01] RECOVERY - IPMI Sensor Status on mw2292 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:00:01] RECOVERY - IPMI Sensor Status on mw2396 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:00:05] jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220623T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:30] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:48] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[16:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:23] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:03:39] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:03:56] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:21] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:05:17] (03PS1) 10Papaul: Add new model to ps1-a3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/808023 (https://phabricator.wikimedia.org/T309957) [16:05:29] RECOVERY - IPMI Sensor Status on mw2393 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:05:45] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[16:05:47] RECOVERY - IPMI Sensor Status on mw2298 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:51] (03CR) 10CI reject: [V: 04-1] Add new model to ps1-a3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/808023 (https://phabricator.wikimedia.org/T309957) (owner: 10Papaul) [16:07:32] (03CR) 10Cwhite: [C: 03+2] opensearch: disable compatibility mode [puppet] - 10https://gerrit.wikimedia.org/r/803588 (https://phabricator.wikimedia.org/T301017) (owner: 10Cwhite) [16:07:35] 10SRE, 10MediaWiki-General, 10Traffic: Advance declaration of query parameters - https://phabricator.wikimedia.org/T310087 (10ori) @Krinkle AIUI the OAuth 1 spec stipulates that parameters be normalized prior to computing a signature, so that should be OK. Not sure about 2.0. [16:08:11] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:11] RECOVERY - IPMI Sensor Status on mw2297 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:10:05] RECOVERY - IPMI Sensor Status on mw2293 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:10:05] RECOVERY - IPMI Sensor Status on mw2296 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:10:47] RECOVERY - IPMI Sensor Status on es2020 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:11:33] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:12:49] (03PS1) 10Ottomata: eventlogging - Fix another comma typo in plugins.py [puppet] - 10https://gerrit.wikimedia.org/r/808026 [16:13:08] (03PS2) 10Ottomata: eventlogging - Fix another comma typo in plugins.py [puppet] - 10https://gerrit.wikimedia.org/r/808026 [16:15:00] (03PS2) 10Papaul: Add new model to ps1-a3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/808023 (https://phabricator.wikimedia.org/T309957) [16:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:15:36] (03CR) 10CI reject: [V: 04-1] Add new model to ps1-a3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/808023 (https://phabricator.wikimedia.org/T309957) (owner: 10Papaul) [16:16:17] RECOVERY - Check systemd state on thanos-fe1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:21] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:19:30] (03CR) 10Ottomata: [C: 03+2] eventlogging - Fix another comma typo in plugins.py [puppet] - 10https://gerrit.wikimedia.org/r/808026 (owner: 10Ottomata) [16:20:54] (03CR) 10Ottomata: [C: 03+2] eventlogging - fix eventlogging_schemas_disabled list in plugins.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807995 (owner: 10Ottomata) [16:21:21] 
RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:55] (Device rebooted) firing: Alert for device ps1-a3-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [16:23:16] (03PS1) 10JMeybohm: helm-state-metrics: Prevent heavy throttling [deployment-charts] - 10https://gerrit.wikimedia.org/r/808028 (https://phabricator.wikimedia.org/T310714) [16:24:49] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] helm-state-metrics: Prevent heavy throttling [deployment-charts] - 10https://gerrit.wikimedia.org/r/808028 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [16:25:06] (03PS3) 10Papaul: Add new model to ps1-a3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/808023 (https://phabricator.wikimedia.org/T309957) [16:26:20] (03CR) 10Papaul: [C: 03+2] Add new model to ps1-a3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/808023 (https://phabricator.wikimedia.org/T309957) (owner: 10Papaul) [16:27:55] (Device rebooted) resolved: Device ps1-a3-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [16:30:55] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:09] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:16] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:37] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[16:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:59] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:12] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:19] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:32] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:41] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:45:51] (03CR) 10Ayounsi: "Small steps towards v6 consistency, nice :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807986 (owner: 10Volans) [16:47:03] (03CR) 10Jdlrobson: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [16:48:21] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:53:33] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [16:55:02] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:55:37] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [16:56:45] (03PS1) 10DDesouza: QuickSurveys (beta): Fix typo and deploy to 'enwiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808035 (https://phabricator.wikimedia.org/T311015) [16:57:13] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:58:23] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [17:01:48] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [17:02:59] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [17:04:07] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:10:03] (03PS1) 10Herron: swift: update ephemeral port range from 1024-65535 to 10240-65535 [puppet] - 10https://gerrit.wikimedia.org/r/808040 [17:10:41] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:14:29] (03PS1) 10Ssingh: Revert "durum: add monitoring::service for the check service" [puppet] - 10https://gerrit.wikimedia.org/r/807904 [17:18:07] 10SRE-swift-storage, 10SRE Observability: swift hosts (thanos-fe1001, ms-be2012) with failed prometheus-ipmi-exporter services - https://phabricator.wikimedia.org/T311262 (10herron) p:05Triage→03Medium [17:18:31] 10SRE-swift-storage, 10SRE Observability: swift hosts (thanos-fe1001, ms-be2012) with failed prometheus-ipmi-exporter services - https://phabricator.wikimedia.org/T311262 (10herron) Looks like we customize the ephemeral port range on the swift hosts to 1024-65535, maybe we can push up the range swift-proxy cho... 
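Note on the ephemeral port range change above (808040, T311262): the task comment is truncated, so what follows is only one plausible reading. If the kernel may hand out local ports starting at 1024, an outgoing connection can occupy a port that a service later wants to listen on. Raising the lower bound to 10240 takes low-numbered service ports out of that pool. A minimal sketch of the arithmetic, where the listener port 9290 is a hypothetical example and is not taken from the log:

```python
# Ranges quoted in the patch title: "from 1024-65535 to 10240-65535".
OLD_RANGE = (1024, 65535)
NEW_RANGE = (10240, 65535)

# Hypothetical listener port, chosen only for illustration.
HYPOTHETICAL_LISTEN_PORT = 9290


def may_collide(port, port_range):
    """Return True if `port` lies inside the inclusive ephemeral range."""
    low, high = port_range
    return low <= port <= high


if __name__ == "__main__":
    for name, port_range in (("old", OLD_RANGE), ("new", NEW_RANGE)):
        print(name, may_collide(HYPOTHETICAL_LISTEN_PORT, port_range))
```

With these assumed values the port falls inside the old range and outside the new one. Whether this is what actually broke the exporter services is not confirmed by the log.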
[17:19:07] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:22] (03PS2) 10Herron: swift: update ephemeral port range from 1024-65535 to 10240-65535 [puppet] - 10https://gerrit.wikimedia.org/r/808040 (https://phabricator.wikimedia.org/T311262) [17:21:06] (03CR) 10CI reject: [V: 04-1] Revert "durum: add monitoring::service for the check service" [puppet] - 10https://gerrit.wikimedia.org/r/807904 (owner: 10Ssingh) [17:22:02] (03PS2) 10Ssingh: Revert "durum: add monitoring::service for the check service" [puppet] - 10https://gerrit.wikimedia.org/r/807904 [17:31:45] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 (10ssingh) During the upgrade to bird2 today, the bird side of things seems to have caused no issues. The bird2 service started successfully and the configuration file was correct. Howeve... 
[17:32:31] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:32:33] (03CR) 10Ssingh: [C: 03+2] Revert "durum: add monitoring::service for the check service" [puppet] - 10https://gerrit.wikimedia.org/r/807904 (owner: 10Ssingh) [17:37:53] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:53] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:59] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:13] (03PS1) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/808043 (https://phabricator.wikimedia.org/T310574) [17:47:43] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36018/console" [puppet] - 10https://gerrit.wikimedia.org/r/808043 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [17:48:00] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [17:49:33] 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) [17:50:01] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Papaul) [17:50:41] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:02] !log cmjohnson@cumin1001 START - 
Cookbook sre.dns.netbox [17:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:42] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye [17:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:46] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye [17:54:28] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:56:01] (03CR) 10Ssingh: [V: 03+1] "This change was copied from the earlier commit that we merged and later reverted (805874) with one kye change as pointed out by Arzhel tha" [puppet] - 10https://gerrit.wikimedia.org/r/808043 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [17:56:19] (03CR) 10Ssingh: [V: 03+1] "Ready for review but to be merged next week." [puppet] - 10https://gerrit.wikimedia.org/r/808043 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [17:57:09] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1016.mgmt.eqiad.wmnet with reboot policy FORCED [17:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:27] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1016.mgmt.eqiad.wmnet with reboot policy FORCED [17:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:31] back [17:57:37] brennen: I am around ;] [17:57:48] 10SRE, 10Infrastructure-Foundations, 10netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10Dzahn) Thank you for the examples. 
That makes sense to me. Especially if Dell advises to keep them secret. [17:58:03] (03PS1) 10Majavah: openstack: fix tools spreadcheck [puppet] - 10https://gerrit.wikimedia.org/r/808044 [18:00:05] hashar and brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220623T1800). [18:00:39] (03PS1) 10Eigyan: [wmf-config]: Deploy GDI Survey to PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808045 (https://phabricator.wikimedia.org/T311261) [18:00:47] o/ [18:01:06] hashar: i will go ahead and roll forward [18:01:11] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1007.eqiad.wmnet with OS bullseye [18:01:13] \o/ [18:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:15] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1007 (**FAIL**) - Downtimed on Icinga/Alertmanage... 
[18:01:28] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:01:58] !log train 1.39.0-wmf.17 (T308070): no current blockers - rolling to all wikis [18:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:03] T308070: 1.39.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T308070 [18:02:35] (03PS1) 10Brennen Bearnes: all wikis to 1.39.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808046 (https://phabricator.wikimedia.org/T308070) [18:02:35] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (10Dzahn) Thank you @BTullis for all the details. Now I know what DSE means. If they doc could be public,... [18:02:38] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.39.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808046 (https://phabricator.wikimedia.org/T308070) (owner: 10Brennen Bearnes) [18:03:13] I am pleased with the latest enhancements to the train process :D [18:03:20] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808046 (https://phabricator.wikimedia.org/T308070) (owner: 10Brennen Bearnes) [18:03:23] (03PS1) 10RobH: dumpsdata100[67] partman testing [puppet] - 10https://gerrit.wikimedia.org/r/808047 (https://phabricator.wikimedia.org/T302937) [18:04:16] (03CR) 10RobH: [C: 03+2] dumpsdata100[67] partman testing [puppet] - 10https://gerrit.wikimedia.org/r/808047 (https://phabricator.wikimedia.org/T302937) (owner: 10RobH) [18:05:56] (03PS2) 10Eigyan: [wmf-config]: Deploy GDI Survey to PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808045 (https://phabricator.wikimedia.org/T311261) [18:06:38] 10SRE, 10ops-codfw: (Need By:TBD) 
rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Papaul) [18:07:34] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.17 refs T308070 [18:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:39] T308070: 1.39.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T308070 [18:07:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:41] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye [18:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:48] 10SRE, 10DC-Ops, 10Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye [18:08:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:08:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:10:27] (03PS3) 10Eigyan: 
[wmf-config]: Deploy GDI Survey to PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808045 (https://phabricator.wikimedia.org/T311261) [18:11:22] brennen: that looks quiet :] [18:12:18] yep, so far so good [18:16:47] 10SRE, 10ops-eqsin, 10Traffic: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (10ssingh) [18:17:32] 10SRE, 10ops-eqsin, 10Traffic: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (10ssingh) p:05Triage→03Low [18:17:47] (03CR) 10EllenR: [C: 03+1] "code looks good, not tested on my machine though" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808045 (https://phabricator.wikimedia.org/T311261) (owner: 10Eigyan) [18:20:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:49] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage [18:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:24] 10SRE, 10ops-eqsin, 10Traffic: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (10Dzahn) We have had the "mgmt flapping"-issue in other DCs. In codfw a bunch of them were fixed after Papaul did firmware upgrades on the DRACs. So I would suggest to check if you can get... [18:22:41] fun finding, anytime we deploy there is a small spike of 5xx errors reported https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=now-6h&to=now&viewPanel=63 (tick mw deploy and train deployments at top) [18:22:48] 10SRE, 10ops-eqsin, 10Traffic: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (10ssingh) >>! In T311264#8023977, @Dzahn wrote: > We have had the "mgmt flapping"-issue in other DCs. In codfw a bunch of them were fixed after Papaul did firmware upgrades on the DRACs. >... 
[18:22:54] (03PS1) 10Papaul: Add new model for new PDU in rack A1 [puppet] - 10https://gerrit.wikimedia.org/r/808048 (https://phabricator.wikimedia.org/T303696) [18:22:58] I am guessing it is related to the php fpm rolling restart [18:23:24] (03CR) 10Eigyan: [wmf-config]: Deploy GDI Survey to PROD (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808045 (https://phabricator.wikimedia.org/T311261) (owner: 10Eigyan) [18:23:40] 10SRE, 10ops-eqsin, 10Traffic: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (10Dzahn) See T283582 [18:23:43] (03CR) 10Papaul: [C: 03+2] Add new model for new PDU in rack A1 [puppet] - 10https://gerrit.wikimedia.org/r/808048 (https://phabricator.wikimedia.org/T303696) (owner: 10Papaul) [18:24:38] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage [18:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:39] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:27:05] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10Papaul) a:05LSobanski→03Papaul [18:29:55] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1007.eqiad.wmnet with OS bullseye [18:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:57] hashar: huh, sure enough [18:30:01] 10SRE, 10DC-Ops, 10Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1007 (**FAIL**) - Removed f... 
[18:30:13] brennen: the wikis look fine [18:30:39] yeah, temporary effect from restart would make sense. [18:30:56] although i think even before we were restarting i remember that there's usually a brief spike in timeouts etc. [18:32:24] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:36:16] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10Papaul) [18:36:52] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10Papaul) a:05Papaul→03LSobanski [18:38:27] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye [18:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:33] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye [18:39:12] (03CR) 10Jsn.sherman: [C: 03+1] "Looks good to me!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/808045 (https://phabricator.wikimedia.org/T311261) (owner: 10Eigyan) [18:40:52] (03CR) 10Eigyan: [wmf-config]: Deploy GDI Survey to PROD (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808045 (https://phabricator.wikimedia.org/T311261) (owner: 10Eigyan) [18:44:52] (03PS2) 10Samtar: QuickSurveys (beta): Fix typo and deploy to 'enwiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808035 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [18:45:52] 10SRE, 10Platform Team Workboards (Green): Install wrk, siege and lua-cjson packages on deploy1001 - https://phabricator.wikimedia.org/T230178 (10Dzahn) 05Resolved→03Open [18:47:27] (03PS9) 10Gehel: [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [18:47:28] PROBLEM - Check systemd state on an-worker1140 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:02] (03CR) 10CI reject: [V: 04-1] [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [18:49:57] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1007.eqiad.wmnet with OS bullseye [18:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:01] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1007 (**FAIL**) - Downtimed on Icinga/Alertmanage... 
[18:50:32] PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:53:25] (03PS1) 10Dzahn: deployment_server: remove packages wrk, siege and lua-cjson [puppet] - 10https://gerrit.wikimedia.org/r/808052 (https://phabricator.wikimedia.org/T230178) [18:53:27] PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:54:08] RECOVERY - Check systemd state on an-worker1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:54:09] 10SRE, 10Patch-For-Review, 10Platform Team Workboards (Green): Install wrk, siege and lua-cjson packages on deploy1001 - https://phabricator.wikimedia.org/T230178 (10Dzahn) @hashar here you go :) https://gerrit.wikimedia.org/r/808052 [18:55:00] (03CR) 10Samtar: [C: 03+1] "lgtm ✔" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808035 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [18:55:29] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye [18:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:34] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye [18:56:13] (03CR) 10Dzahn: [C: 03+2] base: create profile to allow unprivileged userns, use it on gitlab_runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn) [18:59:31] RECOVERY - Hadoop 
NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:59:46] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list for WikiSound Audio Speaks Campaign - https://phabricator.wikimedia.org/T311230 (10Ladsgroup) I suggest waiting for a bit before making a mailing list if this campaign is in such early stages? [19:01:41] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [19:05:10] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list for WikiSound Audio Speaks Campaign - https://phabricator.wikimedia.org/T311230 (10Anasskoko) Alright we get the page campaign ready in the next few days, we want to create the page quick with all other means of communication starting from social media hand... 
[19:08:30] (03PS10) 10Gehel: [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:09:19] (03CR) 10CI reject: [V: 04-1] [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:15:05] RECOVERY - Check systemd state on an-worker1140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:13] (03PS11) 10Ryan Kemper: elastic: configure keystore values for restore [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) [19:18:46] (03PS12) 10Ryan Kemper: elastic: configure keystore values for restore [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) [19:18:55] (03CR) 10Dzahn: "currently we have compiler failures on deployment with this due to missing (fake) secrets:" [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche) [19:19:18] (03PS13) 10Ryan Kemper: elastic: configure keystore values for restore [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) [19:21:51] (03PS1) 10Ryan Kemper: elastic: remove decom'd hosts [puppet] - 10https://gerrit.wikimedia.org/r/808054 (https://phabricator.wikimedia.org/T302517) [19:22:47] (03CR) 10Ryan Kemper: [C: 03+2] elastic: remove decom'd hosts [puppet] - 10https://gerrit.wikimedia.org/r/808054 (https://phabricator.wikimedia.org/T302517) (owner: 10Ryan Kemper) [19:23:13] (03PS1) 10Jdlrobson: [cleanup] Drop non-existent feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808055 [19:23:15] (03PS1) 10Jdlrobson: Enable title above tabs on group 1 and group 0 wikis (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808056 
(https://phabricator.wikimedia.org/T310054) [19:23:17] (03PS1) 10Jdlrobson: Enable title above tabs on all opt-in wikis (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808057 (https://phabricator.wikimedia.org/T310054) [19:23:21] (03PS1) 10Dzahn: add missing fake keys for keyholder/trainbranchbot [labs/private] - 10https://gerrit.wikimedia.org/r/808058 (https://phabricator.wikimedia.org/T310620) [19:23:36] (03CR) 10Dzahn: "needed https://gerrit.wikimedia.org/r/808058" [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche) [19:24:10] (03CR) 10CI reject: [V: 04-1] Enable title above tabs on group 1 and group 0 wikis (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808056 (https://phabricator.wikimedia.org/T310054) (owner: 10Jdlrobson) [19:24:44] (03PS2) 10Dzahn: add missing fake keys for keyholder/trainbranchbot [labs/private] - 10https://gerrit.wikimedia.org/r/808058 (https://phabricator.wikimedia.org/T310620) [19:24:47] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:24:48] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1007.eqiad.wmnet with OS bullseye [19:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:53] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1007 (**FAIL**) - Removed from Puppet and PuppetD... 
[19:25:57] 10SRE, 10serviceops, 10Patch-For-Review: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10Dzahn) fake secrets were needed to be able to puppet compile scap changes such as https://gerrit.wikimedia.org/r/c/operations/puppet/+/806397 [19:26:13] (03PS14) 10Ryan Kemper: elastic: configure keystore values for restore [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) [19:26:56] (03PS1) 10RobH: adding lvm wipe for recipe [puppet] - 10https://gerrit.wikimedia.org/r/808060 (https://phabricator.wikimedia.org/T302937) [19:27:00] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36022/console" [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:27:15] (03CR) 10Dzahn: [C: 03+2] add missing fake keys for keyholder/trainbranchbot [labs/private] - 10https://gerrit.wikimedia.org/r/808058 (https://phabricator.wikimedia.org/T310620) (owner: 10Dzahn) [19:27:17] (03PS2) 10RobH: adding lvm wipe for recipe [puppet] - 10https://gerrit.wikimedia.org/r/808060 (https://phabricator.wikimedia.org/T302937) [19:27:30] (03CR) 10RobH: [C: 03+2] adding lvm wipe for recipe [puppet] - 10https://gerrit.wikimedia.org/r/808060 (https://phabricator.wikimedia.org/T302937) (owner: 10RobH) [19:28:39] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add missing fake keys for keyholder/trainbranchbot [labs/private] - 10https://gerrit.wikimedia.org/r/808058 (https://phabricator.wikimedia.org/T310620) (owner: 10Dzahn) [19:29:28] robh: I merged your change [19:29:59] mutante: I'm around if you need me for stuff related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/806397 [19:32:10] dancy: it compiles now: https://puppet-compiler.wmflabs.org/pcc-worker1002/36023/ [19:32:21] i'll merge it [19:32:41] phab1001 is in there as an example of a 
scap::target [19:32:54] Gotcha. [19:33:07] (03CR) 10Dzahn: [C: 03+1] "after adding the fake keys it compiles now and lgtm https://puppet-compiler.wmflabs.org/pcc-worker1002/36023/" [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche) [19:34:07] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye [19:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:11] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye [19:37:26] dancy: was just double checking this does NOT run on all the existing scap target appservers .. but "creates" parameter is good for that [19:37:37] (03CR) 10Dzahn: [C: 03+2] scap bootstrap: refactor [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche) [19:37:55] Nod. I expect the `creates` part to prevent this from doing anything interesting on existing hosts. [19:37:57] [mwdebug1001:~] $ file /var/lib/scap/scap/bin/scap [19:38:07] is correct and exists [19:38:07] I would like to have a test for a new host soon [19:38:39] deploying.running puppet on mwdebug1001 [19:39:07] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1007.eqiad.wmnet with OS bullseye [19:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:11] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1007 (**FAIL**) - Removed from Puppet and PuppetD... 
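Note on the `creates` discussion above: Puppet's exec resource skips its command when the path named by `creates` already exists, which is why the refactored bootstrap is expected to do nothing on hosts that already have /var/lib/scap/scap/bin/scap. A rough Python sketch of that guard, assuming a hypothetical helper name `run_unless_created` and a temporary marker file in place of the real scap binary:

```python
import os
import subprocess
import sys
import tempfile


def run_unless_created(command, creates):
    """Run `command` only if the path `creates` does not exist yet.

    Returns True if the command ran, False if it was skipped.
    This mimics the idea behind Puppet's exec `creates`, not its implementation.
    """
    if os.path.exists(creates):
        return False
    subprocess.run(command, check=True)
    return True


def demo():
    """Run the same guarded command twice; only the first call should execute."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "scap")  # stand-in for the real binary path
        command = [
            sys.executable,
            "-c",
            "import sys; open(sys.argv[1], 'w').close()",
            marker,
        ]
        first = run_unless_created(command, marker)
        second = run_unless_created(command, marker)
        return first, second


if __name__ == "__main__":
    print(demo())
```

The second call is skipped because the first one created the marker, which is the behaviour being relied on for existing scap targets.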
[19:39:43] dancy: arr. an issue appears that is not caught by the compiler [19:40:15] * dancy shakes a fist [19:40:22] Warning: /Stage[main]/Mediawiki::Scap/Exec[fetch_mediawiki]: Skipping because of failed dependencies [19:41:00] it's a dependency issue because bootstrap-scap-target.sh does not exist [19:41:32] https://phabricator.wikimedia.org/P30043 [19:42:11] the file url `puppet:///modules/scap/files/bootstrap-scap-target.sh` should not have `/files/` in it, I think [19:42:22] (03CR) 10Dzahn: [C: 03+2] "there is a dependency issue here: https://phabricator.wikimedia.org/P30043" [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche) [19:42:40] taavi: sounds right :) [19:42:48] (Memory over 85%) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Memory over 85% - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25 [19:43:02] "files" is automagically skipped in the URL, and adding it causes this quite a bit [19:44:40] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10RobH) So when attempting to run the updated partman recipe I get the following: {F35268994} │ Unable to automatically remove LVM data │ │ Because the volume group(s) on the sele... 
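Note on the `/files/` discussion above: a `puppet:///modules/<module>/<path>` URL is served from the module's `files/` directory, so the directory name is left out of the URL. The Python sketch below shows the mapping and why the extra `/files/` segment pointed at a path that does not exist. The helper name `puppet_url_to_repo_path` and the `modules` root directory are assumptions made for illustration; this is not how Puppet itself resolves files internally.

```python
PREFIX = "puppet:///modules/"


def puppet_url_to_repo_path(url: str, module_root: str = "modules") -> str:
    """Map a puppet:///modules/... URL to the repository path it would be served from.

    Illustrative only: the file server inserts the module's `files/`
    directory between the module name and the rest of the path.
    """
    if not url.startswith(PREFIX):
        raise ValueError(f"not a module file URL: {url}")
    module, _, relative = url[len(PREFIX):].partition("/")
    if not module or not relative:
        raise ValueError(f"missing module or file path: {url}")
    return f"{module_root}/{module}/files/{relative}"


# URL as corrected in the follow-up change: resolves to the real file.
good = puppet_url_to_repo_path("puppet:///modules/scap/bootstrap-scap-target.sh")
# URL as originally merged: `files` ends up doubled, so the lookup fails.
bad = puppet_url_to_repo_path("puppet:///modules/scap/files/bootstrap-scap-target.sh")
print(good)
print(bad)
```

The second result contains `files/files/`, which matches the symptom in the log: the catalog compiled, but the agent could not fetch the script, and the dependent `Exec[fetch_mediawiki]` was skipped.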
[19:45:33] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.04605 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:45:42] (03PS2) 10Dzahn: deployment_server: remove packages wrk, siege and lua-cjson [puppet] - 10https://gerrit.wikimedia.org/r/808052 (https://phabricator.wikimedia.org/T230178) [19:45:44] (03PS1) 10Dzahn: scap: remove 'files' from file path [puppet] - 10https://gerrit.wikimedia.org/r/808061 (https://phabricator.wikimedia.org/T310740) [19:46:29] ACKNOWLEDGEMENT - Widespread puppet agent failures on alert1001 is CRITICAL: 0.04605 ge 0.01 daniel_zahn https://phabricator.wikimedia.org/P30043 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:46:42] (03CR) 10Dzahn: [C: 03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/808061/" [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche) [19:47:48] (Memory over 85%) resolved: Device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet recovered from Memory over 85% - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25 [19:49:32] (03CR) 10Zabe: [C: 03+1] scap: remove 'files' from file path [puppet] - 10https://gerrit.wikimedia.org/r/808061 (https://phabricator.wikimedia.org/T310740) (owner: 10Dzahn) [19:49:46] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36024/mwdebug1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/808061 (https://phabricator.wikimedia.org/T310740) (owner: 10Dzahn) [19:49:54] (03PS2) 10Dzahn: scap: remove 'files' from file path [puppet] - 10https://gerrit.wikimedia.org/r/808061 (https://phabricator.wikimedia.org/T310740) [19:51:13] (03CR) 10Dzahn: [V: 03+2] scap: remove 'files' from file path [puppet] - 10https://gerrit.wikimedia.org/r/808061 (https://phabricator.wikimedia.org/T310740) (owner: 10Dzahn) [19:52:23] Notice: 
/Stage[main]/Scap/File[/usr/local/bin/bootstrap-scap-target.sh]/ensure: defined content as '{md5}c1b63bd9017a681fd52a28e4faad4b8d' [19:53:02] Notice: Applied catalog in 50.66 seconds [19:53:10] sweeeeet! [19:53:39] so now I just want that "widespread puppet failures" alert to recover in a bit [19:53:48] but besides that we should be good [19:54:00] Thanks for pushing it along. [19:54:01] looks good on mwdebug1001 and phab1001 [19:54:10] yw [19:54:39] (03PS1) 10Zabe: scap: remove 'files' from puppet file url [puppet] - 10https://gerrit.wikimedia.org/r/808062 [19:56:17] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' visualenhancements as beta feature on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808063 (https://phabricator.wikimedia.org/T311269) [19:56:19] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools topicsubscription on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808064 (https://phabricator.wikimedia.org/T310808) [19:56:34] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:25] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools topicsubscription, autotopicsub on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808064 (https://phabricator.wikimedia.org/T310808) [19:59:09] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools visualenhancements on beta cluster as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808063 (https://phabricator.wikimedia.org/T311269) [19:59:29] (03PS3) 10Bartosz Dziewoński: Enable DiscussionTools topicsubscription, autotopicsub on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808064 (https://phabricator.wikimedia.org/T310808) [19:59:43] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] brennen: I, the Bot under the 
Fountain, call upon thee, The Deployer, to do UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220623T2000). [20:00:06] danisztls, eigyan, koi, and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:24] hello [20:00:30] Greetings Everyone [20:00:31] I'm here [20:00:34] greetings [20:00:38] hi there [20:01:01] o/ howdy all [20:01:12] (03CR) 10Dzahn: [C: 03+2] "after the follow-up it looks good now. on mwdebug1001/phab1001 the shell script was created but not executed. nothing else happened. as it" [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche) [20:01:18] o/ [20:01:32] busy one this evening :) [20:01:34] scap refactored just in time, heh [20:01:39] o/ [20:01:40] (03CR) 10Thcipriani: [C: 03+2] QuickSurveys (beta): Fix typo and deploy to 'enwiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808035 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:02:27] (03Merged) 10jenkins-bot: QuickSurveys (beta): Fix typo and deploy to 'enwiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808035 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:03:43] mutante: oh boy and now I get to test it :D [20:04:05] thcipriani: uhm.. 
puppet is currently failed on a bunch of mw hosts [20:04:11] but next run will fix it [20:04:30] danisztls: ^ your change should be live on with the next run of https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/ [20:04:43] (currently pending) [20:04:47] thanks thcipriani [20:04:57] danisztls: yw :) [20:05:00] (03PS4) 10Thcipriani: [wmf-config]: Deploy GDI Survey to PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808045 (https://phabricator.wikimedia.org/T311261) (owner: 10Eigyan) [20:05:04] (03CR) 10Thcipriani: [C: 03+2] [wmf-config]: Deploy GDI Survey to PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808045 (https://phabricator.wikimedia.org/T311261) (owner: 10Eigyan) [20:05:16]  thank you thcipriani [20:05:31] yw eigyan :) [20:05:44] :D [20:05:51] (03Merged) 10jenkins-bot: [wmf-config]: Deploy GDI Survey to PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808045 (https://phabricator.wikimedia.org/T311261) (owner: 10Eigyan) [20:05:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:06:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:53] !log cumin -b 15 -p 95 'wtp*' 'run-puppet-agent -q --failed-only' [20:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:31] (03PS2) 10Thcipriani: ukwikibooks: Add NS102 (Рецепт) to wgContentNamespaces 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/806847 (https://phabricator.wikimedia.org/T310940) (owner: 10Stang) [20:09:29] eigyan: your change is live on mwdebug1002, check please! [20:09:47] Excellent will do thcipriani [20:09:50] (03PS4) 10Zabe: mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 (https://phabricator.wikimedia.org/T308013) [20:09:57] !log cumin -b 15 -p 95 'parse*' 'run-puppet-agent -q --failed-only' [20:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:38] (03CR) 10CI reject: [V: 04-1] mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [20:11:02] (03CR) 10DDesouza: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:11:16] !log cumin -b 15 -p 95 'mw2*' 'run-puppet-agent -q --failed-only' [20:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:08] eigyan: seeing some warnings on the debug server for array_key_exists in quicksurveys--is that related to this patch? [20:12:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:04] thcipriani: i am adding a patch to the backport window if that's okay? 
[20:13:24] Jdlrobson: sure [20:13:27] Hmmm...I didn't catch any warnings that I know of thcipriani [20:14:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:14:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:49] Also I am not seeing the surveys I am interested in on mwdebug1002 [20:14:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:15:36] !log cumin -b 15 -p 95 'mw1*' 'run-puppet-agent -q --failed-only' [20:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:07] (03CR) 10Esanders: [C: 03+1] Enable DiscussionTools visualenhancements on beta cluster as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808063 (https://phabricator.wikimedia.org/T311269) (owner: 10Bartosz Dziewoński) [20:17:16] (03PS1) 10Jdlrobson: Change default skin on next set of pilot wikis to Vector (2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808067 (https://phabricator.wikimedia.org/T307903) [20:17:49] eigyan: hrm, maybe it's due to this warning I'm seeing, "array_key_exists() expects parameter 2 to be array, string given" seems to be related to your patch. I can file a bug with a backtrace if that's helpful? [20:18:05] thcipriani: added ^ [20:18:27] that would surely help thcipriani thank you! 
[20:18:33] will do [20:19:55] (03PS2) 10MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099) [20:19:57] eigyan: here you go: https://phabricator.wikimedia.org/T311271 [20:20:10] I'm going to revert for the time being, generates a lot of log noise [20:20:14] Many thanks thcipriani [20:20:31] Perfect sounds good [20:20:36] (03PS1) 10Thcipriani: Revert "[wmf-config]: Deploy GDI Survey to PROD" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807905 [20:20:48] (03CR) 10Thcipriani: [C: 03+2] Revert "[wmf-config]: Deploy GDI Survey to PROD" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807905 (owner: 10Thcipriani) [20:21:32] (03Merged) 10jenkins-bot: Revert "[wmf-config]: Deploy GDI Survey to PROD" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807905 (owner: 10Thcipriani) [20:21:44] (03PS2) 10Dzahn: scap: remove 'files' from puppet file url [puppet] - 10https://gerrit.wikimedia.org/r/808062 (https://phabricator.wikimedia.org/T310740) (owner: 10Zabe) [20:21:51] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:13] (03PS3) 10Dzahn: scap: remove 'files' from puppet file url [puppet] - 10https://gerrit.wikimedia.org/r/808062 (https://phabricator.wikimedia.org/T310740) (owner: 10Zabe) [20:22:27] (03CR) 10Dzahn: "thank you, Zabe" [puppet] - 10https://gerrit.wikimedia.org/r/808062 (https://phabricator.wikimedia.org/T310740) (owner: 10Zabe) [20:22:31] (03PS3) 10Thcipriani: ukwikibooks: Add NS102 (Рецепт) to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806847 (https://phabricator.wikimedia.org/T310940) (owner: 10Stang) [20:22:36] (03CR) 10Thcipriani: [C: 03+2] ukwikibooks: Add NS102 (Рецепт) to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806847 
(https://phabricator.wikimedia.org/T310940) (owner: 10Stang) [20:23:29] (03Merged) 10jenkins-bot: ukwikibooks: Add NS102 (Рецепт) to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806847 (https://phabricator.wikimedia.org/T310940) (owner: 10Stang) [20:23:52] (03PS1) 10Jdlrobson: Rename `data-ve-target-container` attribute to `data-mw-ve-target-container` [skins/Vector] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/808068 (https://phabricator.wikimedia.org/T310197) [20:24:27] koi: your change is live on mwdebug1002, check please! [20:24:33] looking [20:24:39] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003036 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:24:52] (03PS2) 10Jdlrobson: [cleanup] Drop non-existent feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808055 [20:25:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:11] brb [20:25:34] thcipriani: also added some opportunistic cleanup ^ [20:25:45] eigyan: your change has been reverted, feel free to drop :) [20:25:56] Jdlrobson: neat :) [20:26:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:26:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:41] no problem, I need to spend some time reviewing the error. thanks again for all your help thcipriani [20:26:53] sure thing, yw eigyan! 
[20:27:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:17] (03PS1) 10Jdlrobson: Rename `data-ve-target-container` attribute to `data-mw-ve-target-container` [extensions/VisualEditor] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/808071 (https://phabricator.wikimedia.org/T310197) [20:27:22] (03CR) 10CI reject: [V: 04-1] Rename `data-ve-target-container` attribute to `data-mw-ve-target-container` [skins/Vector] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/808068 (https://phabricator.wikimedia.org/T310197) (owner: 10Jdlrobson) [20:29:25] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [20:29:26] 10SRE, 10ops-eqsin, 10Traffic: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (10RobH) So if the idrac is accessible, the firmware update isn't OS impacting. However, I cannot login to this idrac interface via HTTPS or SSH, so it appears it'll have to be fully power dr... [20:30:30] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host dse-k8s-ctrl1001.eqiad.wmnet [20:30:31] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [20:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:57] thcipriani: I'm not sure about that, how to actually test a modification of wgContentNamespaces? 
[20:31:27] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (10Dzahn) ` dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 4 --disk 20 --network pr... [20:32:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:33:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:14] koi: ¯\_(ツ)_/¯ [20:33:25] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) > Hello Rob, > > > > We’ll open to our 2nd line to dispatch a technician to loop test at our panel. > > > > In future, for the first step please let us know while bi-directional loops are placed... 
[20:34:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:09] (03PS3) 10Thcipriani: Enable DiscussionTools visualenhancements on beta cluster as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808063 (https://phabricator.wikimedia.org/T311269) (owner: 10Bartosz Dziewoński) [20:34:16] (03CR) 10Thcipriani: [C: 03+2] Enable DiscussionTools visualenhancements on beta cluster as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808063 (https://phabricator.wikimedia.org/T311269) (owner: 10Bartosz Dziewoński) [20:34:38] anyway, although I couldn't find a way to validate it, it's ok to do a sync [20:35:13] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:35:16] (03Merged) 10jenkins-bot: Enable DiscussionTools visualenhancements on beta cluster as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808063 (https://phabricator.wikimedia.org/T311269) (owner: 10Bartosz Dziewoński) [20:35:29] koi: were there pages in the namespace previously? [20:36:15] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:36:16] thcipriani: https://uk.wikibooks.org/wiki/Special:Allpages?from=&to=&namespace=102 [20:36:34] MatmaRex: https://gerrit.wikimedia.org/r/808063 should go live with the next https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/ FYI [20:37:01] yep, thanks [20:38:19] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:05] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:42] koi: okie doke, going live [20:39:46] (03PS1) 10Zabe: security: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/808072 (https://phabricator.wikimedia.org/T308013) [20:40:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:40:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:05] (03CR) 10Eevans: [C: 03+1] deployment_server: remove packages wrk, siege and lua-cjson [puppet] - 10https://gerrit.wikimedia.org/r/808052 (https://phabricator.wikimedia.org/T230178) (owner: 10Dzahn) [20:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:30] (03PS4) 10Thcipriani: Enable DiscussionTools topicsubscription, autotopicsub on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808064 (https://phabricator.wikimedia.org/T310808) (owner: 10Bartosz Dziewoński) [20:40:42] (03CR) 10Thcipriani: [C: 03+2] Enable DiscussionTools topicsubscription, autotopicsub on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808064 (https://phabricator.wikimedia.org/T310808) (owner: 10Bartosz Dziewoński) [20:41:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:41] (03Merged) 10jenkins-bot: Enable DiscussionTools topicsubscription, autotopicsub on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808064 (https://phabricator.wikimedia.org/T310808) (owner: 10Bartosz Dziewoński) [20:43:02] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:43:02] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache 
dse-k8s-ctrl1001.eqiad.wmnet on all recursors [20:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-ctrl1001.eqiad.wmnet on all recursors [20:43:09] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [20:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:56] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:806847|ukwikibooks: Add NS102 (Рецепт) to wgContentNamespaces (T310940)]] (duration: 03m 41s) [20:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:01] T310940: Change in $wgContentNamespaces for ukwikibooks - https://phabricator.wikimedia.org/T310940 [20:44:43] MatmaRex: your second change is on mwdebug1002, check please! 
[20:45:22] thcipriani: looks good [20:45:39] MatmaRex: cool, thanks for checking, syncing now [20:45:57] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [20:46:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:17] (03CR) 10AOkoth: [C: 03+2] gitlab/acme_chief: remove gitlab2001 from list of (passive) hosts [puppet] - 10https://gerrit.wikimedia.org/r/806863 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [20:46:29] (03PS2) 10AOkoth: gitlab/acme_chief: remove gitlab2001 from list of (passive) hosts [puppet] - 10https://gerrit.wikimedia.org/r/806863 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [20:46:40] (03PS2) 10Thcipriani: Change default skin on next set of pilot wikis to Vector (2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808067 (https://phabricator.wikimedia.org/T307903) (owner: 10Jdlrobson) [20:46:48] (03CR) 10Thcipriani: [C: 03+2] Change default skin on next set of pilot wikis to Vector (2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808067 (https://phabricator.wikimedia.org/T307903) (owner: 10Jdlrobson) [20:47:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:47:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:52] 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q4): Set up POC Dispatch environment and evaluate 
its viability - https://phabricator.wikimedia.org/T309033 (10herron) [20:47:53] (03Merged) 10jenkins-bot: Change default skin on next set of pilot wikis to Vector (2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808067 (https://phabricator.wikimedia.org/T307903) (owner: 10Jdlrobson) [20:48:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:48:09] (03PS3) 10Thcipriani: [cleanup] Drop non-existent feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808055 (owner: 10Jdlrobson) [20:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:24] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:48:24] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache dse-k8s-ctrl1001.eqiad.wmnet on all recursors [20:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:27] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-ctrl1001.eqiad.wmnet on all recursors [20:48:27] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host dse-k8s-ctrl1001.eqiad.wmnet [20:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:19] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:808064|Enable DiscussionTools topicsubscription, autotopicsub on testwiki (T310808)]] (duration: 03m 18s) [20:49:20] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (10Dzahn) @BTullis I tried to create one for you but the cookbook failed at the DNS update step: `FAIL .... 
[20:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:23] T310808: Turn on DiscussionTools subscription feature at testwiki - https://phabricator.wikimedia.org/T310808 [20:49:41] MatmaRex: should be live now :) [20:50:02] thanks [20:50:38] I think I filed that ticket. Thanks for accepting :) [20:50:51] Jdlrobson: first one is live on mwdebug1002, check please! [20:52:12] (03PS3) 10MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099) [20:52:22] thcipriani: thanks looking [20:52:27] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:53:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:54:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:22] thcipriani: LGTM! 
[20:54:43] (03PS2) 10AOkoth: DHCP: remove gitlab2001 [puppet] - 10https://gerrit.wikimedia.org/r/806862 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [20:55:00] Jdlrobson: cool, going live [20:55:02] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:55:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:27] the next one should be a NOOP [20:56:12] !log thcipriani@deploy1002 Started scap: Config: [[gerrit:808067|Change default skin on next set of pilot wikis to Vector (2022) (T307903)]] [20:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:15] T307903: Change default skin on next set of pilot wikis to Vector (2022) - https://phabricator.wikimedia.org/T307903 [20:56:32] (03CR) 10Dzahn: [C: 03+2] scap: remove 'files' from puppet file url [puppet] - 10https://gerrit.wikimedia.org/r/808062 (https://phabricator.wikimedia.org/T310740) (owner: 10Zabe) [20:57:25] (03CR) 10AOkoth: [C: 03+2] DHCP: remove gitlab2001 [puppet] - 10https://gerrit.wikimedia.org/r/806862 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [20:59:09] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:01:34] !log looking in to wdqs1006 alert ^^ [21:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:59] (03CR) 10Dzahn: [C: 03+1] gitlab_runner/hiera: change docker volume size in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/807991 (owner: 10Jelto) [21:06:07] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.065 second response time 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:07:13] @thcipriani do you need me to test https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/808055/ or can we just sync it? [21:08:16] Jdlrobson: these are unused feature flags, correct? [21:08:59] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:59] (03PS3) 10Dzahn: gitlab: add prometheus blackbox http monitor [puppet] - 10https://gerrit.wikimedia.org/r/806476 [21:13:41] !log thcipriani@deploy1002 Finished scap: Config: [[gerrit:808067|Change default skin on next set of pilot wikis to Vector (2022) (T307903)]] (duration: 17m 29s) [21:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:47] T307903: Change default skin on next set of pilot wikis to Vector (2022) - https://phabricator.wikimedia.org/T307903 [21:14:29] Jdlrobson: if they're just unused feature flags, I can ensure nothing explodes [21:14:43] yep unused [21:14:50] (03CR) 10Thcipriani: [C: 03+2] [cleanup] Drop non-existent feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808055 (owner: 10Jdlrobson) [21:15:00] Jdlrobson: k, I can check it [21:15:38] (03Merged) 10jenkins-bot: [cleanup] Drop non-existent feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808055 (owner: 10Jdlrobson) [21:15:39] Jdlrobson: t.hcipriani's got a meeting, i can pick up this last one. [21:15:56] (03PS4) 10Dzahn: gitlab: add prometheus blackbox http monitor [puppet] - 10https://gerrit.wikimedia.org/r/806476 [21:16:49] (03CR) 10Dzahn: [C: 03+2] deployment_server: remove packages wrk, siege and lua-cjson [puppet] - 10https://gerrit.wikimedia.org/r/808052 (https://phabricator.wikimedia.org/T230178) (owner: 10Dzahn) [21:18:20] nothing blows up on mwdebug1002, syncing. 
[21:18:35] :) [21:18:38] that's good to know [21:20:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:43] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:21:17] SRE, Patch-For-Review, Platform Team Workboards (Green): Install wrk, siege and lua-cjson packages on deploy1001 - https://phabricator.wikimedia.org/T230178 (Dzahn) Notice: /Stage[main]/Profile::Mediawiki::Deployment::Server/Package[siege]/ensure: removed Notice: /Stage[main]/Profile::Mediawiki::Depl... [21:21:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:21:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:36] !log brennen@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:808055|[cleanup] Drop non-existent feature flags]] (duration: 03m 33s) [21:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:50] SRE, ops-eqiad: cloudstore1008 - eno2 reporting no carrier - https://phabricator.wikimedia.org/T309885 (Jclark-ctr) a:Jclark-ctr [21:22:04] !log end of utc late backport & config window [21:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:24] SRE, ops-eqiad: Q1 '19:(Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (Jclark-ctr)
a:Jclark-ctr→Cmjohnson [21:23:52] !log restbase-dev1006 has manually installed packages (wrk, maybe others) [21:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:32] SRE, Patch-For-Review, Platform Team Workboards (Green): Install wrk, siege and lua-cjson packages on deploy1001 - https://phabricator.wikimedia.org/T230178 (Dzahn) Open→Resolved [21:26:18] (CR) Dzahn: "wow,nice catch" [puppet] - https://gerrit.wikimedia.org/r/807927 (owner: Jelto) [21:29:29] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:31:20] (PS1) Chad: Reinstate my shell account, grab all the roles for RelEng [puppet] - https://gerrit.wikimedia.org/r/808079 [21:44:23] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:49:39] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [21:56:09] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:40] SRE, Traffic, Wikimedia-Incident: 503 Service Unavailable (June 23 2022) - https://phabricator.wikimedia.org/T311197 (colewhite) Open→Resolved a:colewhite Thank you for the report. Users experienced connectivity issues to the projects starting at 5:05 UTC. Service was restored at 05:11...
[22:06:01] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:09:49] (CR) David Caro: [C: +2] "Thanks!!" [puppet] - https://gerrit.wikimedia.org/r/808044 (owner: Majavah) [22:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:10:14] SRE-OnFire, SRE Observability (FY2021/2022-Q4): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (herron) Still running into some problems after rebuilding and upgrading. Primarily, incidents created are missing name/id on both a fresh install, as we...
[22:16:15] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:31:33] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:32:26] (PS2) Jdlrobson: Rename `data-ve-target-container` attribute to `data-mw-ve-target-container` [skins/Vector] (wmf/1.39.0-wmf.17) - https://gerrit.wikimedia.org/r/808068 (https://phabricator.wikimedia.org/T310197) [22:36:43] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:37:47] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:43:23] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:59:25] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:57] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:11:07] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:27] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:20:37] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state