[00:03:24] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [00:03:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P20951 and previous config saved to /var/cache/conftool/dbconfig/20220217-000355-marostegui.json [00:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:40] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:18:28] (03PS8) 10JHathaway: Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 [00:18:30] RECOVERY - Check systemd state on doh6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T300381)', diff saved to https://phabricator.wikimedia.org/P20952 and previous config saved to /var/cache/conftool/dbconfig/20220217-001859-marostegui.json [00:19:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [00:19:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [00:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:07] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [00:19:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T300381)', diff saved to https://phabricator.wikimedia.org/P20953 and previous 
config saved to /var/cache/conftool/dbconfig/20220217-001907-marostegui.json [00:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:20] (03CR) 10jenkins-bot: [V: 04-1] Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [00:23:14] (03PS9) 10JHathaway: Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 [00:26:46] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Matthiasb) I urge to investigate wether the Russian issue can be minimized if DPL is not used on category pages... [00:38:02] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:38:34] getting intermittent failures from US east coast [00:38:41] upstream connect error or disconnect/reset before headers. 
reset reason: overflow [00:38:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [00:39:00] a few other reports of the same in Discord [00:39:22] AntiComposite: thanks, looking [00:39:47] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.2685 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [00:40:07] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [00:40:11] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [00:41:08] online, just got the page [00:41:24] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:41:26] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3058.esams.wmnet are marked down but pooled: textlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3058.esams.wmnet are marked down but pooled: testlb6_443: Servers 
cp3060.esams.wmnet, cp3050.esams.wmnet, cp3058.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3060.es [00:41:26] t, cp3050.esams.wmnet, cp3058.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:41:34] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:41:40] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3060 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Varnish [00:41:52] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3058 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Varnish [00:41:52] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:42:11] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.8909 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [00:42:26] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:42:26] PROBLEM - Varnish HTTP text-frontend - port 80 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:42:26] PROBLEM - Varnish HTTP text-frontend - port 80 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:42:32] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 43.77 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:42:34] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [00:42:40] PROBLEM - PyBal backends health check on lvs3007 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3058.esams.wmnet are marked down but pooled: textlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3058.esams.wmnet are marked down but pooled: testlb6_443: Servers cp3060.esams.wmnet, cp3058.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3060.esams.wmnet, cp3050.es [00:42:40] t are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:42:44] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:42:48] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:43:00] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:43:00] PROBLEM - Varnish HTTP text-frontend - port 80 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:43:01] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:43:12] PROBLEM - LVS text esams port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 
#page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:43:32] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:43:46] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:43:50] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:43:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [00:44:04] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:44:04] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:44:14] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:44:14] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:44:14] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:44:34] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:44:34] PROBLEM - Varnish HTTP text-frontend 
- port 3121 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:44:42] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:44:42] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:45:02] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [00:45:06] RECOVERY - PyBal backends health check on lvs3007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:45:14] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:45:20] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:45:27] RECOVERY - LVS text esams port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 610 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:45:38] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:46:58] PROBLEM - Number of messages locally queued by purged for processing on cp3060 is CRITICAL: cluster=cache_text instance=cp3060 job=purged layer=frontend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [00:47:14] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:47:14] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:47:18] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [00:47:54] PROBLEM - Number of messages locally queued by purged for processing on cp3050 is CRITICAL: cluster=cache_text instance=cp3050 job=purged layer=frontend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [00:48:18] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.316 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:48:28] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:48:28] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:48:28] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:48:40] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal 
[00:48:54] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:48:54] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:49:32] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:49:42] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 72.25 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:50:16] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:50:28] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:51:00] RECOVERY - Varnish HTTP text-frontend - port 80 on cp3058 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:51:20] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:51:34] RECOVERY - Varnish HTTP text-frontend - port 80 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:51:41] RECOVERY - Number of messages locally queued by purged for processing on cp3060 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [00:52:22] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:55:01] RECOVERY - Number of messages locally queued by purged for processing on cp3050 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [00:55:32] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 8.769 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:55:32] RECOVERY - Varnish HTTP text-frontend - port 80 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 9.002 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:55:44] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:56:34] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 7.900 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:56:46] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:57:31] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:57:31] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:58:10] RECOVERY - Varnish HTTP 
text-frontend - port 3124 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:59:12] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [01:00:05] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220217T0100). Please do the needful. [01:00:52] AntiComposite: we were discussing in a private channel but to follow up here -- thanks for the advance heads up, appreciate you being faster than the automatic alerts :) [01:01:23] not the first time, probably won't be the last :) [01:07:38] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [01:18:56] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [01:20:41] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [01:36:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T300381)', diff saved to https://phabricator.wikimedia.org/P20954 and previous config saved to /var/cache/conftool/dbconfig/20220217-013607-marostegui.json [01:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:16] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [01:38:00] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:40:22] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:51:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P20955 and previous config saved to /var/cache/conftool/dbconfig/20220217-015111-marostegui.json [01:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P20956 and previous config saved to /var/cache/conftool/dbconfig/20220217-020616-marostegui.json [02:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:57] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (10MZMcBride) 05Resolved→03Open This issue is still happening. [02:17:04] rzl: Hi. I reopened https://phabricator.wikimedia.org/T301505 just now. Should this ticket be assigned to you? [02:17:55] Oona: sure -- nothing to share on it yet but I can claim the task [02:18:03] sorry for the trouble, will have more to share soon [02:18:25] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (10RLazarus) a:05Ladsgroup→03RLazarus [02:18:32] Awesome, thanks so much. 
[02:21:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T300381)', diff saved to https://phabricator.wikimedia.org/P20957 and previous config saved to /var/cache/conftool/dbconfig/20220217-022121-marostegui.json [02:21:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [02:21:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [02:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:28] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [02:21:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T300381)', diff saved to https://phabricator.wikimedia.org/P20958 and previous config saved to /var/cache/conftool/dbconfig/20220217-022128-marostegui.json [02:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:00:36] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:32:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T300381)', diff saved to https://phabricator.wikimedia.org/P20959 and previous config saved to /var/cache/conftool/dbconfig/20220217-033159-marostegui.json [03:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:07] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [03:47:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P20960 
and previous config saved to /var/cache/conftool/dbconfig/20220217-034704-marostegui.json [03:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:02:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P20961 and previous config saved to /var/cache/conftool/dbconfig/20220217-040208-marostegui.json [04:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T300381)', diff saved to https://phabricator.wikimedia.org/P20962 and previous config saved to /var/cache/conftool/dbconfig/20220217-041713-marostegui.json [04:17:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance [04:17:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance [04:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:21] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [04:17:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T300381)', diff saved to https://phabricator.wikimedia.org/P20963 and previous config saved to /var/cache/conftool/dbconfig/20220217-041721-marostegui.json [04:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:53] (03PS1) 104nn1l2: InitialiseSettings: General cleanup, wgAddGroups (R-Z) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763398 (https://phabricator.wikimedia.org/T301647) [05:41:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1104 (T300381)', diff saved to https://phabricator.wikimedia.org/P20964 and previous config saved to /var/cache/conftool/dbconfig/20220217-054154-marostegui.json [05:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:02] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [05:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:56:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P20965 and previous config saved to /var/cache/conftool/dbconfig/20220217-055659-marostegui.json [05:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P20966 and previous config saved to /var/cache/conftool/dbconfig/20220217-061203-marostegui.json [06:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:49] (03PS2) 10Andrew Bogott: backy2: don't back up shelved instances [puppet] - 10https://gerrit.wikimedia.org/r/763345 [06:12:51] (03PS1) 10Andrew Bogott: backy2: initialize backy2 database if necessary [puppet] - 10https://gerrit.wikimedia.org/r/763401 [06:15:14] (03CR) 10jenkins-bot: [V: 04-1] backy2: don't back up shelved instances [puppet] - 10https://gerrit.wikimedia.org/r/763345 (owner: 10Andrew Bogott) [06:21:27] (03PS2) 10Andrew Bogott: backy2: initialize backy2 database if necessary [puppet] - 10https://gerrit.wikimedia.org/r/763401 [06:21:29] (03PS3) 10Andrew Bogott: backy2: don't back up shelved instances [puppet] - 10https://gerrit.wikimedia.org/r/763345 [06:27:09] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T300381)', diff saved to https://phabricator.wikimedia.org/P20967 and previous config saved to /var/cache/conftool/dbconfig/20220217-062708-marostegui.json [06:27:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [06:27:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [06:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:14] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [06:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:36] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:19:17] (03PS3) 10BrandonXLF: wiki replicas: Only hide log_params when bit 0 is on in log_delete [puppet] - 10https://gerrit.wikimedia.org/r/758081 (https://phabricator.wikimedia.org/T301943) [07:32:00] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:46:33] (03CR) 10Elukey: [V: 03+1 C: 03+2] logstash::input::kafka: allow a custom truststore path [puppet] - 10https://gerrit.wikimedia.org/r/763110 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [07:53:05] (03PS1) 10ArielGlenn: add Hannah Okwelum to platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/763456 (https://phabricator.wikimedia.org/T301876) [08:00:04] Amir1 and apergos: It is that lovely time of the day again! You are hereby commanded to deploy UTC early backport and config training. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220217T0800). [08:00:04] kart_: A patch you scheduled for UTC early backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:18] * kart_ is here.. [08:00:20] oh? woops [08:00:28] lemme see whether we have any trainees for the session [08:00:41] nope! [08:00:48] let me look at the patches for today's window [08:00:57] OK. Then, I can self deploy. [08:01:04] oh. I forgot... I am not here, because Code Jam this week, heh [08:01:08] anyways lemme just look [08:02:09] yours is the lone patch, looks reasonable to me, I see it already has a +1 (thank you!), feel free to go ahead [08:02:22] Thanks! :) [08:02:44] (03PS4) 10KartikMistry: Enable SectionTranslation in Occitan and Luganda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761626 (https://phabricator.wikimedia.org/T301443) [08:04:29] (03CR) 10KartikMistry: [C: 03+2] "Config deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/761626 (https://phabricator.wikimedia.org/T301443) (owner: 10KartikMistry) [08:05:11] (03Merged) 10jenkins-bot: Enable SectionTranslation in Occitan and Luganda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761626 (https://phabricator.wikimedia.org/T301443) (owner: 10KartikMistry) [08:06:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, + what Cole said" [puppet] - 10https://gerrit.wikimedia.org/r/763172 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [08:06:35] (03CR) 10Filippo Giunchedi: [C: 03+1] remove deprecated piechart plugin [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763334 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [08:06:52] (03CR) 10Filippo Giunchedi: [C: 03+1] update grafana-image-renderer to 3.3.0 [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763335 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [08:07:00] (03CR) 10Filippo Giunchedi: [C: 03+1] update grafana-simple-json-datasource to 1.4.2 [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763337 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [08:08:37] (03CR) 10Filippo Giunchedi: [C: 03+1] grafana-next: set grafana codfw base domain to grafana next [puppet] - 10https://gerrit.wikimedia.org/r/763329 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [08:08:53] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: link alerts to their Icinga web page [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763197 (https://phabricator.wikimedia.org/T300859) (owner: 10Filippo Giunchedi) [08:09:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:56] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10elukey) Hi everybody, is there a timeline 
for MOSS? The ML-Team is currently using the Thanos Swift cluster to store objects/models, we don't require a lot of space but at the same time we are not a grea... [08:10:35] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:761626|Enable SectionTranslation in Occitan and Luganda WPs + CX out-of-Beta for Luganda WP (T301443)]] (duration: 00m 51s) [08:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:40] T301443: Enable Flores for Occitan and Luganda - https://phabricator.wikimedia.org/T301443 [08:10:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:10:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:11] (Although Log message is wrong, taking from the Phab) [08:11:39] (03CR) 10Elukey: "ping :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [08:12:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:14] apergos: I'm done with config patch. [08:13:48] all tested and happy? fabulous! [08:14:17] anyone else with a patch they'd like to add last minute, since there's still plenty of time? [08:19:08] apergos: Yes. All good :) [08:19:33] apergos: I've a patch [08:19:43] should i self-service or do we have a trainee? [08:20:07] this: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/763171 [08:20:31] (03PS1) 10Filippo Giunchedi: prometheus: pass extinfo-url to icinga-exporter [puppet] - 10https://gerrit.wikimedia.org/r/763457 (https://phabricator.wikimedia.org/T300859) [08:20:36] good morning [08:21:28] morning. 
[08:21:31] well in that case... [08:21:32] morning! [08:21:34] !log UTC early B&C window completed [08:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:46] apergos: you must've missed my message above :) [08:21:51] dangit! [08:22:03] !log UTC early B&C window NOT completed, woops. [08:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:11] no trainees, self deploy! [08:22:14] (03PS2) 10Urbanecm: Deploy Growth features to 100% of newcomers on most Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763171 (https://phabricator.wikimedia.org/T301820) [08:22:17] doing :) [08:22:21] (03CR) 10Urbanecm: [C: 03+2] Deploy Growth features to 100% of newcomers on most Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763171 (https://phabricator.wikimedia.org/T301820) (owner: 10Urbanecm) [08:23:04] (03Merged) 10jenkins-bot: Deploy Growth features to 100% of newcomers on most Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763171 (https://phabricator.wikimedia.org/T301820) (owner: 10Urbanecm) [08:26:32] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: c0cbd3048f9d288b40dbde09506fe212de176f19: Deploy Growth features to 100% of newcomers on most Wikipedias (T301820) (duration: 00m 50s) [08:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:37] * urbanecm done [08:26:38] T301820: Scale: enable Growth features for 100% of new accounts on most Wikipedias - https://phabricator.wikimedia.org/T301820 [08:26:48] !log UTC early B&C now really done [08:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:08] (03PS1) 10Filippo Giunchedi: am: remove Icinga/ prefix and add 'source' label 
[debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763459 (https://phabricator.wikimedia.org/T300951) [08:28:10] (03PS1) 10Filippo Giunchedi: am: add 'host' label and add port to 'instance' [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763460 (https://phabricator.wikimedia.org/T300951) [08:28:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:28:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:26] thanks for actually closing the window, urbane cm :-D [08:34:31] see everyone next time! [08:37:45] <_joe_> jelto: the output from helmfile is much better now, thanks [08:43:23] (03PS1) 10Majavah: hieradata: pcc: add clouddb-services-puppetmaster-01 key [puppet] - 10https://gerrit.wikimedia.org/r/763461 [08:43:53] about mediawiki train, I have filed a few tasks here and there but nothing concerning really [08:44:09] so I will roll the train to all wikis [08:45:06] though I will delay it a bit since I have a quick meeting at 9:00 UTC [08:45:07] \o/ [08:51:24] _joe_: thanks! I also think the new output is more helpful now [08:55:42] hey urbanecm I notice you didn't add your patch to the deployment calendar, please don't forget to do that so we have a record. :-) [08:55:57] good point, let me do that now [08:56:46] apergos: {{done}} [08:57:21] ty! [09:00:05] hashar and jeena: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220217T0900) [09:14:15] (03CR) 10JMeybohm: [C: 03+1] "As said on IRC:" [puppet] - 10https://gerrit.wikimedia.org/r/763277 (https://phabricator.wikimedia.org/T289131) (owner: 10Elukey) [09:19:20] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/763345 (owner: 10Andrew Bogott) [09:28:17] (03CR) 10Elukey: [C: 03+2] role::ml_k8s::master: enable Priority plugin [puppet] - 10https://gerrit.wikimedia.org/r/763277 (https://phabricator.wikimedia.org/T289131) (owner: 10Elukey) [09:31:31] ok train time [09:31:56] woo hoo! [09:36:50] (03PS1) 10Hashar: all wikis to 1.38.0-wmf.22 refs T300198 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763473 [09:36:52] (03CR) 10Hashar: [C: 03+2] all wikis to 1.38.0-wmf.22 refs T300198 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763473 (owner: 10Hashar) [09:37:50] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.22 refs T300198 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763473 (owner: 10Hashar) [09:39:07] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.22 refs T300198 [09:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:14] T300198: 1.38.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T300198 [09:40:42] Houston we are LIVE! [09:40:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:58] hashar: so, first morning train ever successfully finished? That's perfect 🙂 [09:42:09] yes! 
we are lucky :] [09:42:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:42:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:44] hashar: i see T301936 opened under the blockers task though. Dunno if you saw it and assessed it though. [09:42:45] T301936: Interwiki prefix "wikipedia" not working on multilingual wikimedia projects - https://phabricator.wikimedia.org/T301936 [09:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:43:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:34] oops [09:44:38] totally missed out that one [09:45:42] ah it got added as a blocker one hour ago :/ [09:45:48] so after I have checked the list of blockers [09:46:15] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1017.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [09:46:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1017.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [09:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:41] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [09:47:52] 10SRE, 10ops-eqiad: 
Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1017 [09:48:30] so the wikipedia: interwiki got broken ? :\ [09:50:32] !log migrate instances off ganeti1012 [09:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:29] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Update CAS to 6.2 - https://phabricator.wikimedia.org/T265857 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This has been resolved for a long time, closing. [09:59:59] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10cmooney) 05In progress→03Resolved [10:00:27] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10cmooney) [10:00:58] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10cmooney) Ok gonna close this one, range announced and doh working on IPv6 from all our POPs now. I've a separate task - T301900 - to validate the route p... 
[10:05:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/763456 (https://phabricator.wikimedia.org/T301876) (owner: 10ArielGlenn) [10:11:53] (03PS1) 10Gehel: elasticsearch: allow using elasticsearch v6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763477 [10:12:48] (03PS2) 10Gehel: elasticsearch: allow using elasticsearch v6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763477 (https://phabricator.wikimedia.org/T295666) [10:14:43] (03PS1) 10Gehel: elasticsearch: upgrade deployment-prep to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763478 (https://phabricator.wikimedia.org/T301954) [10:15:47] (03PS1) 10Gehel: elasticsearch: upgrade deployment-prep to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763479 (https://phabricator.wikimedia.org/T301955) [10:16:05] apergos: Reedy: hi, any clue who might have the knowledge about `wikipedia:` interwikis being broken? https://phabricator.wikimedia.org/T301936 [10:16:28] I am pretty sure I once understood how interwiki worked or were defined but that was several years ago [10:16:39] (03PS1) 10Kevin Bazira: ml-services: add cswiki & dewiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/763480 (https://phabricator.wikimedia.org/T301415) [10:16:41] (03PS1) 10Gehel: elasticsearch: upgrade deployment-prep to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763481 (https://phabricator.wikimedia.org/T301956) [10:16:49] my knowledge is as out of date as yours [10:17:10] I can ask in our team channel (or so can you), gotta think about timezones [10:17:13] ah good to know I am not the only one :D [10:17:27] will do [10:17:50] mention it's a train blocker and if it's ubn mention that too [10:18:15] our team is on "Code Jam" this week so we are supposed to not do anything else, obviously if it's ubn/train blocker then we stop and look at that [10:18:28] 10SRE, 10SRE-Access-Requests: saisuman ssh production public keys reused for 
WMCS - https://phabricator.wikimedia.org/T300708 (10SCherukuwada) Just did. All good, thank you and sorry for the trouble! [10:18:50] (03PS2) 10Gehel: elasticsearch: upgrade cloudelastic to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763481 (https://phabricator.wikimedia.org/T301956) [10:19:00] (03PS2) 10Gehel: elasticsearch: upgrade relforge to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763479 (https://phabricator.wikimedia.org/T301955) [10:20:08] (03PS1) 10Majavah: toolsdb primary: come back in read only mode [puppet] - 10https://gerrit.wikimedia.org/r/763482 [10:20:47] (03PS1) 10Gehel: elasticsearch: upgrade codfw to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763483 (https://phabricator.wikimedia.org/T301958) [10:20:49] (03PS1) 10Gehel: elasticsearch: upgrade eqiad to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763484 (https://phabricator.wikimedia.org/T301959) [10:22:12] 10SRE, 10SRE-Access-Requests: saisuman ssh production public keys reused for WMCS - https://phabricator.wikimedia.org/T300708 (10MMandere) 05In progress→03Resolved a:03MMandere Thank you @SCherukuwada for confirming. We'll have the task marked as resolved for now, please reopen if you experience any ne... [10:25:15] (03CR) 10Elukey: [C: 03+2] ml-services: add cswiki & dewiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/763480 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [10:26:53] (03PS1) 10EJoseph: Upgrade to elasticsearch 7.10.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/763485 (https://phabricator.wikimedia.org/T299226) [10:32:16] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[10:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:48] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [10:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:16] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [10:38:21] 10SRE, 10Infrastructure-Foundations, 10Traffic: Anycast: Add IPv6 support to bird and anycast-healthchecker (Puppet) - https://phabricator.wikimedia.org/T292737 (10ssingh) 05Open→03Resolved IPv6 support for Wikidough and durum was finalized in T301165. Thanks to Arzhel, Cathal, and John Bond for all the... [10:39:07] (03PS2) 10Btullis: Remove the old AQS nodes from the aqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/761884 (https://phabricator.wikimedia.org/T297803) [10:40:49] (03CR) 10Btullis: [C: 03+2] Remove the old AQS nodes from the aqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/761884 (https://phabricator.wikimedia.org/T297803) (owner: 10Btullis) [10:41:26] (03PS1) 10Giuseppe Lavagetto: conftool: add request-actions / request-patterns [puppet] - 10https://gerrit.wikimedia.org/r/763486 [10:42:02] (03CR) 10jerkins-bot: [V: 04-1] conftool: add request-actions / request-patterns [puppet] - 10https://gerrit.wikimedia.org/r/763486 (owner: 10Giuseppe Lavagetto) [10:46:39] !log running schema change against s5 T300774 [10:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:45] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [10:46:47] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:46:48] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance 
[10:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:53] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T300774)', diff saved to https://phabricator.wikimedia.org/P20968 and previous config saved to /var/cache/conftool/dbconfig/20220217-104653-kormat.json [10:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:11] (03PS2) 10Giuseppe Lavagetto: conftool: add request-actions / request-patterns [puppet] - 10https://gerrit.wikimedia.org/r/763486 [10:58:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [dns] - 10https://gerrit.wikimedia.org/r/763323 (https://phabricator.wikimedia.org/T300076) (owner: 10Jbond) [10:58:47] (03CR) 10jerkins-bot: [V: 04-1] conftool: add request-actions / request-patterns [puppet] - 10https://gerrit.wikimedia.org/r/763486 (owner: 10Giuseppe Lavagetto) [11:00:05] mvolz: #bothumor I οΏ½ Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220217T1100). [11:01:30] (03CR) 10Jbond: [C: 03+2] wikimedia.org: Add MS O365 txt record [dns] - 10https://gerrit.wikimedia.org/r/763323 (https://phabricator.wikimedia.org/T300076) (owner: 10Jbond) [11:01:34] !log installing python3.5 security uodates [11:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:27] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review, 10WMSE (IT): Need Assistance adding DNS records to claim domain - https://phabricator.wikimedia.org/T300076 (10jbond) this is in place now, notice the list TXT line below ` lang=console $ dig txt wikimedia.org @ns0.wikimedia.org... 
[11:11:12] 10SRE, 10DNS, 10Traffic, 10WMSE (IT): Need Assistance adding DNS records to claim domain - https://phabricator.wikimedia.org/T300076 (10jbond) 05Open→03Stalled [11:13:43] I have confirmed the wikipedia: interwiki is broken due to https://gerrit.wikimedia.org/r/c/mediawiki/core/+/760695 [11:14:47] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300774)', diff saved to https://phabricator.wikimedia.org/P20969 and previous config saved to /var/cache/conftool/dbconfig/20220217-111447-kormat.json [11:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:52] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [11:21:12] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "one slightly misleading task ID (I think), LGTM otherwise" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763398 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [11:23:03] (03PS1) 10Hashar: Revert "Optimise Skin::getLanguages()" [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/763294 (https://phabricator.wikimedia.org/T301936) [11:27:11] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: elastic1043.eqiad.wmnet [11:27:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1043.eqiad.wmnet [11:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:38] (03PS7) 10Minato826: Enable RelatedArticles for desktop (non-mobile) view at zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762761 (https://phabricator.wikimedia.org/T299856) [11:28:39] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: elastic1046.eqiad.wmnet [11:28:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1046.eqiad.wmnet [11:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:52] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P20970 and previous config saved to /var/cache/conftool/dbconfig/20220217-112951-kormat.json [11:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:33] (03CR) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [11:36:41] (03PS6) 10Jbond: R:varnish:instance: Add genral public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) [11:36:43] (03PS11) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) [11:36:47] (03CR) 10ZPapierski: [C: 03+1] elasticsearch: allow using elasticsearch v6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763477 (https://phabricator.wikimedia.org/T295666) (owner: 10Gehel) [11:37:37] (03CR) 10jerkins-bot: [V: 04-1] R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [11:43:24] (03PS12) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) [11:44:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolsdb primary: come back in read only mode [puppet] - 10https://gerrit.wikimedia.org/r/763482 (owner: 10Majavah) [11:44:57] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to 
https://phabricator.wikimedia.org/P20971 and previous config saved to /var/cache/conftool/dbconfig/20220217-114456-kormat.json [11:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:09] 10SRE, 10SRE-Access-Requests: Requesting access to Superset/Turnilo for Kinneretgordon - https://phabricator.wikimedia.org/T301098 (10MMandere) >>! In T301098#7714249, @gerritbot wrote: > Change 763200 **merged** by MMandere: > %%%[operations/puppet@production] admin: Change Kinneret username%%% > https://gerr... [12:00:01] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300774)', diff saved to https://phabricator.wikimedia.org/P20972 and previous config saved to /var/cache/conftool/dbconfig/20220217-120001-kormat.json [12:00:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:00:05] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:06] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:00:07] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [12:00:10] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:14] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T300774)', diff saved to https://phabricator.wikimedia.org/P20973 and previous config saved to /var/cache/conftool/dbconfig/20220217-120014-kormat.json [12:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:19] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:26] (03PS1) 10Majavah: prometheus: add heartbeat collection on mysqld_exporter [puppet] - 10https://gerrit.wikimedia.org/r/763490 [12:03:53] (03CR) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [12:13:49] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler: compiler1003.puppet-diffs.eqiad1.wikimedia.cloud out of disk space - https://phabricator.wikimedia.org/T295253 (10Majavah) 05Open→03Resolved a:03Majavah [12:18:48] (03PS1) 10MMandere: admin: Change Kinneret username [puppet] - 10https://gerrit.wikimedia.org/r/763498 (https://phabricator.wikimedia.org/T301098) [12:19:15] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/763498 (https://phabricator.wikimedia.org/T301098) (owner: 10MMandere) [12:20:07] (03CR) 10MMandere: [C: 03+2] admin: Change Kinneret username [puppet] - 10https://gerrit.wikimedia.org/r/763498 (https://phabricator.wikimedia.org/T301098) (owner: 10MMandere) [12:25:58] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300774)', diff saved to https://phabricator.wikimedia.org/P20974 and previous config saved to /var/cache/conftool/dbconfig/20220217-122557-kormat.json [12:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:04] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [12:30:23] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Superset/Turnilo for Kinneretgordon - https://phabricator.wikimedia.org/T301098 (10MMandere) 05In progress→03Resolved a:03MMandere Marking this task as resolved. 
@KinneretG, please feel free to reopen it if you encounter any new iss... [12:41:03] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P20975 and previous config saved to /var/cache/conftool/dbconfig/20220217-124102-kormat.json [12:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:30] 10SRE, 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10jbond) i ran a script yesterday which has collected all the current drac and bios versions. Sorry... [12:44:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10MMandere) [12:53:03] (03CR) 10Jbond: conftool: add request-actions / request-patterns (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/763486 (owner: 10Giuseppe Lavagetto) [12:56:07] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P20976 and previous config saved to /var/cache/conftool/dbconfig/20220217-125607-kormat.json [12:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:56] !log installing expat security updates [13:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:12] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300774)', diff saved to https://phabricator.wikimedia.org/P20977 and previous config saved to /var/cache/conftool/dbconfig/20220217-131111-kormat.json [13:11:13] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:11:15] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
dbstore1003.eqiad.wmnet with reason: Maintenance [13:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:18] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:31] (03PS1) 10Muehlenhoff: Add Cumin alias for durum [puppet] - 10https://gerrit.wikimedia.org/r/763509 [13:13:43] (03PS2) 10Muehlenhoff: Add Cumin alias for durum [puppet] - 10https://gerrit.wikimedia.org/r/763509 [13:16:41] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 20 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:18:27] !log installing zsh security updates [13:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:08] (03PS2) 104nn1l2: InitialiseSettings: General cleanup, wgAddGroups (R-Z) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763398 (https://phabricator.wikimedia.org/T301647) [13:19:41] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:23:24] (03CR) 104nn1l2: InitialiseSettings: General cleanup, wgAddGroups (R-Z) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763398 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [13:35:51] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:35:53] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:00] (JobUnavailable) firing: Reduced 
availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:43:20] !log installing paramiko securiy updates [13:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:08] (03CR) 10David Caro: backy2: on Bullseye, hack around a silly package name mismatch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763336 (https://phabricator.wikimedia.org/T301909) (owner: 10Andrew Bogott) [13:57:02] (03CR) 10Ssingh: [C: 03+1] "Thank you for this patch!" [puppet] - 10https://gerrit.wikimedia.org/r/763509 (owner: 10Muehlenhoff) [13:58:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [13:58:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [13:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:31] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T300774)', diff saved to https://phabricator.wikimedia.org/P20979 and previous config saved to /var/cache/conftool/dbconfig/20220217-135831-kormat.json [13:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:38] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [14:00:04] brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC evening backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220217T1400). [14:00:04] nn1l2 and anoop: A patch you scheduled for UTC evening backport and config training is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:08] hi [14:00:14] Hello [14:00:21] (03CR) 10Jbond: "just noticed i never pressed send on this comment :P" [puppet] - 10https://gerrit.wikimedia.org/r/722599 (https://phabricator.wikimedia.org/T261966) (owner: 10Jbond) [14:00:26] (03PS6) 10Jbond: C:cassandra: add optional java_package variable [puppet] - 10https://gerrit.wikimedia.org/r/722599 (https://phabricator.wikimedia.org/T261966) [14:00:36] o/ [14:00:44] so the timing of this window turns out to be a bit unclear… [14:01:05] (03CR) 10Jbond: C:package_builder: Add Script for building debian packages from git (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681445 (owner: 10Jbond) [14:01:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] InitialiseSettings: General cleanup, wgAddGroups (R-Z) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763398 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [14:02:16] (03PS3) 10Giuseppe Lavagetto: conftool: add request-actions / request-patterns [puppet] - 10https://gerrit.wikimedia.org/r/763486 [14:03:01] but since there’s nothing else in the calendar at the moment, I assume it’s okay to do deployments unless brennen or someone else disagrees [14:03:05] (I’ll wait for a few minutes) [14:06:17] (03CR) 10Hashar: ci: Qemu image and snapshot creation (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [14:06:26] (03PS8) 10Jbond: exim: add the ability to silently drop senders [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [14:06:28] (03PS14) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [14:06:39] (03CR) 10Jbond: [C: 03+1] "i think this looks good to go?" 
[puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [14:07:15] (03CR) 10Jbond: [C: 03+1] Remove ArgparseFormatter as it's now unused [cookbooks] - 10https://gerrit.wikimedia.org/r/762860 (owner: 10Volans) [14:07:53] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:11:02] alright, let’s do the config changes then [14:11:23] (03PS3) 10Lucas Werkmeister (WMDE): InitialiseSettings: General cleanup, wgAddGroups (R-Z) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763398 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [14:11:58] (waiting for the diffConfig build to finish before +2ing) [14:12:24] jouncebot: now [14:12:24] For the next 0 hour(s) and 47 minute(s): UTC evening backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220217T1400) [14:12:42] hashar: I moved the window in the calendar yesterday, assuming that this is where it was supposed to be [14:12:50] per the discussion in Gerrit that assumption might have been wrong [14:12:57] but I’m assuming I can still do deployments now [14:13:17] Lucas_WMDE: yes it is all good :) [14:13:19] but if you disagree I can also hold off (nothing merged yet) [14:13:21] ok :) [14:13:34] I think the original intent for those windows was: [14:13:40] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable RelatedArticles for desktop (non-mobile) view at zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762761 (https://phabricator.wikimedia.org/T299856) (owner: 10Minato826) [14:13:52] 1) ensure someone knowing how to deploy stuff is available as a service to other developers that don't know much about scap [14:13:57] (03CR) 10Jbond: [C: 04-1] "see comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/762934 (owner: 10Volans) [14:14:04] 2) avoid concurrent conflicting 
deployments [14:14:06] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "diffConfig empty, good to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763398 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [14:14:14] (03CR) 10Jbond: [C: 03+1] add Hannah Okwelum to platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/763456 (https://phabricator.wikimedia.org/T301876) (owner: 10ArielGlenn) [14:14:28] I am quite happy to let folks deploy out of window or adjust the window if needed :) [14:14:30] (03CR) 10Volans: [C: 03+2] Remove ArgparseFormatter as it's now unused [cookbooks] - 10https://gerrit.wikimedia.org/r/762860 (owner: 10Volans) [14:15:03] (03Merged) 10jenkins-bot: InitialiseSettings: General cleanup, wgAddGroups (R-Z) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763398 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [14:15:05] PROBLEM - Host prometheus6001 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:09] PROBLEM - Host bast6001 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:13] PROBLEM - Host ncredir6002 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:15] PROBLEM - Host install6001 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:29] PROBLEM - Host netflow6001 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:33] nn1l2: change is on mwdebug1001, let’s test [14:15:39] ^ something going on in dc 6? 
[14:15:45] (don’t remember which one that is) [14:15:53] PROBLEM - Host ncredir6001 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:58] drmrs [14:15:59] Lucas_WMDE: drmrs, please ignore [14:16:02] ok [14:16:41] (03CR) 10Jbond: [C: 03+2] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/763461 (owner: 10Majavah) [14:17:07] (03PS1) 10Hashar: Stop excluding the 'wikipedia' interwiki prefix [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/763298 (https://phabricator.wikimedia.org/T301936) [14:17:23] ^^ I am adding this patch to the current window [14:17:31] ah, very good [14:17:56] I’ll sync the config change and then let you do that one, ok? [14:18:26] wikidata still works :) [14:18:33] syncing (I tested simplewiki ^^) [14:18:52] (03Merged) 10jenkins-bot: Remove ArgparseFormatter as it's now unused [cookbooks] - 10https://gerrit.wikimedia.org/r/762860 (owner: 10Volans) [14:19:10] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:763398|InitialiseSettings: General cleanup, wgAddGroups (R-Z) (T301647)]] (no-op) (duration: 00m 50s) [14:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:15] T301647: Clean up InitialiseSettings - https://phabricator.wikimedia.org/T301647 [14:19:30] hashar: want to self-service? 
[14:19:32] (03Abandoned) 10Hashar: Revert "Optimise Skin::getLanguages()" [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/763294 (https://phabricator.wikimedia.org/T301936) (owner: 10Hashar) [14:19:58] Lucas_WMDE: sure I can self deploy ;) [14:20:04] will do whenever the other patches are complete [14:20:04] ok, over to you then [14:20:15] I’d say unbreak the train before the remaining config change [14:20:22] zabe: i am going to deploy the interwiki config change ;) [14:20:44] well the interwiki has been broken since yesterday so there is no rush [14:20:57] ok fine [14:21:05] but I think you can start the gate-and-submit for yours at least [14:21:08] (03PS8) 10Lucas Werkmeister (WMDE): Enable RelatedArticles for desktop (non-mobile) view at zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762761 (https://phabricator.wikimedia.org/T299856) (owner: 10Minato826) [14:21:11] (03CR) 10Hashar: [C: 03+2] Stop excluding the 'wikipedia' interwiki prefix [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/763298 (https://phabricator.wikimedia.org/T301936) (owner: 10Hashar) [14:21:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable RelatedArticles for desktop (non-mobile) view at zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762761 (https://phabricator.wikimedia.org/T299856) (owner: 10Minato826) [14:21:15] true! done [14:21:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:36] nice [14:22:15] (03Merged) 10jenkins-bot: Enable RelatedArticles for desktop (non-mobile) view at zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762761 (https://phabricator.wikimedia.org/T299856) (owner: 10Minato826) [14:22:37] anoop: your zhwikinews change is on mwdebug1001, can you test it? 
[14:22:40] zabe: I am inclined toward a tiny logic change in the code but I have been too lazy to investigate! Daniel Kinzler hinted at a configuration issue and I am more than happy to remove a hack from interwikiDump.php :] We will see how it behaves once pulled on mwdebug1001 [14:22:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:22:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:12] (03Merged) 10jenkins-bot: Stop excluding the 'wikipedia' interwiki prefix [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/763298 (https://phabricator.wikimedia.org/T301936) (owner: 10Hashar) [14:23:18] ok, working fine [14:23:26] yay [14:23:51] RECOVERY - Host netflow6001 is UP: PING OK - Packet loss = 0%, RTA = 101.23 ms [14:23:53] RECOVERY - Host ncredir6002 is UP: PING OK - Packet loss = 0%, RTA = 101.65 ms [14:23:53] RECOVERY - Host bast6001 is UP: PING OK - Packet loss = 0%, RTA = 105.57 ms [14:23:53] RECOVERY - Host ncredir6001 is UP: PING OK - Packet loss = 0%, RTA = 101.69 ms [14:23:56] I also see something that looks like related pages at the bottom of zhwikinews [14:24:00] though I can’t read Chinese ^^ [14:24:03] RECOVERY - Host install6001 is UP: PING OK - Packet loss = 0%, RTA = 101.66 ms [14:24:05] sync running [14:24:13] Amir1: I am going to actually use your deploy-commands site ( https://deploy-commands.toolforge.org/bacc/763298 ). 
It is really a blessing ;] [14:24:13] RECOVERY - Host prometheus6001 is UP: PING OK - Packet loss = 0%, RTA = 101.92 ms [14:24:18] and drmrs is coming back how nice [14:24:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:27] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300774)', diff saved to https://phabricator.wikimedia.org/P20980 and previous config saved to /var/cache/conftool/dbconfig/20220217-142427-kormat.json [14:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:32] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [14:24:33] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:762761|Enable RelatedArticles for desktop (non-mobile) view at zhwikinews (T299856)]] (duration: 00m 49s) [14:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:37] hashar: ^^ [14:24:37] T299856: Enable RelatedArticles for desktop (non-mobile) view at zhwikinews - https://phabricator.wikimedia.org/T299856 [14:24:40] hashar: all yours [14:25:01] looks like it already merged, nice [14:25:03] the sync summary is the most useful part for me [14:25:05] * hashar copy paste [14:25:49] hmm [14:25:59] somehow that rebased Echo and GrowthExperiments [14:26:21] ah local patches [14:26:39] mhm [14:26:44] patch on mwdebug1001 [14:28:03] I think due to security patches [14:28:16] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) Very interesting use case in https://gerrit.wikimedia.org/r/c/operations/puppet/+/763113, namely Beta/deployment-prep. We have two sets of VMs: * Kafka logging (c... 
[14:29:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:06] zabe: patch is on mwdebug1001. I am testing it [14:30:26] oh no [14:30:31] I have to regenerate the interwikidump [14:30:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:30:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:47] yep [14:31:15] also syncing out the wikimediamaintenance patch should do nothing, so there is no real risk [14:31:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:10] !log hashar@deploy1002 Synchronized php-1.38.0-wmf.22/extensions/WikimediaMaintenance/dumpInterwiki.php: Backport: [[gerrit:763298|Stop excluding the 'wikipedia' interwiki prefix (T301936)]] (duration: 00m 48s) [14:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:15] T301936: Interwiki prefix "wikipedia" not working on multilingual wikimedia projects - https://phabricator.wikimedia.org/T301936 [14:34:54] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for durum [puppet] - 10https://gerrit.wikimedia.org/r/763509 (owner: 10Muehlenhoff) [14:35:41] (03PS1) 10Hashar: Regen interwiki cache to drop erroneous 'wikipedia' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763516 (https://phabricator.wikimedia.org/T301936) [14:35:47] zabe: ^ ;) [14:37:03] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Regen interwiki cache to drop erroneous 'wikipedia' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763516 (https://phabricator.wikimedia.org/T301936) 
(owner: 10Hashar) [14:37:26] (03CR) 10Hashar: [C: 03+2] Regen interwiki cache to drop erroneous 'wikipedia' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763516 (https://phabricator.wikimedia.org/T301936) (owner: 10Hashar) [14:37:36] going to copy paste from https://deploy-commands.toolforge.org/bacc/763516 [14:37:39] thx Lucas_WMDE ! [14:37:41] (03CR) 10Zabe: [C: 03+1] Regen interwiki cache to drop erroneous 'wikipedia' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763516 (https://phabricator.wikimedia.org/T301936) (owner: 10Hashar) [14:38:08] (03Merged) 10jenkins-bot: Regen interwiki cache to drop erroneous 'wikipedia' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763516 (https://phabricator.wikimedia.org/T301936) (owner: 10Hashar) [14:38:43] testing on mwdebug1001 [14:39:32] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P20981 and previous config saved to /var/cache/conftool/dbconfig/20220217-143931-kormat.json [14:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:44] (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [14:39:48] (ProbeHttpFailed) firing: (22) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [14:41:20] (03CR) 10JHathaway: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [14:41:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:41:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:18] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@3a25565]: (no justification provided) [14:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:29] PROBLEM - Host mr1-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [14:42:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:42:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:08] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10Papaul) @elukey can you please get me the Partitioning/Raid information? Thanks [14:44:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:13] hashar: lgtm, after purging the interwiki links are working for me [14:44:22] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@3a25565]: (no justification provided) (duration: 02m 04s) [14:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:42] zabe: yes indeed [14:44:51] (03CR) 10Volans: "replies inline, follow up PS coming shortly" [cookbooks] - 10https://gerrit.wikimedia.org/r/762934 (owner: 10Volans) [14:44:55] I have replied on the task with all the tests I have made [14:45:01] some pages will have to be purged I believe [14:45:14] thank you very much for the patch to WikimediaMaintenance and the analysis! 
[14:45:23] !log hashar@deploy1002 Synchronized wmf-config/interwiki.php: Config: [[gerrit:763516|Regen interwiki cache to drop erroneous 'wikipedia' (T301936)]] (duration: 00m 48s) [14:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:28] T301936: Interwiki prefix "wikipedia" not working on multilingual wikimedia projects - https://phabricator.wikimedia.org/T301936 [14:45:33] PROBLEM - Host asw1-b12-drmrs.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [14:45:34] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10Papaul) [14:45:35] PROBLEM - Host asw1-b13-drmrs.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [14:45:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10elukey) @Papaul Hi! IIRC these nodes have two 2TB disks, so I'd go for the standard raid1 recipe: `echo partman/standard.cfg partman/raid1-2dev` Lemme... [14:46:55] yw [14:47:20] jouncebot: now [14:47:20] For the next 0 hour(s) and 12 minute(s): UTC evening backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220217T1400) [14:47:23] {{success}} [14:47:36] !log UTC evening backport and config training has completed. 
[14:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:05] imported openjdk-8 8u322-b06-1~deb11u1 for bullseye-wikimedia (forward port of latest Java 8 security fixes) [14:48:13] (03CR) 10JHathaway: R:varnish:instance: Add hiere key to control cloud ratelimits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [14:48:31] PROBLEM - Host asw1-b13-drmrs.wikimedia.org IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:57] PROBLEM - Host asw1-b12-drmrs.wikimedia.org IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:49:15] (03PS1) 10Elukey: install_server: add partman recipe for ml-cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/763518 (https://phabricator.wikimedia.org/T299433) [14:49:29] PROBLEM - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:49:44] (03PS9) 10JHathaway: exim: add the ability to silently drop senders [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) [14:50:39] (03CR) 10Elukey: [C: 03+2] install_server: add partman recipe for ml-cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/763518 (https://phabricator.wikimedia.org/T299433) (owner: 10Elukey) [14:51:08] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [14:51:17] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/763370 (owner: 10JHathaway) [14:51:27] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/763309 (owner: 10JHathaway) [14:51:41] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/763311 (owner: 10JHathaway) [14:52:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10elukey) Went ahead and merged the change, I've also run 
puppet across install nodes, so you can install the os whenever you want :) [14:52:30] (03PS4) 10Volans: sre.hosts.provision: check password correctness [cookbooks] - 10https://gerrit.wikimedia.org/r/762934 [14:53:36] (JobUnavailable) firing: (4) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [14:53:49] (03CR) 10Volans: "addressed comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/762934 (owner: 10Volans) [14:53:53] (ProbeHttpFailed) resolved: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [14:53:58] (ProbeHttpFailed) resolved: (22) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [14:54:37] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P20982 and previous config saved to /var/cache/conftool/dbconfig/20220217-145436-kormat.json [14:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:47] (03PS4) 10Giuseppe Lavagetto: conftool: add request-actions / request-patterns [puppet] - 10https://gerrit.wikimedia.org/r/763486 [14:59:37] RECOVERY - Host mr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 85.64 ms [14:59:39] RECOVERY - Host asw1-b12-drmrs.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 85.51 ms [15:00:19] RECOVERY - Host asw1-b13-drmrs.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 85.51 ms [15:01:13] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [15:01:16] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:29] RECOVERY - Host asw1-b13-drmrs.wikimedia.org IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.56 ms [15:01:55] RECOVERY - Host asw1-b12-drmrs.wikimedia.org IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.47 ms [15:05:25] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (10MatthewVernon) Hi, sorry for the slow response. The swift and `S3` endpoints are not available externally; instead the edge caches reverse-proxy e.g. upload.wikimedia.org to swift.discover... [15:06:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:36] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10Papaul) @elukey thanks [15:09:41] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300774)', diff saved to https://phabricator.wikimedia.org/P20983 and previous config saved to /var/cache/conftool/dbconfig/20220217-150941-kormat.json [15:09:46] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [15:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:47] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:09:47] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [15:09:49] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [15:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:55] !log 
kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [15:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:15] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [15:10:16] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [15:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:21] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T300774)', diff saved to https://phabricator.wikimedia.org/P20984 and previous config saved to /var/cache/conftool/dbconfig/20220217-151021-kormat.json [15:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:55] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10jbond) >>! In T300130#7718276, @elukey wrote: > but in theory it should be possible to point the logging project hosts to it via `profile::pki::client`. that and adding t... 
[15:14:35] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:15:39] RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.69 ms [15:20:30] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1012.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [15:20:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1012.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [15:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:06] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [15:23:09] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1012 [15:23:41] !log imported openjdk-8 8u322-b06-1~deb11u1 for bullseye-wikimedia (forward port of latest Java 8 security fixes) [15:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:49] PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: / 2118 MB (3% inode=98%): /tmp 2118 MB (3% inode=98%): /var/tmp 2118 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops [15:25:35] (03PS9) 10Cathal Mooney: Base config additions and updated templates to configure EVPN ASW [homer/public] - 10https://gerrit.wikimedia.org/r/759709 
(https://phabricator.wikimedia.org/T299758) [15:26:18] (03CR) 10jerkins-bot: [V: 04-1] Base config additions and updated templates to configure EVPN ASW [homer/public] - 10https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [15:26:19] moritzm: we cannot let go of Java 8 :D [15:26:21] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on testvm[2001-2003].codfw.wmnet with reason: Instance restarts [15:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on testvm[2001-2003].codfw.wmnet with reason: Instance restarts [15:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:25] yeah :-) [15:33:10] (03CR) 10Jbond: [C: 03+1] "LGTM, assuming nothing weird in pcc" [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [15:34:56] 10SRE-swift-storage, 10Observability-Logging, 10User-fgiunchedi: Missed swift log rotation can lead to full root filesystem - https://phabricator.wikimedia.org/T301657 (10MatthewVernon) Having a quick look at the logrotate configuration, it has ` /srv/log/swift/*.log { [...] postrotate service rs... 
[15:35:42] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300774)', diff saved to https://phabricator.wikimedia.org/P20986 and previous config saved to /var/cache/conftool/dbconfig/20220217-153542-kormat.json [15:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:48] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:36:02] (03CR) 10BBlack: conftool: add request-actions / request-patterns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763486 (owner: 10Giuseppe Lavagetto) [15:37:01] (03CR) 10Jbond: "LGTM but lets also check with pcc" [puppet] - 10https://gerrit.wikimedia.org/r/763370 (owner: 10JHathaway) [15:39:17] (03CR) 10BBlack: conftool: add request-actions / request-patterns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763486 (owner: 10Giuseppe Lavagetto) [15:41:36] (03CR) 10Jbond: [C: 03+1] "pcc: https://puppet-compiler.wmflabs.org/pcc-worker1001/33833/" [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [15:41:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2002.codfw.wmnet [15:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:56] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10AlexisJazz) [15:42:13] 10SRE-swift-storage, 10Observability-Logging, 10User-fgiunchedi: Missed swift log rotation can lead to full root filesystem - https://phabricator.wikimedia.org/T301657 (10MatthewVernon) ` swift (2.26.0-7) unstable; urgency=medium * Fix logging and logrotate to do like all the other OpenStack daemons. ` F... 
[15:45:30] (03CR) 10Jbond: "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/763309 (owner: 10JHathaway) [15:46:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33835/console" [puppet] - 10https://gerrit.wikimedia.org/r/763309 (owner: 10JHathaway) [15:47:07] (03PS1) 10Elukey: install_server: set new partman recipe for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763539 (https://phabricator.wikimedia.org/T300744) [15:47:09] (03PS1) 10Elukey: Add overlayfs settings for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763540 (https://phabricator.wikimedia.org/T300744) [15:49:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2002.codfw.wmnet [15:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:29] (03CR) 10JHathaway: Remove ordered_json function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763309 (owner: 10JHathaway) [15:50:47] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P20987 and previous config saved to /var/cache/conftool/dbconfig/20220217-155047-kormat.json [15:50:50] (03CR) 10Andrew Bogott: backy2: on Bullseye, hack around a silly package name mismatch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763336 (https://phabricator.wikimedia.org/T301909) (owner: 10Andrew Bogott) [15:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:06] (03PS2) 10David Caro: mcrouter::monitoring: remove module [puppet] - 10https://gerrit.wikimedia.org/r/751136 (https://phabricator.wikimedia.org/T272559) [15:54:08] (03CR) 10David Caro: mcrouter::monitoring: remove module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751136 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:54:10] (03PS15) 10Hashar: ci: Qemu image and 
snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [15:54:24] (03PS1) 10MVernon: swift: use rsyslog-rotate to get rsyslog to close old files [puppet] - 10https://gerrit.wikimedia.org/r/763541 (https://phabricator.wikimedia.org/T301657) [15:55:04] (03PS3) 10David Caro: parsoid: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751163 (https://phabricator.wikimedia.org/T272559) [15:56:27] (03CR) 10Hashar: "I have made the last qemu-img create to preallocate disk space with:" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [15:56:31] (03PS3) 10Andrew Bogott: backy2: initialize backy2 database if necessary [puppet] - 10https://gerrit.wikimedia.org/r/763401 [15:56:33] (03PS4) 10Andrew Bogott: backy2: don't back up shelved instances [puppet] - 10https://gerrit.wikimedia.org/r/763345 [15:56:35] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33836/console" [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [16:00:07] (03CR) 10Giuseppe Lavagetto: conftool: add request-actions / request-patterns (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/763486 (owner: 10Giuseppe Lavagetto) [16:03:09] (03PS2) 10Elukey: install_server: set new partman recipe for kubestage2001 [puppet] - 10https://gerrit.wikimedia.org/r/763539 (https://phabricator.wikimedia.org/T300744) [16:03:11] (03PS2) 10Elukey: Add overlayfs settings for kubestage2001 [puppet] - 10https://gerrit.wikimedia.org/r/763540 (https://phabricator.wikimedia.org/T300744) [16:05:08] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on February 16, 2022. - https://phabricator.wikimedia.org/T301995 (10Zabe) Since there seems to be a valid, I did the same mitigation as in T271808 and T293070. 
` root@deployment-cache-up... [16:05:52] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P20988 and previous config saved to /var/cache/conftool/dbconfig/20220217-160551-kormat.json [16:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:34] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on February 16, 2022. - https://phabricator.wikimedia.org/T301995 (10AlexisJazz) >>! In T301995#7718593, @Zabe wrote: > Since there seems to be a valid certificate, I did the same mitigat... [16:08:23] (03CR) 10JMeybohm: [C: 03+1] install_server: set new partman recipe for kubestage2001 [puppet] - 10https://gerrit.wikimedia.org/r/763539 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [16:09:36] (03CR) 10Elukey: [C: 03+2] install_server: set new partman recipe for kubestage2001 [puppet] - 10https://gerrit.wikimedia.org/r/763539 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [16:09:39] (03CR) 10JMeybohm: [C: 03+1] Add overlayfs settings for kubestage2001 [puppet] - 10https://gerrit.wikimedia.org/r/763540 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [16:11:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33837/console" [puppet] - 10https://gerrit.wikimedia.org/r/763311 (owner: 10JHathaway) [16:17:43] (03CR) 10Elukey: [C: 03+2] Add overlayfs settings for kubestage2001 [puppet] - 10https://gerrit.wikimedia.org/r/763540 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [16:18:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33839/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763311 (owner: 10JHathaway) [16:18:35] (03CR) 10Jbond: [V: 03+1 C: 03+1] "not 
blocking but would also be nice to have tests for theses (perhaps can be handled when moving to namespaced functions)" [puppet] - 10https://gerrit.wikimedia.org/r/763311 (owner: 10JHathaway) [16:18:42] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (NOOP 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33838/console" [puppet] - 10https://gerrit.wikimedia.org/r/763311 (owner: 10JHathaway) [16:19:44] (03CR) 10Jbond: [V: 03+1 C: 03+1] Remove ordered_json function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763309 (owner: 10JHathaway) [16:20:55] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubestage2001.codfw.wmnet with OS bullseye [16:20:56] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300774)', diff saved to https://phabricator.wikimedia.org/P20989 and previous config saved to /var/cache/conftool/dbconfig/20220217-162056-kormat.json [16:20:58] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [16:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [16:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:04] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T300774)', diff saved to https://phabricator.wikimedia.org/P20990 and previous config saved to /var/cache/conftool/dbconfig/20220217-162104-kormat.json [16:21:05] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [16:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:26] (03CR) 10EllenR: 
[C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [16:21:34] (03PS1) 10Ayounsi: Add drmrs routers [homer/public] - 10https://gerrit.wikimedia.org/r/763551 (https://phabricator.wikimedia.org/T300277) [16:21:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/762934 (owner: 10Volans) [16:22:24] (03CR) 10jerkins-bot: [V: 04-1] Add drmrs routers [homer/public] - 10https://gerrit.wikimedia.org/r/763551 (https://phabricator.wikimedia.org/T300277) (owner: 10Ayounsi) [16:22:49] (03PS7) 10Eigyan: [wmf-config]: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) [16:23:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:23:56] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: check password correctness [cookbooks] - 10https://gerrit.wikimedia.org/r/762934 (owner: 10Volans) [16:25:17] (03CR) 10Jbond: [C: 03+1] mcrouter::monitoring: remove module [puppet] - 10https://gerrit.wikimedia.org/r/751136 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:25:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:25:31] (03PS2) 10Ayounsi: Add drmrs routers [homer/public] - 10https://gerrit.wikimedia.org/r/763551 (https://phabricator.wikimedia.org/T300277) [16:26:46] (03Merged) 10jenkins-bot: sre.hosts.provision: check password correctness [cookbooks] - 10https://gerrit.wikimedia.org/r/762934 (owner: 10Volans) [16:27:51] !log ayounsi@cumin1001 
START - Cookbook sre.dns.netbox [16:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:22] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:30:26] (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:31:06] (03PS2) 10MVernon: swift: use rsyslog-rotate to get rsyslog to close old files [puppet] - 10https://gerrit.wikimedia.org/r/763541 (https://phabricator.wikimedia.org/T301657) [16:31:55] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10JBennett) Approved. [16:32:50] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10JBennett) approved [16:32:52] (03PS1) 10Accraze: ml-services: add elwiki, enwiktionary, eswikibooks [deployment-charts] - 10https://gerrit.wikimedia.org/r/763556 (https://phabricator.wikimedia.org/T301415) [16:33:17] (03PS12) 10AGueyte: Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [16:33:46] (03CR) 10jerkins-bot: [V: 04-1] Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [16:35:18] (03CR) 10David Caro: [C: 03+2] mcrouter::monitoring: remove module [puppet] - 10https://gerrit.wikimedia.org/r/751136 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:35:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:38:00] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:38:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] pybal::testing: remove unused role/profile [puppet] - 10https://gerrit.wikimedia.org/r/751709 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:39:26] (03CR) 10David Caro: [C: 03+2] pybal::testing: remove unused role/profile [puppet] - 10https://gerrit.wikimedia.org/r/751709 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:39:43] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage [16:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:40:26] (KubernetesCalicoDown) resolved: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:40:56] (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:41:15] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [16:42:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
kubestage2001.codfw.wmnet with reason: host reimage [16:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:15] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm for new host datahubsearch1002.eqiad.wmnet [16:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:18] (03CR) 10David Caro: [C: 03+2] profile::nutcracker: remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/751701 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:43:09] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [16:45:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:45:56] (KubernetesCalicoDown) resolved: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:46:56] (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:47:06] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:47] (03PS1) 10Ladsgroup: Revert "db1146: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/763302 [16:49:05] (03PS2) 10Ladsgroup: Revert "db1146: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/763302 [16:49:08] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1146: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/763302 (owner: 10Ladsgroup) [16:50:13] (03PS5) 10Giuseppe Lavagetto: conftool: add request-actions / 
request-patterns [puppet] - 10https://gerrit.wikimedia.org/r/763486 [16:50:15] (03PS1) 10Giuseppe Lavagetto: [draft] varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 [16:50:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:51:24] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/763541 (https://phabricator.wikimedia.org/T301657) (owner: 10MVernon) [16:51:49] 10SRE, 10ops-eqiad: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (10Jclark-ctr) @wiki_willy @RobH we need cables for old cage to finish connection 4x 20m. SC/LC fibers 8x 40GBaseLR optics 8x 10GBase-LR 4x 15m SC/LC fibers [16:51:56] (KubernetesCalicoDown) resolved: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:52:05] this is me reimaging --^ [16:52:56] (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:53:00] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:53:56] (03PS1) 10David Caro: Remove unused module xvfb [puppet] - 10https://gerrit.wikimedia.org/r/763561 (https://phabricator.wikimedia.org/T272559) [16:55:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets 
- https://alerts.wikimedia.org [16:58:44] 10SRE-swift-storage, 10Observability-Logging, 10Patch-For-Review, 10User-fgiunchedi: Missed swift log rotation can lead to full root filesystem - https://phabricator.wikimedia.org/T301657 (10MatthewVernon) To answer my own question, the bullseye version in the package is using `copytruncate`, which copies... [17:00:04] jbond and rzl: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220217T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:01:26] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [17:02:17] (03CR) 10Cwhite: [C: 03+2] grafana-next: set grafana codfw base domain to grafana next [puppet] - 10https://gerrit.wikimedia.org/r/763329 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [17:05:25] (03PS1) 10Elukey: Change docker package name for kubestage2001 [puppet] - 10https://gerrit.wikimedia.org/r/763562 (https://phabricator.wikimedia.org/T300744) [17:09:27] !log stop advertising drmrs from esams [17:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:52] (03PS1) 10Vgutierrez: prometheus: Aggreation rules for HAProxy TTFB [puppet] - 10https://gerrit.wikimedia.org/r/763566 (https://phabricator.wikimedia.org/T290005) [17:11:31] !log razzi@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host datahubsearch1002.eqiad.wmnet [17:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:57] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm for new host datahubsearch1002.eqiad.wmnet [17:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:07] (03CR) 10Elukey: [C: 03+2] Change docker package name for 
kubestage2001 [puppet] - 10https://gerrit.wikimedia.org/r/763562 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [17:12:55] (03PS1) 10Ladsgroup: db1105: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/763568 (https://phabricator.wikimedia.org/T300510) [17:14:04] (03PS2) 10Ladsgroup: db1105: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/763568 (https://phabricator.wikimedia.org/T300510) [17:16:10] (03PS3) 10Ayounsi: Add drmrs routers [homer/public] - 10https://gerrit.wikimedia.org/r/763551 (https://phabricator.wikimedia.org/T300277) [17:17:56] (KubernetesCalicoDown) resolved: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [17:18:00] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:19:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2001.codfw.wmnet with OS bullseye [17:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:45] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1105: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/763568 (https://phabricator.wikimedia.org/T300510) (owner: 10Ladsgroup) [17:19:47] (03CR) 10Ayounsi: [C: 03+2] "Already pushed from my laptop, merging to not have diff with esams. Feel free to leave post-merge comments if needed." 
[homer/public] - 10https://gerrit.wikimedia.org/r/763551 (https://phabricator.wikimedia.org/T300277) (owner: 10Ayounsi) [17:19:52] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes-staging,service=kubesvc [17:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:25] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300774)', diff saved to https://phabricator.wikimedia.org/P20991 and previous config saved to /var/cache/conftool/dbconfig/20220217-172124-kormat.json [17:21:26] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [17:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:31] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [17:21:57] (03Merged) 10jenkins-bot: Add drmrs routers [homer/public] - 10https://gerrit.wikimedia.org/r/763551 (https://phabricator.wikimedia.org/T300277) (owner: 10Ayounsi) [17:24:41] (03CR) 10David Caro: backy2: don't back up shelved instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763345 (owner: 10Andrew Bogott) [17:24:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [17:24:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [17:25:00] (03PS7) 10Razzi: analytics_cluster::datahub::opensearch: start of puppet role [puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) [17:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T300510)', diff saved to 
https://phabricator.wikimedia.org/P20992 and previous config saved to /var/cache/conftool/dbconfig/20220217-172504-ladsgroup.json [17:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:11] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [17:25:20] (03CR) 10Razzi: [V: 03+1] "Updated the patch, still going with a 1-node cluster as I spin up the other machines, let me know if that's alright for now." [puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [17:26:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T300510)', diff saved to https://phabricator.wikimedia.org/P20993 and previous config saved to /var/cache/conftool/dbconfig/20220217-172650-ladsgroup.json [17:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:28] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [17:29:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1105.eqiad.wmnet with OS bullseye [17:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:33:00] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:35:12] (03PS1) 10Ebernhardson: cirrus: Alert when the rate of pages fixed by Saneitizer is too high [puppet] - 10https://gerrit.wikimedia.org/r/763573 (https://phabricator.wikimedia.org/T295365) [17:36:30] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P20994 and previous config saved to /var/cache/conftool/dbconfig/20220217-173630-kormat.json [17:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:46] (03CR) 10jerkins-bot: [V: 04-1] cirrus: Alert when the rate of pages fixed by Saneitizer is too high [puppet] - 10https://gerrit.wikimedia.org/r/763573 (https://phabricator.wikimedia.org/T295365) (owner: 10Ebernhardson) [17:39:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1105.eqiad.wmnet with reason: host reimage [17:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:57] 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 
(10ayounsi) Current status: * Physical work left (I'll give the details tomorrow): ** Planned: move Telia's link to the routers now that w... [17:41:01] (03CR) 10Razzi: [C: 03+2] analytics_cluster::datahub::opensearch: start of puppet role [puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [17:42:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1105.eqiad.wmnet with reason: host reimage [17:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:17] (03CR) 10Herron: [C: 03+1] remove deprecated piechart plugin [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763334 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [17:42:34] (03CR) 10Herron: [C: 03+1] update grafana-simple-json-datasource to 1.4.2 [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763337 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [17:42:50] (03CR) 10Herron: [C: 03+1] update grafana-image-renderer to 3.3.0 [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763335 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [17:43:08] (03PS1) 10Razzi: datahub::opensearch: Fix sdd typo to be ssd [puppet] - 10https://gerrit.wikimedia.org/r/763575 (https://phabricator.wikimedia.org/T301382) [17:44:34] (03CR) 10Razzi: [C: 03+2] datahub::opensearch: Fix sdd typo to be ssd [puppet] - 10https://gerrit.wikimedia.org/r/763575 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [17:45:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:50:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:50:51] 10ops-eqiad, 10DC-Ops: cloudvirt1017.mgmt/SSH - https://phabricator.wikimedia.org/T302016 (10mdipietro) [17:51:35] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P20995 and previous config saved to /var/cache/conftool/dbconfig/20220217-175135-kormat.json [17:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:00] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:53:28] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-test-coord1001.eqiad.wmnet with reason: Still troubleshooting mariadb issues [17:53:28] !log razzi@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on an-test-coord1001.eqiad.wmnet with reason: Still troubleshooting mariadb issues [17:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:44] (03PS2) 10Elukey: ml-services: add elwiki, enwiktionary, eswikibooks [deployment-charts] - 10https://gerrit.wikimedia.org/r/763556 (https://phabricator.wikimedia.org/T301415) (owner: 10Accraze) [17:53:46] (03PS1) 10Elukey: kserve-inference: improve the revscoring_inference_service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/763580 [17:54:10] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on datahubsearch1001.eqiad.wmnet with reason: Node is being set up for first time and puppet run failed [17:54:12] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 
datahubsearch1001.eqiad.wmnet with reason: Node is being set up for first time and puppet run failed [17:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:38] (03PS13) 10AGueyte: Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [17:54:52] (03CR) 10Andrew Bogott: [C: 03+2] backy2: initialize backy2 database if necessary [puppet] - 10https://gerrit.wikimedia.org/r/763401 (owner: 10Andrew Bogott) [17:55:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:56:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1105.eqiad.wmnet with OS bullseye [17:56:13] (03CR) 10jerkins-bot: [V: 04-1] Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [17:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:24] (03PS3) 10Elukey: ml-services: add elwiki, enwiktionary, eswikibooks [deployment-charts] - 10https://gerrit.wikimedia.org/r/763556 (https://phabricator.wikimedia.org/T301415) (owner: 10Accraze) [17:57:23] (03PS5) 10Andrew Bogott: backy2: don't back up shelved instances [puppet] - 10https://gerrit.wikimedia.org/r/763345 [18:00:04] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220217T1800). 
[18:06:40] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300774)', diff saved to https://phabricator.wikimedia.org/P20997 and previous config saved to /var/cache/conftool/dbconfig/20220217-180639-kormat.json [18:06:41] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [18:06:43] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [18:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:46] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [18:06:47] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T300774)', diff saved to https://phabricator.wikimedia.org/P20998 and previous config saved to /var/cache/conftool/dbconfig/20220217-180647-kormat.json [18:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:56] (03PS14) 10AGueyte: Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [18:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:23] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:09:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T300510)', diff saved to https://phabricator.wikimedia.org/P20999 and previous config saved to /var/cache/conftool/dbconfig/20220217-180900-ladsgroup.json [18:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:09:05] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [18:10:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:11:01] (03PS10) 10JHathaway: Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 [18:11:45] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:13:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:16:16] (03PS2) 10ArielGlenn: add Hannah Okwelum to platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/763456 (https://phabricator.wikimedia.org/T301876) [18:17:09] (03CR) 10Accraze: [C: 03+1] kserve-inference: improve the revscoring_inference_service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/763580 (owner: 10Elukey) [18:18:18] (03CR) 10ArielGlenn: [C: 03+2] add Hannah Okwelum to platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/763456 (https://phabricator.wikimedia.org/T301876) (owner: 10ArielGlenn) [18:18:44] (03PS2) 10Elukey: kserve-inference: improve the revscoring_inference_service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/763580 [18:18:46] (03PS4) 10Elukey: ml-services: add elwiki, enwiktionary, eswikibooks [deployment-charts] - 10https://gerrit.wikimedia.org/r/763556 (https://phabricator.wikimedia.org/T301415) (owner: 10Accraze) [18:23:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:23:59] (03CR) 10Elukey: [C: 03+2] kserve-inference: improve the revscoring_inference_service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/763580 (owner: 10Elukey) [18:24:04] (03CR) 10Elukey: [C: 03+2] ml-services: add elwiki, enwiktionary, eswikibooks [deployment-charts] - 10https://gerrit.wikimedia.org/r/763556 (https://phabricator.wikimedia.org/T301415) (owner: 10Accraze) [18:24:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P21000 and previous config saved to /var/cache/conftool/dbconfig/20220217-182405-ladsgroup.json [18:24:05] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21001 and previous config saved to /var/cache/conftool/dbconfig/20220217-182405-kormat.json [18:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:31:35] !log accraze@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [18:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:06] 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10RobH) I can put in a followup ticket for them to correct the 'unplanned' items but I'll wait until you finish your setup or give the go a... 
[18:33:14] 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10RobH) a:05RobH→03ayounsi [18:34:39] !log accraze@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [18:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:12] 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10RobH) No packing slips in box according to the Interxion engineer who did our remote hands work, so I've requested a copy from Myriad so... [18:36:30] (03PS1) 10Papaul: Add restbase-dev200[1-3] to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/763583 (https://phabricator.wikimedia.org/T299437) [18:36:55] (03CR) 10Andrew Bogott: [C: 03+2] backy2: don't back up shelved instances [puppet] - 10https://gerrit.wikimedia.org/r/763345 (owner: 10Andrew Bogott) [18:37:41] (03PS1) 10Majavah: toolsdb: enable pt-heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/763584 [18:38:35] (03CR) 10Papaul: [C: 03+2] Add restbase-dev200[1-3] to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/763583 (https://phabricator.wikimedia.org/T299437) (owner: 10Papaul) [18:39:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P21002 and previous config saved to /var/cache/conftool/dbconfig/20220217-183909-ladsgroup.json [18:39:10] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21003 and previous config saved to /var/cache/conftool/dbconfig/20220217-183910-kormat.json [18:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:17] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:19] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33841/console" [puppet] - 10https://gerrit.wikimedia.org/r/763584 (owner: 10Majavah) [18:49:05] (03PS1) 10Razzi: datahub::opensearch: Change curator version to 5.8.1-1 for [puppet] - 10https://gerrit.wikimedia.org/r/763587 (https://phabricator.wikimedia.org/T301382) [18:49:19] (03PS2) 10Razzi: datahub::opensearch: Change curator version to 5.8.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/763587 (https://phabricator.wikimedia.org/T301382) [18:50:11] (03CR) 10jerkins-bot: [V: 04-1] datahub::opensearch: Change curator version to 5.8.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/763587 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [18:50:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase-dev2001.codfw.wmnet with OS buster [18:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:00] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, and 2 others: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase-dev2001.codfw.w... [18:52:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10Cmjohnson) @hnowlan I am having issues with partman for these servers. Can you verify t... 
[18:52:39] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:54:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T300510)', diff saved to https://phabricator.wikimedia.org/P21004 and previous config saved to /var/cache/conftool/dbconfig/20220217-185414-ladsgroup.json [18:54:15] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300774)', diff saved to https://phabricator.wikimedia.org/P21005 and previous config saved to /var/cache/conftool/dbconfig/20220217-185414-kormat.json [18:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:21] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [18:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:26] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [18:59:06] 10SRE, 10ops-eqiad, 10DC-Ops: cloudvirt1017.mgmt/SSH - https://phabricator.wikimedia.org/T302016 (10Cmjohnson) This will require either a hard reboot/power off or replacing the cable. I will attempt the cable first. [18:59:48] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) [19:00:04] hashar and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220217T1900). 
[19:02:46] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:02] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10Cmjohnson) @cmooney don't forget that 1012 is in the new cage, it could take awhile to get that going. [19:04:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase-dev2002.codfw.wmnet with OS buster [19:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:04] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, and 2 others: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase-dev2002.codfw.w... [19:07:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T300510)', diff saved to https://phabricator.wikimedia.org/P21006 and previous config saved to /var/cache/conftool/dbconfig/20220217-190748-ladsgroup.json [19:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:53] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [19:08:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase-dev2001.codfw.wmnet with reason: host reimage [19:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase-dev2001.codfw.wmnet with reason: host reimage [19:11:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:15:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:18:00] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:20:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase-dev2001.codfw.wmnet with OS buster [19:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase-dev2001.codfw.wmnet... [19:21:51] (03PS3) 10Cathal Mooney: New function and changes to wmf-netbox plugin to support EVPN config. 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/760566 (https://phabricator.wikimedia.org/T299758) [19:22:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase-dev2002.codfw.wmnet with reason: host reimage [19:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P21007 and previous config saved to /var/cache/conftool/dbconfig/20220217-192252-ladsgroup.json [19:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:24:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase-dev2003.codfw.wmnet with OS buster [19:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase-dev2003.codfw.w... 
[19:25:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:26:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase-dev2002.codfw.wmnet with reason: host reimage [19:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:30:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10Papaul) [19:33:00] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:35:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:35:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase-dev2002.codfw.wmnet with OS buster [19:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host 
restbase-dev2002.codfw.wmnet... [19:37:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P21008 and previous config saved to /var/cache/conftool/dbconfig/20220217-193757-ladsgroup.json [19:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase-dev2003.codfw.wmnet with reason: host reimage [19:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase-dev2003.codfw.wmnet with reason: host reimage [19:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T300510)', diff saved to https://phabricator.wikimedia.org/P21009 and previous config saved to /var/cache/conftool/dbconfig/20220217-195302-ladsgroup.json [19:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:09] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [19:54:21] (03CR) 10JHathaway: Remove ordered_yaml function (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [19:54:53] (03CR) 10JHathaway: [C: 03+2] Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [19:54:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase-dev2003.codfw.wmnet with OS buster [19:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:00] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase-dev2003.codfw.wmnet... [19:57:00] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10Papaul) [20:02:28] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@66350a9]: (no justification provided) [20:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:31] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@66350a9]: (no justification provided) (duration: 02m 02s) [20:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [20:06:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10Papaul) 05Open→03Resolved @hnowlan this is complete [20:07:24] (03PS3) 10JHathaway: Remove ordered_json function [puppet] - 10https://gerrit.wikimedia.org/r/763309 [20:09:51] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01239 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:10:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [20:11:01] (03CR) 10Cwhite: [C: 04-1] "One of two ways is better, IMHO:" [puppet] - 10https://gerrit.wikimedia.org/r/763587 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi)
[20:15:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [20:17:31] (03PS2) 10JHathaway: Remove puppet:///files and move files to modules [puppet] - 10https://gerrit.wikimedia.org/r/763370 [20:20:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [20:23:26] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:15] (03CR) 10JHathaway: [C: 03+2] Remove ordered_json function [puppet] - 10https://gerrit.wikimedia.org/r/763309 (owner: 10JHathaway) [20:34:33] (03CR) 10JHathaway: [C: 03+2] Remove ordered_json function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763309 (owner: 10JHathaway) [20:38:30] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:43:03] (03PS1) 10Ssingh: dnsrecursor: allow outgoing IPv6 queries [puppet] - 10https://gerrit.wikimedia.org/r/763593 [20:43:52] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33845/console" [puppet] - 10https://gerrit.wikimedia.org/r/763593 (owner: 10Ssingh) [20:44:47] (03CR) 10Ssingh: [V: 03+1] "NOOP on existing hosts, as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/763593 (owner: 10Ssingh) [20:45:01] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsrecursor: allow outgoing IPv6 queries [puppet] - 10https://gerrit.wikimedia.org/r/763593 (owner: 10Ssingh) [20:51:13] (03PS1) 10Ssingh: P:wikidough: enable IPv6 in backend recursor [puppet] - 10https://gerrit.wikimedia.org/r/763595 [20:51:54] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33846/console" [puppet] - 10https://gerrit.wikimedia.org/r/763595 (owner: 10Ssingh) [20:53:51] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:wikidough: enable IPv6 in backend recursor [puppet] - 10https://gerrit.wikimedia.org/r/763595 (owner: 10Ssingh) [20:55:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [20:55:52] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005382 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:58:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [20:59:08] (03PS3) 10JHathaway: Remove puppet:///files and move files to modules [puppet] - 10https://gerrit.wikimedia.org/r/763370 [21:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220217T2100). [21:00:05] eigyan: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:46] greetings everyone [21:03:30] (03CR) 10JHathaway: [C: 03+2] Remove puppet:///files and move files to modules [puppet] - 10https://gerrit.wikimedia.org/r/763370 (owner: 10JHathaway) [21:04:54] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:05:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:08:52] (03PS1) 10Jbond: Rakefile: Add sperate rake jobs for static/unit tests [puppet] - 10https://gerrit.wikimedia.org/r/763597 [21:09:13] (03PS3) 10Razzi: opensearch: make curator version bullseye compatible [puppet] - 10https://gerrit.wikimedia.org/r/763587 (https://phabricator.wikimedia.org/T301382) [21:10:12] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01561 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:14:11] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (10Papaul) [21:15:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:15:37] (03CR) 10Hashar: [C: 03+1] "I clearly have missed some cleaning up steps. Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/763561 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [21:19:34] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10Papaul) [21:19:51] !log razzi@cumin1001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=93) for new host datahubsearch1002.eqiad.wmnet [21:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:05] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:21:27] (03CR) 10Razzi: "Updated the patch." [puppet] - 10https://gerrit.wikimedia.org/r/763587 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [21:23:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:25:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:27:30] (03CR) 10Cwhite: [C: 03+1] "PCC checks out: https://puppet-compiler.wmflabs.org/pcc-worker1003/33848/" [puppet] - 10https://gerrit.wikimedia.org/r/763587 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [21:28:24] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) all 3 are completed [21:34:59] 10ops-eqiad, 10DC-Ops: eqiad: Unrack wmf3570 & wmf4579 - https://phabricator.wikimedia.org/T302034 (10wiki_willy) [21:38:52] RECOVERY - Widespread puppet agent failures on alert1001 is OK: 
(C)0.01 ge (W)0.006 ge 0.002691 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:50:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:51:43] greetings team, any updates on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/762881 [21:53:00] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:54:53] eigyan: it doesn't look like Jon's CR was fixed [21:55:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:55:37] "would suggest setting coverage to 1 as coverage is all users, not % of users who meet the other criteria. [21:55:37] 10% of users is likely too low given you are targetting minEdits of 5 on Farsi Wikipedia." [22:00:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:01:02] Thank you RhinosF1, the coverage is mandated to that specific audience [22:01:21] by request of Trust and Safety stakeholders [22:01:31] whom I am making this edit for [22:02:15] Does that make sense? [22:05:57] RhinosF1is your suggestion that I get approval from Trust and Safety stakeholders to change coverage and resubmit? 
[22:05:57] the concern is that it may not may not do what you expect [22:06:03] eigyan: you still need to respond to Jon [22:06:08] You cant simply ignore it [22:06:31] i don't care who told you to do it, I care that someone has raised a concern that hasn't been addressed [22:07:19] I am only 6 months with the foundation and learning new things each deploy, my apologies, I thought that was more context, not a mandate [22:07:33] (03CR) 10RhinosF1: [C: 04-1] "please answer Jon's suggested change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [22:07:49] Jdlrobson: fyi ^ [22:08:23] eigyan: I don't know of any organisation ever that's going to allow you to simply ignore a code review without further comment because so and so said [22:08:57] Sure thing [22:09:16] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:30] eigyan: if you are confused about a code review, you need to make that clear on the change. We're all here to help but we can only do that if you're honest and upfront with us. 
[22:11:43] I do apologise for the fact that no one answered during the window though [22:11:47] Sometimes people get busy [22:11:48] (03PS1) 10Andrew Bogott: Openstack Cinder and Nova: tweak cgroup kernel settings on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/763605 (https://phabricator.wikimedia.org/T281276) [22:11:51] (03CR) 10Eigyan: [wmf-config]: Deploy the fawiki test safety survey to production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [22:12:00] I think you missed the jouncebot ping by a few seconds [22:12:25] hey RhinosF1 and eigyan :) [22:12:29] you can always ping a deployer if no one shows to double check [22:12:32] Hey mepps [22:12:54] (03PS3) 10JHathaway: ini(), php_ini(): convert to modern Ruby function API [puppet] - 10https://gerrit.wikimedia.org/r/763311 (https://phabricator.wikimedia.org/T265138) [22:13:01] yeah the call for coverage set to 0.1 was a stakeholder decision [22:13:12] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:13:32] mepps: can we name stakeholders [22:13:47] Or are they able to explain why [22:14:20] in the patch Rhinosf1 or here? 
i believe it would be TAndic, let me make sure she's on gerrit too [22:14:32] mepps: on the patch is best [22:14:37] For transparency [22:15:05] sounds good RhinosF1, i'm also trying to see if it was documented in phab [22:15:08] You'll have to reschedule the change anyway because a) its C-1'd and b) no one was around to deploy [22:15:17] 10SRE, 10ops-eqiad: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (10cmooney) [22:15:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:16:02] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:16:38] * RhinosF1 is not seeing a gerrit account [22:17:25] (03CR) 10Dzahn: "yea, eh,, see my original comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/597016 I wasn't sure back then either but have " [puppet] - 10https://gerrit.wikimedia.org/r/763561 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [22:18:47] Ok team, thanks for the feedback. Is it ok to sign off now? [22:19:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/763605 (https://phabricator.wikimedia.org/T281276) (owner: 10Andrew Bogott) [22:20:15] eigyan: no [22:20:20] (03CR) 10Mepps: [wmf-config]: Deploy the fawiki test safety survey to production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [22:20:21] Not until jon responds [22:20:31] Cool [22:20:32] mepps: hi! checked the user name, she has "TAndic" [22:20:32] they'll be no deploys until Monday now [22:20:40] Cool [22:20:41] RhinosF1 Is jon around? 
[22:20:48] mutante: I tried to request their review and couldn't find them [22:20:48] mepps: or .. I mean.. she could use that to login without having to register anything [22:20:59] it's the same as the wikitech user [22:21:02] mepps: he was pinged here, so...likely not at IRC at least [22:21:10] thanks mutante [22:21:18] yw [22:21:19] mepps: I have pinged him earlier up, he'll show if he is. I also brought the patch to his attention on gerrit. [22:21:41] thanks RhinosF1, sounds like we need to pause for the night and reschedule for another deploy window [22:21:48] Yes [22:21:59] The window closed anyway a bit ago [22:22:03] the window's officially over too (sorry, didn't get here earlier) [22:22:10] It would have to be Monday as we don't deploy on Fridays [22:22:29] Thanks RhinosF1 and urbanecm [22:22:57] Np, happy to give advice [22:23:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:24:59] Can I help with anything else, I understand the deploy is not happening. 
[22:25:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:25:42] I'm around a bit longer if you have questions eigyan [22:25:45] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:51] In general about deploys [22:26:02] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:26:27] the "no deploys on Friday rule" is not just us but more like a general industry thing [22:27:37] Thanks RhinosF1 I am going to sign off for the night. Thanks for everyones help. [22:27:44] Okay [22:27:48] Have a good weekend [22:27:58] cheers eigyan [22:28:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:13] fyi i just merged a patch on cookbook to remove datahubsearche1002 [22:30:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:31:57] razzi: ^ papaul's message is for you ... 
I guess [22:32:55] papaul: I imagine expected because of the spelling error [22:33:12] It should be datahubsearch without the last E I think [22:33:51] (03PS1) 10JHathaway: WIP: Deprecated types :( [puppet] - 10https://gerrit.wikimedia.org/r/763611 [22:34:01] RhinosF1: indeed without the last E [22:34:40] (03CR) 10jerkins-bot: [V: 04-1] WIP: Deprecated types :( [puppet] - 10https://gerrit.wikimedia.org/r/763611 (owner: 10JHathaway) [22:34:46] (03CR) 10JHathaway: [C: 03+2] ini(), php_ini(): convert to modern Ruby function API [puppet] - 10https://gerrit.wikimedia.org/r/763311 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway) [22:35:05] (03CR) 10JHathaway: [C: 03+2] ini(), php_ini(): convert to modern Ruby function API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763311 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway) [22:35:34] papaul: I left a message in their channel too [22:36:05] (03PS1) 10Andrew Bogott: dnsrecursor: change webserver listening address [puppet] - 10https://gerrit.wikimedia.org/r/763612 (https://phabricator.wikimedia.org/T300254) [22:38:10] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:40:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:40:34] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:41] (03CR) 10Andrew Bogott: [C: 03+2] dnsrecursor: change webserver listening address [puppet] - 10https://gerrit.wikimedia.org/r/763612 (https://phabricator.wikimedia.org/T300254) (owner: 10Andrew Bogott) [22:43:00] 
(JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:47:46] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:50:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:51:24] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:51:47] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Cinder and Nova: tweak cgroup kernel settings on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/763605 (https://phabricator.wikimedia.org/T281276) (owner: 10Andrew Bogott) [22:53:00] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:53:15] papaul: razzi confirmed it fine [22:55:13] yep, sorry for any confusion, tried to create a datahubsearch1002 earlier but the command froze, and I never circled back around to clean up [22:57:26] The extra "e" in the name confuses me however, looking in my command history I never had that typo 
in the name [23:00:08] razzi: I think the typo was just here on IRC. but thing is..it's somehow not in DNS with either variant ..it seems [23:00:45] even though we would expect it to be now after papaul merged that [23:01:22] And I saw your change at https://phabricator.wikimedia.org/rONED7016edd1945493000dcc877db6f2f56509d5cdf5 and copied it from there [23:01:58] razzi: nevermind, it works when I try from another host [23:02:07] datahubsearch1002.eqiad.wmnet has address 10.64.16.38 [23:02:07] Host datahubsearch1002.eqiad.wmnet not found: 3(NXDOMAIN) [23:02:07] Host datahubsearch1002.eqiad.wmnet not found: 3(NXDOMAIN) [23:02:10] well.. partially [23:02:16] as if syncing was interrupted [23:02:55] Hm, ok thanks for that context mutante [23:03:39] try: dig datahubsearch1002.eqiad.wmnet @ns0.wikimedia.org and then replace ns0 with ns1 and ns2 [23:03:58] hmm.. maybe best to ask traffic to check the sync status [23:04:08] before messing with it [23:04:33] or run the DNS cookbook one more time [23:04:47] I never even installed an os on the vm, can definitely destroy it and start again with datahubsearch1003 [23:04:56] this could match what you said about the process hanging [23:05:00] PROBLEM - Ensure legal html en.wp on en.wikipedia.org is CRITICAL: Text\sis\savailable\sunder\sthe\sa\srel=license\s+href=(https:)?\/\/en.wikipedia.org\/wiki\/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_LicenseCreative\sCommons\sAttribution-ShareAlike\sLicense/aa\srel=license\shref=\/\/creativecommons.org\/licenses\/by-sa\/3\.0/ html not found https://phabricator.wikimedia.org/project/members/28/ [23:05:15] ^ duh.. 
those are content checks on wiki [23:05:57] somebody edited a footer maybe [23:06:10] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:11] or the tickets had a 1 year downtime that just expired because those used to be open tickets [23:06:48] razzi: that's worth a try since it includes the DNS cookbook as well and good test if it happens again [23:06:48] (03PS2) 10JHathaway: Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) [23:07:10] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway) [23:07:30] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:07:30] 10SRE, 10DNS, 10Traffic, 10WMSE (IT): Need Assistance adding DNS records to claim domain - https://phabricator.wikimedia.org/T300076 (10MRamirez_WMF) @jbond I have the records to finalize the transition TXT name Copy record@ (or skip if not supported by provider) TXT value MS=ms70322281 TTL 3600 (or y... [23:07:34] (03CR) 10jerkins-bot: [V: 04-1] Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway) [23:10:01] yup, this is what caused the legal html alert https://en.wikipedia.org/w/index.php?title=MediaWiki:Wikimedia-copyright&diff=1072378717&oldid=861624479&diffmode=source [23:10:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [23:11:26] AntiComposite: should it be reverted? 
[23:11:38] probably not [23:11:55] CC license statements are supposed to have the version in them [23:12:44] ah, Thanks for that AntiComposite [23:12:53] then the monitoring needs to be adjusted [23:13:08] (03PS3) 10JHathaway: Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) [23:13:14] hardcoding a version isn't ideal for a check like that [23:13:44] (03CR) 10jerkins-bot: [V: 04-1] Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway) [23:14:33] (03CR) 10Subramanya Sastry: [C: 03+1] parsoid: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751163 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [23:15:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [23:17:16] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:17:33] 10SRE, 10WMF-Legal, 10observability: monitoring alert: legal footer change on en.wikipedia - due to creative commons license version change - https://phabricator.wikimedia.org/T302045 (10Dzahn) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please... 
[23:18:00] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [23:18:05] (03CR) 10Jdlrobson: [wmf-config]: Deploy the fawiki test safety survey to production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [23:18:45] 10SRE, 10WMF-Legal, 10observability: monitoring alert: legal footer change on en.wikipedia - due to creative commons license version change - https://phabricator.wikimedia.org/T302045 (10Dzahn) @Herald well, understood, but this is a an alert that Legal once requested to be notified on and it links to a Phab... [23:20:10] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:10] 10SRE, 10Icinga, 10WMF-Legal, 10observability: monitoring alert: legal footer change on en.wikipedia - due to creative commons license version change - https://phabricator.wikimedia.org/T302045 (10Dzahn) [23:20:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [23:21:06] mutante, don't know what complaining at Herald's going to do for you :) [23:21:12] 10SRE, 10Icinga, 10WMF-Legal, 10observability: monitoring alert: legal footer change on en.wikipedia - due to creative commons license version change - https://phabricator.wikimedia.org/T302045 (10Dzahn) [23:21:43] AntiComposite: :) gets it off my chest [23:21:57] but I take my comment back that "hardcoding a version is bad" [23:22:08] for a "legal check" like this maybe that is 
EXACTLY right [23:24:18] it was just a way to ask whether legal still wants that check and that phab workboard [23:24:29] since they have that herald rule [23:27:13] yeah, it looks like it's working as intended, as long as the alert is actually acted on [23:27:16] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:28:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [23:28:49] 10SRE, 10Icinga, 10WMF-Legal, 10observability: monitoring alert: legal footer change on en.wikipedia - due to creative commons license version change - https://phabricator.wikimedia.org/T302045 (10Dzahn) [23:32:14] ACKNOWLEDGEMENT - Ensure legal html en.wp on en.wikipedia.org is CRITICAL: Text\sis\savailable\sunder\sthe\sa\srel=license\s+href=(https:)?\/\/en.wikipedia.org\/wiki\/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_LicenseCreative\sCommons\sAttribution-ShareAlike\sLicense/aa\srel=license\shref=\/\/creativecommons.org\/licenses\/by-sa\/3\.0/ html not found daniel_zahn https://phabricator.wikimedia.org/T302045 https:/ [23:32:14] ator.wikimedia.org/project/members/28/ [23:34:12] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:41] 10SRE, 10observability: "ensure legal html" footer monitoring turned CRIT - https://phabricator.wikimedia.org/T119456 (10Dzahn) another one today because CC license version was changed to 3.0 created T302045 [23:35:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [23:37:01] 10SRE, 10Icinga, 10WMF-Legal, 10observability: monitoring alert: legal footer change on en.wikipedia - due to creative commons license version change - https://phabricator.wikimedia.org/T302045 (10Dzahn) related tickets: T108081 T119456 [23:40:56] (03CR) 10Dzahn: "adding Krinkle because of 09e84e65363e8e7c69ba28e8" [puppet] - 10https://gerrit.wikimedia.org/r/763561 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [23:44:47] (03CR) 10Dzahn: "general comment: Just because wmcs itself does not use a specific class does not mean nobody in cloud VPS is using classes though. It can " [puppet] - 10https://gerrit.wikimedia.org/r/751725 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [23:55:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org