[00:00:38] (CR) jerkins-bot: [V: -1] static.php: Improve docs and simplify/clarify some code [mediawiki-config] - https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle) [00:02:35] (PS1) MewOphaswongse: Structured task: Don't show dialog for confirming leaving suggestions mode upon rejection [extensions/GrowthExperiments] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/765356 (https://phabricator.wikimedia.org/T302463) [00:02:51] (PS2) Krinkle: static.php: Improve docs and simplify/clarify some code [mediawiki-config] - https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465) [00:03:35] SRE, Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (bking) Upon further testing, the startup script seems irrelevant. However, I have noticed a new problem: the service can't start after a reboot. HOWEVER, the service will sta... 
[00:04:05] (CR) jerkins-bot: [V: -1] static.php: Improve docs and simplify/clarify some code [mediawiki-config] - https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle) [00:05:09] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (Papaul) [00:06:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host elastic2079.mgmt.codfw.wmnet with reboot policy FORCED [00:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:49] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:18:45] (PS3) Krinkle: static.php: Improve docs and simplify/clarify some code [mediawiki-config] - https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465) [00:21:14] (CR) jerkins-bot: [V: -1] Structured task: Don't show dialog for confirming leaving suggestions mode upon rejection [extensions/GrowthExperiments] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/765356 (https://phabricator.wikimedia.org/T302463) (owner: MewOphaswongse) [00:21:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2073.codfw.wmnet with OS bullseye [00:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:22] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2073.codfw.wmnet with OS bullseye [00:23:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2079.mgmt.codfw.wmnet with reboot policy FORCED [00:23:18] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:47] PROBLEM - puppet last run on thanos-be1003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:27:31] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (Papaul) [00:28:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2074.codfw.wmnet with OS bullseye [00:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:56] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2074.codfw.wmnet with OS bullseye [00:38:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2073.codfw.wmnet with reason: host reimage [00:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2073.codfw.wmnet with reason: host reimage [00:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2074.codfw.wmnet with reason: host reimage [00:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2074.codfw.wmnet with reason: host reimage [00:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2073.codfw.wmnet with OS bullseye 
[00:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:42] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2073.codfw.wmnet with OS bullseye comple... [00:53:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2075.codfw.wmnet with OS bullseye [00:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:06] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2075.codfw.wmnet with OS bullseye [00:59:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2074.codfw.wmnet with OS bullseye [00:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:28] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2074.codfw.wmnet with OS bullseye comple... [01:00:05] twentyafterfour: I, the Bot under the Fountain, call upon thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220224T0100). 
[01:01:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2075.codfw.wmnet with OS bullseye [01:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:26] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2075.codfw.wmnet with OS bullseye [01:01:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2075.codfw.wmnet with OS bullseye [01:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:33] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2075.codfw.wmnet with OS bullseye execut... 
[01:05:27] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cloudcontrol1004, deneb, elastic2074, cloudcontrol1003, cloudcontrol1005 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [01:08:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2076.codfw.wmnet with OS bullseye [01:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:15] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2076.codfw.wmnet with OS bullseye [01:10:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2075.codfw.wmnet with reason: host reimage [01:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:53] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:13:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2075.codfw.wmnet with reason: host reimage [01:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2076.codfw.wmnet with reason: host reimage [01:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2075.codfw.wmnet with OS bullseye [01:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:06] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) 
rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2075.codfw.wmnet with OS bullseye comple... [01:27:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2077.codfw.wmnet with OS bullseye [01:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:36] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2077.codfw.wmnet with OS bullseye [01:28:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2076.codfw.wmnet with reason: host reimage [01:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:37:32] (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:38:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2076.codfw.wmnet with OS bullseye [01:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:03] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2076.codfw.wmnet with OS bullseye comple... 
[01:40:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2078.codfw.wmnet with OS bullseye [01:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:06] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2078.codfw.wmnet with OS bullseye [01:40:36] (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:44:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2077.codfw.wmnet with reason: host reimage [01:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2077.codfw.wmnet with reason: host reimage [01:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2078.codfw.wmnet with reason: host reimage [01:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2077.codfw.wmnet with OS bullseye [01:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:26] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2077.codfw.wmnet with OS bullseye comple... 
[02:00:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2078.codfw.wmnet with reason: host reimage [02:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2078.codfw.wmnet with OS bullseye [02:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:27] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2078.codfw.wmnet with OS bullseye comple... [02:16:09] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (Papaul) [02:25:32] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 80 probes of 665 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:36:43] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 61 probes of 665 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:37:00] > upstream connect error or disconnect/reset before headers. 
reset reason: overflow [02:37:16] persisting on refresh [02:38:21] now loading but slowly [02:39:31] phab is slow too [02:39:54] PROBLEM - Varnish has reduced HTTP availability #page on alert1001 is CRITICAL: job=varnish-text https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [02:39:54] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [02:40:42] hello! looking [02:41:52] back to the upstream connect error for me [02:41:56] RECOVERY - Varnish has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [02:41:56] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [02:43:57] rzl: need any help? 
[02:44:03] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:44:49] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [02:46:08] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [02:47:01] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [02:48:24] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [02:49:41] Tamzin, AntiComposite: still looking but all the graphs look recovered from here, are you still seeing any trouble? 
[02:50:25] Page loads feel marginally slower than usual, but mostly fine [02:50:33] looks up from here, let me consult the hive mind [02:51:13] ACN and I are in (from a global perspective) the same area, if that matters [02:51:36] (greater eqiad) [02:51:52] The People's Republic thereof [02:52:12] haha, I grew up in the people's republic of greater eqiad :) say hi to Roscoe for me [02:53:23] and, looks like this one affected everywhere equally but it's still good data, thank you [02:53:38] I'm still seeing this issue [02:53:43] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 78 probes of 665 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:53:47] oops yeah, I just picked it up again too [02:53:49] still looking [02:55:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [02:56:10] can confirm phab is quite slow on eqiad [02:56:23] (Also are y'all also sorta-philly area? :O) [02:56:45] I'm experiencing slowness across all of Wikimedia [02:57:25] perryprog: Will DM. (Not about privacy, just because important things happening.) 
[02:57:43] PROBLEM - Apache HTTP on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:57:55] SRE are working on this in another channel, in case the quiet in here was making anyone think we were ignoring it :) more updates in here soon [02:58:57] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:00:05] RECOVERY - Apache HTTP on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:00:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [03:01:25] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:13:31] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 57.26 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:14:01] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 36.14 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:18:19] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 95.03 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:18:49] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:20:44] okay, hopefully for real this time :) anyone still having trouble? [03:21:37] Every refresh makes my computer catch even more on fire than usual No, all good here :) [03:23:22] well in my professional opinion, stop refreshing [03:23:36] thanks for the reports! 
much appreciated [03:27:47] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 23.1 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:28:41] ^ expected [03:29:03] gotta love the "everything is fine now" alarm [03:29:56] it's my favorite kind of alarm! [03:30:31] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 59.45 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:30:59] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 36.78 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:32:29] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 59 probes of 665 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [03:32:39] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:32:59] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 96.11 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:33:27] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:37:45] SRE, User-Ladsgroup, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (PrimeHunter) This is happening again to me and others at enwiki. Reported at [[ https://en.wikipedia.org/wiki/Wikipedia:Village_pump... [03:41:10] SRE, User-Ladsgroup, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (RLazarus) Thanks for letting us know! We did indeed have this issue again for a few minutes earlier (intermittently between 02:36 and... [03:49:43] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 79 probes of 665 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:11:27] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:12:51] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:03:09] (PS1) KartikMistry: Update cxserver to 2022-02-24-035645-production [deployment-charts] - https://gerrit.wikimedia.org/r/765361 (https://phabricator.wikimedia.org/T301443) [05:13:15] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 60 probes of 665 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:24:37] PROBLEM - IPv6 ping to esams on 
ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 78 probes of 665 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:51:25] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:09:37] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 65 probes of 665 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:14:11] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:14:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:14:45] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:15:47] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:20:57] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 69 probes of 665 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:36:05] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces 
up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:36:09] !log T301030#7734236 running UpdateWeightedTags.php on eswiki [06:36:11] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:19] T301030: Account creation: getting articles about Argentina, Chile, and Mexico into the Suggested Edits module - https://phabricator.wikimedia.org/T301030 [06:58:23] RECOVERY - puppet last run on thanos-be1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:58:40] (CR) Santhosh: [C: +1] Update cxserver to 2022-02-24-035645-production [deployment-charts] - https://gerrit.wikimedia.org/r/765361 (https://phabricator.wikimedia.org/T301443) (owner: KartikMistry) [07:07:44] (PS1) Ayounsi: drmrs fix HE.net peers IP typos [homer/public] - https://gerrit.wikimedia.org/r/765364 [07:10:09] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:11:24] (CR) Ayounsi: [C: +2] "Pushed manually and they're now established." 
[homer/public] - https://gerrit.wikimedia.org/r/765364 (owner: Ayounsi) [07:11:39] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 19, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:11:57] (Merged) jenkins-bot: drmrs fix HE.net peers IP typos [homer/public] - https://gerrit.wikimedia.org/r/765364 (owner: Ayounsi) [07:17:17] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:18:24] jouncebot: refresh [07:18:25] I refreshed my knowledge about deployments. [07:18:32] jouncebot: next [07:18:32] In 0 hour(s) and 41 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220224T0800) [07:21:01] (CR) Ayounsi: [C: -1] "To get a +1: msw2-eqiad is the closest parent (and msw2 is child of msw1-eqiad)." [puppet] - https://gerrit.wikimedia.org/r/764791 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [07:22:44] SRE, Traffic, WMF-Legal, Performance-Team (Radar), Privacy: Consider disabling Chrome Lite pages for Wikipedia on Chrome on mobile with Cache-Control: no-transform - https://phabricator.wikimedia.org/T218618 (Peter) Open→Declined Chrome Lite pages will be removed with Chrome 100 https... 
[07:26:32] (CR) Ayounsi: [V: +1 C: +2] Icinga: add drmrs routers mgmt interface [puppet] - https://gerrit.wikimedia.org/r/764725 (owner: Ayounsi) [07:45:56] (CR) Ayounsi: [C: +2] drmrs: use BGP_aggregate_contributors for main prefixes [homer/public] - https://gerrit.wikimedia.org/r/765205 (owner: Ayounsi) [07:46:30] (Merged) jenkins-bot: drmrs: use BGP_aggregate_contributors for main prefixes [homer/public] - https://gerrit.wikimedia.org/r/765205 (owner: Ayounsi) [07:51:38] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:53:17] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:58:45] * apergos hopes urbanecm is around [07:59:08] * urbanecm waves back to apergos [07:59:12] whew! [07:59:33] wanna join the google meet for the training? [07:59:51] apergos: sure but not sure where to find it [08:00:04] Amir1 and apergos: May I have your attention please! UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220224T0800) [08:00:04] RhinosF1 and tgr: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:29] o/ [08:00:43] urbanecm: it's on deployment cal [08:00:45] I'm here. 
we have two trainees for the window and [08:00:53] * apergos reloads the deployment calendar [08:00:56] 2 patches [08:01:24] urbanecm: i sent you it in PM [08:02:18] the first one is a ubn for fr, the second looks fine to me [08:03:07] * RhinosF1 is doing the testing for fr - andy needed sleep apergos [08:03:14] thanks [08:05:40] let me know when ready to test [08:05:52] hello! [08:06:04] sure -- it will be a bit slower because of the training, but we're starting :) [08:06:11] hello mfossati! [08:06:37] RECOVERY - Check systemd state on ms-be2066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:40] (03PS1) 10RhinosF1: Revert "Show message fallback keys when using &uselang=qqx" [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765369 (https://phabricator.wikimedia.org/T302469) [08:07:07] had master linked for some reason apergos, ^ is the backport [08:07:16] ok that's good, thanks [08:09:47] PROBLEM - Check systemd state on ms-be2068 is CRITICAL: CRITICAL - degraded: The following units failed: proc-sys-fs-binfmt_misc.automount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:11] (03CR) 10Urbanecm: [C: 03+2] Structured task: Don't show dialog for confirming leaving suggestions mode upon rejection [extensions/GrowthExperiments] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765356 (https://phabricator.wikimedia.org/T302463) (owner: 10MewOphaswongse) [08:13:16] (03CR) 10Urbanecm: [C: 03+2] Revert "Show message fallback keys when using &uselang=qqx" [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765369 (https://phabricator.wikimedia.org/T302469) (owner: 10RhinosF1) [08:13:27] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10fgiunchedi) Thank you for your persistence on this @papaul, indeed the disk ordering issue is known :( we 
don't have a great story on how to re-init all disk... [08:13:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10akosiaris) [08:15:16] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10fgiunchedi) cc @MatthewVernon for visibility, as the hosts are otherwise good to go now [08:16:39] !log installing expat security updates [08:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:33] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:25:16] apergos: zuul is broken [08:25:18] (03CR) 10Hashar: [C: 03+1] ci: Qemu image and snapshot creation (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [08:25:26] WARN tar TAR_ENTRY_ERROR ENOSPC: no space left on device, open '/workspace/src/extensions/Wikibase/view/lib/wikibase-tainted-ref/node_modules/core-js/internals/engine-v8-version.js' [08:25:58] hashar: do you know how to unbreak jenkins [08:26:30] thanks for the heads up, and no I have no idea how to fix it [08:27:08] hi [08:27:16] what are the symptoms? [08:27:31] apergos: is it worth recheck when it fails or overriding the bot? 
[08:27:43] cause Zuul seems to work properly, there are jobs being triggered [08:27:45] hashar: https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/137679/console [08:28:00] that's 100% a jenkins/zuul issue [08:28:01] ah out of disk that is annoying [08:28:20] that one ran on integration-agent-docker-1028 [08:29:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[2|3] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10LSobanski) a:05LSobanski→03RobH Fine by me. [08:30:15] RhinosF1: essentially `recheck` the change [08:30:35] hashar: in this case, re-+2 i guess [08:31:04] urbanecm: yep, can you when it fails? [08:31:12] Sure [08:33:28] (03CR) 10jerkins-bot: [V: 04-1] Structured task: Don't show dialog for confirming leaving suggestions mode upon rejection [extensions/GrowthExperiments] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765356 (https://phabricator.wikimedia.org/T302463) (owner: 10MewOphaswongse) [08:35:07] tgr: your backport failed with an actual error [08:35:52] urbanecm: the build seems to have restarted [08:37:06] I guess we need to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Kartographer/+/764839 [08:37:35] :( [08:37:39] I wonder how the core patch worked [08:38:52] tgr: my patch failed before it started as one of the agent's was out of disk [08:39:19] (03PS1) 10Gergő Tisza: Disable broken test [extensions/Kartographer] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765370 (https://phabricator.wikimedia.org/T302360) [08:39:47] (03PS2) 10Urbanecm: Disable broken test [extensions/Kartographer] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765370 (https://phabricator.wikimedia.org/T302360) (owner: 10Gergő Tisza) [08:39:52] (03CR) 10Urbanecm: [C: 03+2] Disable broken test [extensions/Kartographer] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765370 
(https://phabricator.wikimedia.org/T302360) (owner: 10Gergő Tisza) [08:40:03] tgr: heh, we were doing it at the same time :) [08:40:31] (03PS2) 10Urbanecm: Structured task: Don't show dialog for confirming leaving suggestions mode upon rejection [extensions/GrowthExperiments] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765356 (https://phabricator.wikimedia.org/T302463) (owner: 10MewOphaswongse) [08:40:43] (03CR) 10Urbanecm: [C: 03+2] Structured task: Don't show dialog for confirming leaving suggestions mode upon rejection [extensions/GrowthExperiments] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765356 (https://phabricator.wikimedia.org/T302463) (owner: 10MewOphaswongse) [08:43:02] hashar: will the train window that's right after the B&C be used? Due to the failing tests, we might need to overrun the window a little bit. [08:43:34] urbanecm: in short no, you can extend as needed [08:43:39] awesome! [08:43:40] the train is run primarily by Dan ;) [08:43:42] thanks! [08:43:47] sounds perfect :) [08:44:59] i hope we haven't scared the trainees with this fun [08:47:16] (03Merged) 10jenkins-bot: Revert "Show message fallback keys when using &uselang=qqx" [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765369 (https://phabricator.wikimedia.org/T302469) (owner: 10RhinosF1) [08:47:27] finally! [08:47:32] \o/ [08:47:39] they're still here, I'm happy to say [08:48:15] i ready when you are with my test [08:50:24] RhinosF1: it's at mwdebug1001, can you test please? [08:51:07] doing [08:51:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:45] urbanecm: nope, I'm getting a new error [08:51:52] on https://en.wikipedia.org/wiki/Wikipedia?banner=B2122_0131_itIT_dsk_p2_sm_twin1&force=1&country=US [08:52:00] RhinosF1: are you saying it doesn't work and we should revert? 
[08:52:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:52:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:40] urbanecm: let me try the other banner first [08:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:45] sure [08:53:01] Uncaught RangeError: Incorrect locale information provided again [08:53:14] i know why [08:53:22] I don't :) [08:53:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:43] no [08:54:15] RhinosF1: can you clarify that please? [08:54:17] please let us know if this is a real error (and if you understand it) or [08:54:20] if we should revert. [08:54:26] +1 to that :) [08:54:52] urbanecm: i thought it might be because the link andy gave was for enwp but I can make the error on meta too [08:54:57] apergos: let's revert [08:55:00] ok! [08:55:02] okay, reverting [08:55:18] (03Merged) 10jenkins-bot: Disable broken test [extensions/Kartographer] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765370 (https://phabricator.wikimedia.org/T302360) (owner: 10Gergő Tisza) [08:57:48] (03PS1) 10Urbanecm: Revert "Revert "Show message fallback keys when using &uselang=qqx"" [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765371 (https://phabricator.wikimedia.org/T302469) [08:58:11] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Revert "Show message fallback keys when using &uselang=qqx"" [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765371 (https://phabricator.wikimedia.org/T302469) (owner: 10Urbanecm) [09:00:04] dduvall and hashar: That opportune time is upon us again. Time for a MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220224T0900). [09:00:15] ^ train will run tonight [09:00:49] ok, thanks hashar [09:00:55] I'm waiting on CI for the GrowthExperiments patch again [09:01:23] !log Morning B&C window is overrunning [09:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:34] urbanecm, apergos, hashar: thanks for the help! Tell the trainees too. [09:03:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:03:43] * RhinosF1 is off to have breakfast [09:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:47] :-) [09:04:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:04:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:09] (03Merged) 10jenkins-bot: Structured task: Don't show dialog for confirming leaving suggestions mode upon rejection [extensions/GrowthExperiments] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/765356 (https://phabricator.wikimedia.org/T302463) (owner: 10MewOphaswongse) [09:09:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [09:09:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [09:09:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [09:09:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [09:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [09:09:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [09:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:10:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:06] tgr: your patch is at mwdebug1001, can you test? 
[09:11:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [09:11:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [09:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T300992)', diff saved to https://phabricator.wikimedia.org/P21421 and previous config saved to /var/cache/conftool/dbconfig/20220224-091132-ladsgroup.json [09:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:41] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [09:12:31] urbanecm: works, thanks! [09:13:42] syncing! [09:13:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T300992)', diff saved to https://phabricator.wikimedia.org/P21422 and previous config saved to /var/cache/conftool/dbconfig/20220224-091350-ladsgroup.json [09:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:35] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/extensions/GrowthExperiments/modules/ext.growthExperiments.StructuredTask/StructuredTaskArticleTarget.js: Backport: [[gerrit:765356|Structured task: Don't show dialog for confirming leaving suggestions mode upon rejection (T302463)]] (duration: 00m 50s) [09:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:40] T302463: Add an image [regression]: additional confirmation dialog shown when rejecting suggestion - https://phabricator.wikimedia.org/T302463 [09:14:42] tgr: it's live! 
[09:15:36] !log Morning B&C window is done [09:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:16:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:40] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet [09:17:42] * urbanecm stashing at mwdebug1001 [09:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:27] * urbanecm done stashing [09:23:39] RECOVERY - Check systemd state on ms-be2068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:03] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@17a70a0]: (no justification provided) [09:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:09] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet [09:24:12] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@17a70a0]: (no justification provided) (duration: 00m 08s) [09:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:51] !log filippo@cumin1001 START - 
Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet [09:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:01] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet [09:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:18] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2069.codfw.wmnet [09:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P21423 and previous config saved to /var/cache/conftool/dbconfig/20220224-092855-ladsgroup.json [09:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:19] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet [09:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:47] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2069.codfw.wmnet [09:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:09] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2067.codfw.wmnet [09:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:33] (03PS2) 10JMeybohm: Add *.k8s-staging.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740) [09:40:12] (03PS1) 10David Caro: wmcs-cinder-backup-manager: end with error if any backup failed [puppet] - 10https://gerrit.wikimedia.org/r/765475 (https://phabricator.wikimedia.org/T302299) [09:42:47] (03PS1) 10Filippo Giunchedi: WIP add blackbox-exporter filter config [puppet] - 10https://gerrit.wikimedia.org/r/765476 [09:44:00] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P21424 and previous config saved to /var/cache/conftool/dbconfig/20220224-094400-ladsgroup.json [09:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:46] (03CR) 10jerkins-bot: [V: 04-1] WIP add blackbox-exporter filter config [puppet] - 10https://gerrit.wikimedia.org/r/765476 (owner: 10Filippo Giunchedi) [09:47:02] (03CR) 10Filippo Giunchedi: "@Cole I've been playing with this a little and I feel like I'm missing something obvious:" [puppet] - 10https://gerrit.wikimedia.org/r/765476 (owner: 10Filippo Giunchedi) [09:47:07] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 48.66 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:49:35] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:58:14] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@d28cd92]: Fix aqs/hourly in production by adding memory to driver [09:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:23] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@d28cd92]: Fix aqs/hourly in production by adding memory to driver (duration: 00m 09s) [09:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:46] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@d28cd92]: Fix aqs/hourly in production by adding memory to driver [09:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:52] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@d28cd92]: Fix aqs/hourly in production by adding memory to driver (duration: 00m 06s) [09:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T300992)', diff saved to https://phabricator.wikimedia.org/P21425 and previous config saved to /var/cache/conftool/dbconfig/20220224-095904-ladsgroup.json [09:59:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:59:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:11] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [09:59:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T300992)', diff saved to https://phabricator.wikimedia.org/P21426 
and previous config saved to /var/cache/conftool/dbconfig/20220224-095912-ladsgroup.json [09:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T300992)', diff saved to https://phabricator.wikimedia.org/P21427 and previous config saved to /var/cache/conftool/dbconfig/20220224-100147-ladsgroup.json [10:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:27] (03PS1) 10Muehlenhoff: Prometheus/main: Add apache2 to profile::lvs::realserver::pools [puppet] - 10https://gerrit.wikimedia.org/r/765479 [10:02:57] !log restarting apache on edge prometheus nodes to pickup expat update [10:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:03] (03CR) 10Jbond: ci: Qemu image and snapshot creation (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [10:07:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/765479 (owner: 10Muehlenhoff) [10:13:30] !log depool cp4028.ulsfo.wmnet - T302301 [10:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:36] T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 [10:14:07] (03PS2) 10Ayounsi: Export drmrs mgmt and private prefixes over BGP [homer/public] - 10https://gerrit.wikimedia.org/r/765240 [10:15:01] !log deploying schema change to s1 T300774 [10:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:07] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [10:15:16] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 
6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:15:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:53] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:15:54] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:59] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T300774)', diff saved to https://phabricator.wikimedia.org/P21428 and previous config saved to /var/cache/conftool/dbconfig/20220224-101559-kormat.json [10:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:14] (03CR) 10Ayounsi: Export drmrs mgmt and private prefixes over BGP (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/765240 (owner: 10Ayounsi) [10:16:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P21429 and previous config saved to /var/cache/conftool/dbconfig/20220224-101651-ladsgroup.json [10:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/732978 (https://phabricator.wikimedia.org/T257040) (owner: 10Hashar) [10:18:42] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300774)', diff saved to https://phabricator.wikimedia.org/P21430 and previous config saved to 
/var/cache/conftool/dbconfig/20220224-101841-kormat.json [10:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM but the PCC matcher says no (?)" [puppet] - 10https://gerrit.wikimedia.org/r/765479 (owner: 10Muehlenhoff) [10:21:36] (03CR) 10MMandere: [V: 03+1 C: 03+2] varnish: change the default archive component for varnish [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301) (owner: 10MMandere) [10:21:52] (03PS3) 10MMandere: varnish: change the default archive component for varnish [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301) [10:22:41] (03CR) 10MMandere: [V: 03+2 C: 03+2] varnish: change the default archive component for varnish [puppet] - 10https://gerrit.wikimedia.org/r/765200 (https://phabricator.wikimedia.org/T302301) (owner: 10MMandere) [10:28:22] 10Puppet, 10SRE, 10Infrastructure-Foundations: Where to Put puppetlabs Core Mudules - https://phabricator.wikimedia.org/T302481 (10jbond) [10:28:25] (03PS2) 10Muehlenhoff: sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - 10https://gerrit.wikimedia.org/r/765309 [10:28:30] 10Puppet, 10SRE, 10Infrastructure-Foundations: Where to Put puppetlabs Core Mudules - https://phabricator.wikimedia.org/T302481 (10jbond) p:05Triage→03Low [10:29:00] (03CR) 10Ayounsi: [C: 03+1] "I don't know the full logic of this puppet code, but I checked the interfaces/IP/vlan and they lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/765311 (https://phabricator.wikimedia.org/T301419) (owner: 10BBlack) [10:29:04] 10Puppet, 10SRE, 10Infrastructure-Foundations: Where to Put puppetlabs Core Mudules - https://phabricator.wikimedia.org/T302481 (10jbond) As all the packages we need have already been packaged by debian, my view is we just go with the debian packages and close this ticket down. 
[10:30:21] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jbond) >>! In T302423#7733445, @jbond wrote: >> in comparison to say the cron module which is still shipped by Puppet as part of their agent package. How the cron module should be packaged in... [10:30:46] (03PS1) 10Filippo Giunchedi: WIP: new module alertmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 [10:31:08] (03CR) 10jerkins-bot: [V: 04-1] sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - 10https://gerrit.wikimedia.org/r/765309 (owner: 10Muehlenhoff) [10:31:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P21431 and previous config saved to /var/cache/conftool/dbconfig/20220224-103156-ladsgroup.json [10:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:46] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P21432 and previous config saved to /var/cache/conftool/dbconfig/20220224-103346-kormat.json [10:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:21] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2021/2022-Q3): Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10fgiunchedi) >>! In T293209#7698301, @Volans wrote: > Today @jbond and I joined the office hours of #sre_observabi... 
[10:36:45] (03CR) 10jerkins-bot: [V: 04-1] WIP: new module alertmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (owner: 10Filippo Giunchedi) [10:39:20] (03PS2) 10Filippo Giunchedi: WIP: new module alertmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 [10:39:27] (03PS3) 10Cathal Mooney: Adding more new LEAF switches from Eqiad rows E/F to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/764791 (https://phabricator.wikimedia.org/T299758) [10:40:14] (03CR) 10Muehlenhoff: Prometheus/main: Add apache2 to profile::lvs::realserver::pools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765479 (owner: 10Muehlenhoff) [10:41:27] (03PS3) 10Muehlenhoff: sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - 10https://gerrit.wikimedia.org/r/765309 [10:41:47] (03CR) 10Muehlenhoff: [C: 03+2] Prometheus/main: Add apache2 to profile::lvs::realserver::pools [puppet] - 10https://gerrit.wikimedia.org/r/765479 (owner: 10Muehlenhoff) [10:47:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T300992)', diff saved to https://phabricator.wikimedia.org/P21433 and previous config saved to /var/cache/conftool/dbconfig/20220224-104700-ladsgroup.json [10:47:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:47:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:07] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [10:47:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T300992)', diff saved to https://phabricator.wikimedia.org/P21434 and previous config saved to /var/cache/conftool/dbconfig/20220224-104708-ladsgroup.json [10:47:11] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:51] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P21435 and previous config saved to /var/cache/conftool/dbconfig/20220224-104851-kormat.json [10:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:10] (03PS1) 10Majavah: P:wmcs::prometheus: drop tools-redis jobs from cloudmetrics* [puppet] - 10https://gerrit.wikimedia.org/r/765483 [10:49:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T300992)', diff saved to https://phabricator.wikimedia.org/P21436 and previous config saved to /var/cache/conftool/dbconfig/20220224-104925-ladsgroup.json [10:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:32] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 40 probes of 750 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:49:57] !log enable-puppet on cp instances after finishing successfully testing varnish package component change - T302301 [10:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:03] T302301: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 [10:50:12] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33969/console" [puppet] - 10https://gerrit.wikimedia.org/r/765483 (owner: 10Majavah) [10:52:57] !log restarting apache on main prometheus nodes to pickup expat update [10:53:00] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:43] (03PS1) 10Majavah: P:wmcs::prometheus: update rabbitmq metrics port [puppet] - 10https://gerrit.wikimedia.org/r/765484 (https://phabricator.wikimedia.org/T300308) [10:54:23] (03PS1) 10Phuedx: Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) [10:54:25] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@97759bf]: Set aqs/hourly start date [10:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:32] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@97759bf]: Set aqs/hourly start date (duration: 00m 06s) [10:54:33] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33971/console" [puppet] - 10https://gerrit.wikimedia.org/r/765484 (https://phabricator.wikimedia.org/T300308) (owner: 10Majavah) [10:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:16] 10Puppet, 10SRE, 10Infrastructure-Foundations: Where to Put puppetlabs Core Modules - https://phabricator.wikimedia.org/T302481 (10Aklapper) [11:00:05] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220224T1100). [11:00:40] (03PS2) 10Phuedx: Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) [11:01:13] (03CR) 10Hnowlan: [C: 03+2] "This is awesome, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/765317 (owner: 10Majavah) [11:03:34] !log restarting apache/carbon-cache on graphite nodes to pickup expat update [11:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:56] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300774)', diff saved to https://phabricator.wikimedia.org/P21437 and previous config saved to /var/cache/conftool/dbconfig/20220224-110355-kormat.json [11:03:57] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [11:03:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [11:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:01] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [11:04:03] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T300774)', diff saved to https://phabricator.wikimedia.org/P21438 and previous config saved to /var/cache/conftool/dbconfig/20220224-110403-kormat.json [11:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P21439 and previous config saved to /var/cache/conftool/dbconfig/20220224-110430-ladsgroup.json [11:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:45] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T300774)', diff saved to https://phabricator.wikimedia.org/P21440 and previous config saved to 
/var/cache/conftool/dbconfig/20220224-110645-kormat.json [11:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:10] !log Updated Jenkins job operations-puppet-tests-buster-docker https://gerrit.wikimedia.org/r/c/integration/config/+/765487 [11:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:48] (03CR) 10Ayounsi: [C: 03+1] Adding more new LEAF switches from Eqiad rows E/F to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/764791 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [11:10:36] 10Puppet, 10SRE, 10Infrastructure-Foundations: Where to put puppetlabs Core Modules - https://phabricator.wikimedia.org/T302481 (10RhinosF1) [11:11:50] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:21] (03CR) 10Ayounsi: Bird: disable multihop when peer is the default route (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764720 (owner: 10Ayounsi) [11:14:01] (03PS1) 10Ladsgroup: db2079: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765489 (https://phabricator.wikimedia.org/T302363) [11:14:48] (03PS2) 10Ladsgroup: db2079: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765489 (https://phabricator.wikimedia.org/T302185) [11:15:22] Planning to deploy cxserver. @mvolz are you deploying (Citoid) right now? [11:16:15] (03PS3) 10Ladsgroup: db2079: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765489 (https://phabricator.wikimedia.org/T302185) [11:17:11] OK. I will go ahead. 
[11:17:13] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db2079: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765489 (https://phabricator.wikimedia.org/T302185) (owner: 10Ladsgroup) [11:17:39] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-02-24-035645-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/765361 (https://phabricator.wikimedia.org/T301443) (owner: 10KartikMistry) [11:19:10] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P21441 and previous config saved to /var/cache/conftool/dbconfig/20220224-111935-ladsgroup.json [11:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:22] (03Merged) 10jenkins-bot: Update cxserver to 2022-02-24-035645-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/765361 (https://phabricator.wikimedia.org/T301443) (owner: 10KartikMistry) [11:21:46] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 7 probes of 750 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:21:50] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P21442 and previous config saved to /var/cache/conftool/dbconfig/20220224-112149-kormat.json [11:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:35] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:40] !log rolling restart of thanos 
frontend swift-proxy/apache to pick up expat security updates [11:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:08] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:56] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:34] 10SRE, 10LDAP-Access-Requests: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (10Ammarpad) Hi @Dzahn, I have sent. [11:28:23] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: introduce component bullseye-wikimedia/thirdparty/openstack-db [puppet] - 10https://gerrit.wikimedia.org/r/765493 (https://phabricator.wikimedia.org/T302482) [11:28:30] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:48] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:26] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: introduce component bullseye-wikimedia/thirdparty/openstack-db [puppet] - 10https://gerrit.wikimedia.org/r/765493 (https://phabricator.wikimedia.org/T302482) [11:32:55] (03CR) 10Volans: "Thanks for approaching this! 
The general structure looks good, I've left a bunch of comments inline, feel free to ping me on IRC if you ha" [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (owner: 10Filippo Giunchedi) [11:34:04] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T300992)', diff saved to https://phabricator.wikimedia.org/P21443 and previous config saved to /var/cache/conftool/dbconfig/20220224-113439-ladsgroup.json [11:34:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [11:34:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [11:34:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:46] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [11:34:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T300992)', diff saved to https://phabricator.wikimedia.org/P21444 and previous config saved to /var/cache/conftool/dbconfig/20220224-113453-ladsgroup.json [11:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:03] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:28] !log Updated cxserver to 2022-02-24-035645-production (T301443, T301952) [11:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:35] T301952: For references with multiple templates, the calculation of overall adaptation status is wrong - https://phabricator.wikimedia.org/T301952 [11:35:36] T301443: Enable Flores for Occitan and Luganda - https://phabricator.wikimedia.org/T301443 [11:36:55] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P21445 and previous config saved to /var/cache/conftool/dbconfig/20220224-113654-kormat.json [11:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T300992)', diff saved to https://phabricator.wikimedia.org/P21446 and previous config saved to /var/cache/conftool/dbconfig/20220224-113710-ladsgroup.json [11:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/765493 (https://phabricator.wikimedia.org/T302482) (owner: 10Arturo Borrero Gonzalez) [11:38:59] (03PS1) 10Giuseppe Lavagetto: deployment_server: re-add mwbuilder to the docker group [puppet] - 10https://gerrit.wikimedia.org/r/765494 (https://phabricator.wikimedia.org/T297673) [11:39:01] (03PS1) 10Giuseppe Lavagetto: parsoid: apache2 is in the call stack for parsoid [puppet] - 10https://gerrit.wikimedia.org/r/765495 [11:39:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: introduce component bullseye-wikimedia/thirdparty/openstack-db [puppet] - 10https://gerrit.wikimedia.org/r/765493 (https://phabricator.wikimedia.org/T302482) (owner: 10Arturo Borrero Gonzalez) [11:41:05] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33972/console" [puppet] - 10https://gerrit.wikimedia.org/r/765495 (owner: 10Giuseppe Lavagetto) [11:41:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment_server: re-add mwbuilder to the docker group [puppet] - 10https://gerrit.wikimedia.org/r/765494 (https://phabricator.wikimedia.org/T297673) (owner: 10Giuseppe Lavagetto) [11:41:58] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] parsoid: apache2 is in the call stack for parsoid [puppet] - 10https://gerrit.wikimedia.org/r/765495 (owner: 10Giuseppe Lavagetto) [11:43:42] (03CR) 10David Caro: [C: 03+1] "Rabbit is listening already there yep 😊" [puppet] - 10https://gerrit.wikimedia.org/r/765484 (https://phabricator.wikimedia.org/T300308) (owner: 10Majavah) [11:49:44] (03PS2) 10Hnowlan: restbase: disable redundant jmx config [puppet] - 10https://gerrit.wikimedia.org/r/765313 (https://phabricator.wikimedia.org/T295375) [11:51:32] (03CR) 10Hnowlan: [C: 03+2] restbase: disable redundant jmx config [puppet] - 10https://gerrit.wikimedia.org/r/765313 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [11:52:03] !log kormat@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1105:3311 (T300774)', diff saved to https://phabricator.wikimedia.org/P21447 and previous config saved to /var/cache/conftool/dbconfig/20220224-115159-kormat.json [11:52:04] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [11:52:05] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [11:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:10] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [11:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P21448 and previous config saved to /var/cache/conftool/dbconfig/20220224-115215-ladsgroup.json [11:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:40] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [11:52:42] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [11:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:46] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T300774)', diff saved to https://phabricator.wikimedia.org/P21449 and previous config saved to /var/cache/conftool/dbconfig/20220224-115246-kormat.json [11:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:28] (03PS1) 10Muehlenhoff: thanos-fe: Add apache2 to profile::lvs::realserver::pools 
[puppet] - 10https://gerrit.wikimedia.org/r/765499 [11:54:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/765499 (owner: 10Muehlenhoff) [11:55:23] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T300774)', diff saved to https://phabricator.wikimedia.org/P21450 and previous config saved to /var/cache/conftool/dbconfig/20220224-115522-kormat.json [11:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:04] (03PS1) 10Kevin Bazira: ml-services: add itwiki, jawiki & kowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/765501 (https://phabricator.wikimedia.org/T301415) [12:03:54] (03PS1) 10JMeybohm: Add tlsExtraSANs config to namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/765502 (https://phabricator.wikimedia.org/T290966) [12:04:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2079.codfw.wmnet with reason: Maintenance [12:04:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2079.codfw.wmnet with reason: Maintenance [12:04:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 12 hosts with reason: Maintenance [12:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 12 hosts with reason: Maintenance [12:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:41] !log aborrero@apt1001:~$ sudo -i reprepro --component thirdparty/openstack-db update bullseye-wikimedia (T302482) [12:04:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:46] T302482: openstack db: figure out new versions for galera & mariadb - https://phabricator.wikimedia.org/T302482 [12:05:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:wmcs::prometheus: drop tools-redis jobs from cloudmetrics* [puppet] - 10https://gerrit.wikimedia.org/r/765483 (owner: 10Majavah) [12:05:33] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 61 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:05:44] I'm upgrading s8 codfw primary to bullseye, there will be a massive lag to codfw s8 altogether, let me know if it causes alerts, etc. [12:06:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2079.codfw.wmnet with OS bullseye [12:06:06] (03CR) 10jerkins-bot: [V: 04-1] Add tlsExtraSANs config to namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/765502 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:44] (03PS1) 10Giuseppe Lavagetto: Rakefile: fix loop [deployment-charts] - 10https://gerrit.wikimedia.org/r/765504 [12:07:00] (03PS2) 10Arturo Borrero Gonzalez: P:wmcs::prometheus: update rabbitmq metrics port [puppet] - 10https://gerrit.wikimedia.org/r/765484 (https://phabricator.wikimedia.org/T300308) (owner: 10Majavah) [12:07:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P21451 and previous config saved to /var/cache/conftool/dbconfig/20220224-120720-ladsgroup.json [12:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:wmcs::prometheus: update rabbitmq metrics port [puppet] - 
10https://gerrit.wikimedia.org/r/765484 (https://phabricator.wikimedia.org/T300308) (owner: 10Majavah) [12:10:27] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P21452 and previous config saved to /var/cache/conftool/dbconfig/20220224-121027-kormat.json [12:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:34] !log dbmaint on s8@codfw (T302185) [12:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:40] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [12:17:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:openstack::cumin::target: redefine Ferm $CUMIN_MASTERS [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [12:21:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2079.codfw.wmnet with reason: host reimage [12:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T300992)', diff saved to https://phabricator.wikimedia.org/P21453 and previous config saved to /var/cache/conftool/dbconfig/20220224-122224-ladsgroup.json [12:22:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:22:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:31] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [12:22:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T300992)', diff saved to https://phabricator.wikimedia.org/P21454 and previous config saved to 
/var/cache/conftool/dbconfig/20220224-122232-ladsgroup.json [12:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2079.codfw.wmnet with reason: host reimage [12:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T300992)', diff saved to https://phabricator.wikimedia.org/P21455 and previous config saved to /var/cache/conftool/dbconfig/20220224-122519-ladsgroup.json [12:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:32] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P21456 and previous config saved to /var/cache/conftool/dbconfig/20220224-122532-kormat.json [12:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:46] kart_: was afk, yes it was fine :) [12:26:57] (03PS2) 10Giuseppe Lavagetto: Rakefile: fix bad asset handling for helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/765504 [12:26:59] (03PS2) 10Giuseppe Lavagetto: Add tlsExtraSANs config to namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/765502 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:29:02] (03PS1) 10Hnowlan: restbase: change endpoint for deployment-prep to new host [puppet] - 10https://gerrit.wikimedia.org/r/765532 (https://phabricator.wikimedia.org/T295375) [12:29:04] (03CR) 10jerkins-bot: [V: 04-1] Add tlsExtraSANs config to namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/765502 (https://phabricator.wikimedia.org/T290966) (owner: 
10JMeybohm) [12:30:59] (03PS2) 10Majavah: remove clush modules, profiles and roles [puppet] - 10https://gerrit.wikimedia.org/r/762829 (https://phabricator.wikimedia.org/T298191) [12:31:25] 10SRE, 10Traffic: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 (10MMandere) 05Open→03Resolved We now have all Varnish6 and its dependencies moved to the main component for the buster-wikimedia distribution. In the process, all Varnish5 packages for buster-wikimedia distr... [12:35:30] (03PS1) 10Jbond: (WIP) wmflib: new firmware fact [puppet] - 10https://gerrit.wikimedia.org/r/765534 [12:36:15] (03CR) 10jerkins-bot: [V: 04-1] (WIP) wmflib: new firmware fact [puppet] - 10https://gerrit.wikimedia.org/r/765534 (owner: 10Jbond) [12:39:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2079.codfw.wmnet with OS bullseye [12:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] remove clush modules, profiles and roles [puppet] - 10https://gerrit.wikimedia.org/r/762829 (https://phabricator.wikimedia.org/T298191) (owner: 10Majavah) [12:40:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P21457 and previous config saved to /var/cache/conftool/dbconfig/20220224-124024-ladsgroup.json [12:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:37] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T300774)', diff saved to https://phabricator.wikimedia.org/P21458 and previous config saved to /var/cache/conftool/dbconfig/20220224-124036-kormat.json [12:40:38] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:40:40] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1140.eqiad.wmnet with reason: Maintenance [12:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:43] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [12:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:16] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [12:41:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [12:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:23] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T300774)', diff saved to https://phabricator.wikimedia.org/P21459 and previous config saved to /var/cache/conftool/dbconfig/20220224-124122-kormat.json [12:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:01] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300774)', diff saved to https://phabricator.wikimedia.org/P21460 and previous config saved to /var/cache/conftool/dbconfig/20220224-124401-kormat.json [12:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:12] mvolz: :) [12:51:34] (03PS1) 10Arturo Borrero Gonzalez: galera: install packages from our custom component [puppet] - 10https://gerrit.wikimedia.org/r/765536 (https://phabricator.wikimedia.org/T302482) [12:52:07] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:52:24] (03CR) 10jerkins-bot: [V: 04-1] galera: install packages from our custom component [puppet] - 
10https://gerrit.wikimedia.org/r/765536 (https://phabricator.wikimedia.org/T302482) (owner: 10Arturo Borrero Gonzalez) [12:54:41] (03CR) 10Vivian Rook: [C: 03+1] wmcs-cinder-backup-manager: end with error if any backup failed [puppet] - 10https://gerrit.wikimedia.org/r/765475 (https://phabricator.wikimedia.org/T302299) (owner: 10David Caro) [12:54:46] (03CR) 10Ayounsi: [C: 03+2] Bird: disable multihop when peer is the default route [puppet] - 10https://gerrit.wikimedia.org/r/764720 (owner: 10Ayounsi) [12:54:49] (03PS2) 10Arturo Borrero Gonzalez: galera: install packages from our custom component [puppet] - 10https://gerrit.wikimedia.org/r/765536 (https://phabricator.wikimedia.org/T302482) [12:55:17] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:55:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P21461 and previous config saved to /var/cache/conftool/dbconfig/20220224-125528-ladsgroup.json [12:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:33] (03CR) 10Filippo Giunchedi: [C: 03+1] thanos-fe: Add apache2 to profile::lvs::realserver::pools [puppet] - 10https://gerrit.wikimedia.org/r/765499 (owner: 10Muehlenhoff) [12:58:30] (03CR) 10Muehlenhoff: "I think it's better if you use priority=>1002 here: It seems like you need the specific version of Mariadb from the component and Mariadb " [puppet] - 10https://gerrit.wikimedia.org/r/765536 (https://phabricator.wikimedia.org/T302482) (owner: 10Arturo Borrero Gonzalez) [12:58:41] (03PS1) 10Ayounsi: Revert "Bird: disable multihop when peer is the default route" [puppet] - 10https://gerrit.wikimedia.org/r/765379 [12:59:06] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P21462 and previous 
config saved to /var/cache/conftool/dbconfig/20220224-125905-kormat.json [12:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:23] (03PS3) 10Arturo Borrero Gonzalez: galera: install packages from our custom component [puppet] - 10https://gerrit.wikimedia.org/r/765536 (https://phabricator.wikimedia.org/T302482) [12:59:57] PROBLEM - puppet last run on thanos-be1003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:00:06] (03CR) 10jerkins-bot: [V: 04-1] Revert "Bird: disable multihop when peer is the default route" [puppet] - 10https://gerrit.wikimedia.org/r/765379 (owner: 10Ayounsi) [13:00:37] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:48] (03PS2) 10Ayounsi: Revert "Bird: disable multihop when peer is the default route" [puppet] - 10https://gerrit.wikimedia.org/r/765379 [13:02:01] XioNoX: I have a fix, no need to revert [13:02:20] taavi: thanks! 
I'll hold on the revert [13:02:38] !log restarting apache/uwsgi-puppetboard on puppetboard* to pick up expat security updates [13:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:04] (03PS1) 10Majavah: P:bird::anycast: use a new variable instead of overwriting $multihop [puppet] - 10https://gerrit.wikimedia.org/r/765538 [13:04:36] (03PS1) 10Ladsgroup: Revert "db2079: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765380 [13:04:43] XioNoX: https://gerrit.wikimedia.org/r/c/operations/puppet/+/765379/ [13:04:50] looking [13:04:54] (03PS2) 10Ladsgroup: Revert "db2079: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765380 [13:04:55] wrong link [13:05:00] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db2079: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765380 (owner: 10Ladsgroup) [13:05:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/765538/ [13:05:37] (03PS4) 10Arturo Borrero Gonzalez: galera: install packages from our custom component [puppet] - 10https://gerrit.wikimedia.org/r/765536 (https://phabricator.wikimedia.org/T302482) [13:05:50] taavi: thanks! 
[13:05:59] (03CR) 10Ayounsi: [C: 03+1] P:bird::anycast: use a new variable instead of overwriting $multihop [puppet] - 10https://gerrit.wikimedia.org/r/765538 (owner: 10Majavah) [13:06:24] (03CR) 10jerkins-bot: [V: 04-1] galera: install packages from our custom component [puppet] - 10https://gerrit.wikimedia.org/r/765536 (https://phabricator.wikimedia.org/T302482) (owner: 10Arturo Borrero Gonzalez) [13:06:34] (03CR) 10Ayounsi: [C: 03+2] P:bird::anycast: use a new variable instead of overwriting $multihop [puppet] - 10https://gerrit.wikimedia.org/r/765538 (owner: 10Majavah) [13:09:57] (03Abandoned) 10Ayounsi: Revert "Bird: disable multihop when peer is the default route" [puppet] - 10https://gerrit.wikimedia.org/r/765379 (owner: 10Ayounsi) [13:10:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T300992)', diff saved to https://phabricator.wikimedia.org/P21463 and previous config saved to /var/cache/conftool/dbconfig/20220224-131033-ladsgroup.json [13:10:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:10:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:40] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [13:10:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T300992)', diff saved to https://phabricator.wikimedia.org/P21464 and previous config saved to /var/cache/conftool/dbconfig/20220224-131041-ladsgroup.json [13:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:46] (03PS1) 10Ladsgroup: db2121: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765539 (https://phabricator.wikimedia.org/T302363) [13:10:49] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:05] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db2121: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/765539 (https://phabricator.wikimedia.org/T302363) (owner: 10Ladsgroup) [13:12:21] (03CR) 10Ayounsi: [C: 03+2] Export drmrs mgmt and private prefixes over BGP [homer/public] - 10https://gerrit.wikimedia.org/r/765240 (owner: 10Ayounsi) [13:12:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T300992)', diff saved to https://phabricator.wikimedia.org/P21465 and previous config saved to /var/cache/conftool/dbconfig/20220224-131257-ladsgroup.json [13:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:15] (03Merged) 10jenkins-bot: Export drmrs mgmt and private prefixes over BGP [homer/public] - 10https://gerrit.wikimedia.org/r/765240 (owner: 10Ayounsi) [13:14:11] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P21466 and previous config saved to /var/cache/conftool/dbconfig/20220224-131410-kormat.json [13:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:45] (03PS5) 10Arturo Borrero Gonzalez: galera: install packages from our custom component [puppet] - 10https://gerrit.wikimedia.org/r/765536 (https://phabricator.wikimedia.org/T302482) [13:23:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] galera: install packages from our custom component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765536 (https://phabricator.wikimedia.org/T302482) (owner: 10Arturo Borrero Gonzalez) [13:23:57] !log dbmaint on s7@codfw (T302363) [13:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:04] T302363: Upgrade s7 to bullseye - https://phabricator.wikimedia.org/T302363 [13:24:19] 
!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [13:24:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [13:24:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 10 hosts with reason: Maintenance [13:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 10 hosts with reason: Maintenance [13:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2121.codfw.wmnet with OS bullseye [13:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P21467 and previous config saved to /var/cache/conftool/dbconfig/20220224-132802-ladsgroup.json [13:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:14] (03PS1) 10Arturo Borrero Gonzalez: galera: fix typo in priority [puppet] - 10https://gerrit.wikimedia.org/r/765540 (https://phabricator.wikimedia.org/T302482) [13:29:15] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300774)', diff saved to https://phabricator.wikimedia.org/P21468 and previous config saved to /var/cache/conftool/dbconfig/20220224-132915-kormat.json [13:29:17] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:29:18] !log 
kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:22] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:29:23] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T300774)', diff saved to https://phabricator.wikimedia.org/P21469 and previous config saved to /var/cache/conftool/dbconfig/20220224-132923-kormat.json [13:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:03] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300774)', diff saved to https://phabricator.wikimedia.org/P21470 and previous config saved to /var/cache/conftool/dbconfig/20220224-133202-kormat.json [13:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:13] (03CR) 10Muehlenhoff: [C: 03+2] thanos-fe: Add apache2 to profile::lvs::realserver::pools [puppet] - 10https://gerrit.wikimedia.org/r/765499 (owner: 10Muehlenhoff) [13:36:27] (03CR) 10Klausman: [C: 03+1] ml-services: add itwiki, jawiki & kowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/765501 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [13:37:00] (03PS2) 10Arturo Borrero Gonzalez: galera: type fixes [puppet] - 10https://gerrit.wikimedia.org/r/765540 (https://phabricator.wikimedia.org/T302482) [13:38:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] galera: type fixes [puppet] - 10https://gerrit.wikimedia.org/r/765540 (https://phabricator.wikimedia.org/T302482) (owner: 10Arturo Borrero Gonzalez) [13:39:05] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? 
- https://phabricator.wikimedia.org/T302423 (10CDanis) Have you also considered [[ https://www.atlassian.com/git/tutorials/git-subtree | git subtree ]] instead of git submodules? [13:42:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2121.codfw.wmnet with reason: host reimage [13:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P21471 and previous config saved to /var/cache/conftool/dbconfig/20220224-134307-ladsgroup.json [13:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2121.codfw.wmnet with reason: host reimage [13:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:49] RECOVERY - BFD status on asw1-b12-drmrs.wikimedia.org is OK: OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:47:08] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P21472 and previous config saved to /var/cache/conftool/dbconfig/20220224-134707-kormat.json [13:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:47] (03PS2) 10Vgutierrez: Drop globalsign from the CAA records [dns] - 10https://gerrit.wikimedia.org/r/673997 (https://phabricator.wikimedia.org/T266503) [13:52:22] (03CR) 10Vgutierrez: [C: 03+2] Drop globalsign from the CAA records [dns] - 10https://gerrit.wikimedia.org/r/673997 (https://phabricator.wikimedia.org/T266503) (owner: 10Vgutierrez) [13:55:10] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:55:22] (03CR) 
10David Caro: [C: 03+2] wmcs-cinder-backup-manager: end with error if any backup failed [puppet] - 10https://gerrit.wikimedia.org/r/765475 (https://phabricator.wikimedia.org/T302299) (owner: 10David Caro) [13:58:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T300992)', diff saved to https://phabricator.wikimedia.org/P21473 and previous config saved to /var/cache/conftool/dbconfig/20220224-135811-ladsgroup.json [13:58:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [13:58:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [13:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T300992)', diff saved to https://phabricator.wikimedia.org/P21474 and previous config saved to /var/cache/conftool/dbconfig/20220224-135819-ladsgroup.json [13:58:19] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [13:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:53] (03CR) 10Elukey: "Kevin I found a possible issue with the itwiki damaging model version, not sure if I am missing something or if the version is wrong, lemm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/765501 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [13:59:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T300992)', diff saved to https://phabricator.wikimedia.org/P21475 and previous config saved to 
/var/cache/conftool/dbconfig/20220224-135955-ladsgroup.json [14:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220224T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:14] indeed, nothing to do [14:00:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2121.codfw.wmnet with OS bullseye [14:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:12] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P21476 and previous config saved to /var/cache/conftool/dbconfig/20220224-140212-kormat.json [14:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:39] (03CR) 10Ayounsi: [C: 03+2] Prepend AS to anycast prefixes learned on the core routers [homer/public] - 10https://gerrit.wikimedia.org/r/765268 (https://phabricator.wikimedia.org/T302315) (owner: 10Ayounsi) [14:08:12] (03Merged) 10jenkins-bot: Prepend AS to anycast prefixes learned on the core routers [homer/public] - 10https://gerrit.wikimedia.org/r/765268 (https://phabricator.wikimedia.org/T302315) (owner: 10Ayounsi) [14:13:18] (03PS1) 10Ayounsi: as-path-expand: formating fix [homer/public] - 10https://gerrit.wikimedia.org/r/765543 [14:15:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P21477 and previous config saved to /var/cache/conftool/dbconfig/20220224-141501-ladsgroup.json [14:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:38] (03PS3) 10JMeybohm: Add tlsExtraSANs config to namespaces 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/765502 (https://phabricator.wikimedia.org/T290966) [14:16:37] (03CR) 10Ayounsi: [C: 03+2] as-path-expand: formating fix [homer/public] - 10https://gerrit.wikimedia.org/r/765543 (owner: 10Ayounsi) [14:16:55] (03PS1) 10Ssingh: bird: set interface for link-local BGP IPv6 sessions [puppet] - 10https://gerrit.wikimedia.org/r/765544 [14:17:17] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300774)', diff saved to https://phabricator.wikimedia.org/P21478 and previous config saved to /var/cache/conftool/dbconfig/20220224-141717-kormat.json [14:17:18] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:17:20] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:23] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [14:17:25] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T300774)', diff saved to https://phabricator.wikimedia.org/P21479 and previous config saved to /var/cache/conftool/dbconfig/20220224-141724-kormat.json [14:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:47] (03Merged) 10jenkins-bot: as-path-expand: formating fix [homer/public] - 10https://gerrit.wikimedia.org/r/765543 (owner: 10Ayounsi) [14:18:44] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33977/console" [puppet] - 10https://gerrit.wikimedia.org/r/765544 (owner: 10Ssingh) [14:19:49] !log Prepend AS to anycast prefixes learned on the core 
routers - T302315 [14:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:55] T302315: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 [14:19:56] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Jdforrester-WMF) [14:20:05] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300774)', diff saved to https://phabricator.wikimedia.org/P21480 and previous config saved to /var/cache/conftool/dbconfig/20220224-142004-kormat.json [14:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:16] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Jdforrester-WMF) [14:22:36] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Jdforrester-WMF) [14:24:10] (03PS1) 10Ladsgroup: Revert "db2121: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765381 [14:24:27] (03PS2) 10Ladsgroup: Revert "db2121: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765381 [14:24:32] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db2121: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/765381 (owner: 10Ladsgroup) [14:28:29] (03CR) 10Ssingh: [V: 03+1 C: 04-1] "Marking as -1 as this is not complete yet; working on the fix. NOOP on the doh1002 and dns1002 is good and expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/765544 (owner: 10Ssingh) [14:30:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P21481 and previous config saved to /var/cache/conftool/dbconfig/20220224-143005-ladsgroup.json [14:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:26] (03PS3) 10Filippo Giunchedi: WIP: new module alertmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 [14:30:28] (03CR) 10Filippo Giunchedi: WIP: new module alertmanager (0313 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (owner: 10Filippo Giunchedi) [14:35:11] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P21482 and previous config saved to /var/cache/conftool/dbconfig/20220224-143509-kormat.json [14:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T300992)', diff saved to https://phabricator.wikimedia.org/P21483 and previous config saved to /var/cache/conftool/dbconfig/20220224-144511-ladsgroup.json [14:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:19] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [14:49:14] (03PS1) 10Kormat: Drop support for stretch. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/765547 [14:49:51] (03PS2) 10Filippo Giunchedi: WIP add blackbox-exporter filter config [puppet] - 10https://gerrit.wikimedia.org/r/765476 [14:49:53] (03PS1) 10Filippo Giunchedi: prometheus: ditch automatic icmp probes for service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/765548 (https://phabricator.wikimedia.org/T291946) [14:50:16] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P21484 and previous config saved to /var/cache/conftool/dbconfig/20220224-145015-kormat.json [14:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:36] (03PS2) 10Ssingh: bird: set interface for link-local BGP IPv6 sessions [puppet] - 10https://gerrit.wikimedia.org/r/765544 [14:51:21] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33978/console" [puppet] - 10https://gerrit.wikimedia.org/r/765544 (owner: 10Ssingh) [14:51:56] (03PS1) 10Ayounsi: Update local_anycast to reflect the anycast prepending [homer/public] - 10https://gerrit.wikimedia.org/r/765549 (https://phabricator.wikimedia.org/T302315) [14:52:10] (03CR) 10jerkins-bot: [V: 04-1] WIP add blackbox-exporter filter config [puppet] - 10https://gerrit.wikimedia.org/r/765476 (owner: 10Filippo Giunchedi) [14:52:40] (03CR) 10Eevans: [C: 03+1] restbase: change endpoint for deployment-prep to new host [puppet] - 10https://gerrit.wikimedia.org/r/765532 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [14:52:53] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:53:01] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/765549 (https://phabricator.wikimedia.org/T302315) (owner: 10Ayounsi) [14:53:21] (03CR) 
10Ayounsi: [C: 03+2] Update local_anycast to reflect the anycast prepending [homer/public] - 10https://gerrit.wikimedia.org/r/765549 (https://phabricator.wikimedia.org/T302315) (owner: 10Ayounsi) [14:53:53] (03Merged) 10jenkins-bot: Update local_anycast to reflect the anycast prepending [homer/public] - 10https://gerrit.wikimedia.org/r/765549 (https://phabricator.wikimedia.org/T302315) (owner: 10Ayounsi) [14:54:31] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33979/console" [puppet] - 10https://gerrit.wikimedia.org/r/765548 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:56:54] (03CR) 10Ssingh: [V: 03+1] "PCC looks OK! NOOP on doh1002 and dns1002, and change on doh6001:" [puppet] - 10https://gerrit.wikimedia.org/r/765544 (owner: 10Ssingh) [14:57:55] (03CR) 10Herron: [C: 03+1] "STGM" [puppet] - 10https://gerrit.wikimedia.org/r/765548 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:58:01] 10SRE, 10Traffic, 10WMF-Legal, 10Performance-Team (Radar), 10Privacy: Consider disabling Chrome Lite pages for Wikipedia on Chrome on mobile with Cache-Control: no-transform - https://phabricator.wikimedia.org/T218618 (10Peter) 05Declined→03Open Argh I was confused. Chrome Lite (=you turn it on in Ch... [15:03:05] (03CR) 10Kormat: [C: 03+2] Drop support for stretch. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/765547 (owner: 10Kormat) [15:04:32] (03Merged) 10jenkins-bot: Drop support for stretch. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/765547 (owner: 10Kormat) [15:05:20] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300774)', diff saved to https://phabricator.wikimedia.org/P21486 and previous config saved to /var/cache/conftool/dbconfig/20220224-150520-kormat.json [15:05:21] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [15:05:23] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [15:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:27] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:05:28] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T300774)', diff saved to https://phabricator.wikimedia.org/P21487 and previous config saved to /var/cache/conftool/dbconfig/20220224-150527-kormat.json [15:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:01] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10fgiunchedi) Thank you @cmooney and @BBlack for the explanations and for digging int... 
[15:09:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1032.eqiad.wmnet with OS buster [15:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin... [15:10:07] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T300774)', diff saved to https://phabricator.wikimedia.org/P21488 and previous config saved to /var/cache/conftool/dbconfig/20220224-151007-kormat.json [15:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:02] (03CR) 10Elukey: [C: 03+1] Add tlsExtraSANs config to namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/765502 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:12:22] (03CR) 10Hashar: [C: 03+1] ci: Qemu image and snapshot creation (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [15:12:23] 10SRE, 10ops-eqiad, 10DC-Ops: cloudvirt1017.mgmt/SSH - https://phabricator.wikimedia.org/T302016 (10Cmjohnson) 05Open→03Resolved it was the cable, resolving [15:12:36] (03PS19) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [15:16:48] (03PS2) 10Jbond: wmflib: new firmware facts [puppet] - 10https://gerrit.wikimedia.org/r/765534 [15:22:06] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T302169 (10Cmjohnson) Disk ordered through Dell You have successfully submitted request SR1085442886. 
[15:22:53] hashar: I can merge that CR now if you want [15:23:37] jbond: if that looks good to you, sure! [15:23:53] I haven't tested that last iteration but it should be fine [15:24:02] (03CR) 10Jbond: [C: 03+2] ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [15:24:04] (03CR) 10BBlack: [C: 03+1] "LGTM too, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/765548 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:24:28] hashar: merged [15:24:50] :)) [15:25:03] Will retest it later today or tomorrow [15:25:12] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P21489 and previous config saved to /var/cache/conftool/dbconfig/20220224-152512-kormat.json [15:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:27] One thing is sure: thank you for catching that the sha sum was never checked! [15:31:53] 10Puppet, 10SRE, 10Infrastructure-Foundations: Where to put puppetlabs Core Modules - https://phabricator.wikimedia.org/T302481 (10jhathaway) >>! In T302481#7734582, @jbond wrote: > As all the packages we need have already been packaged by debian, my view is we just go with the debian packages and close this... 
[15:34:47] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [15:34:49] 10Puppet, 10SRE, 10Infrastructure-Foundations: Where to put puppetlabs Core Modules - https://phabricator.wikimedia.org/T302481 (10jbond) 05Open→03Resolved Ack, I'll resolve this in that case. The task is still around, so anyone who wants to object can re-open it. [15:35:00] (03CR) 10JMeybohm: [C: 03+2] Rakefile: fix bad asset handling for helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/765504 (owner: 10Giuseppe Lavagetto) [15:35:04] (03CR) 10JMeybohm: [C: 03+2] Add tlsExtraSANs config to namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/765502 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:36:27] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1032.eqiad.wmnet with OS buster [15:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001... 
[15:36:37] (03Abandoned) 10Muehlenhoff: Prometheus: Add conftool::scripts [puppet] - 10https://gerrit.wikimedia.org/r/763201 (owner: 10Muehlenhoff) [15:37:12] (03PS3) 10Jbond: wmflib: new firmware facts [puppet] - 10https://gerrit.wikimedia.org/r/765534 [15:37:22] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:03] (03CR) 10Volans: "I skipped the tests, some comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/765534 (owner: 10Jbond) [15:38:39] (03Merged) 10jenkins-bot: Rakefile: fix bad asset handling for helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/765504 (owner: 10Giuseppe Lavagetto) [15:38:41] (03Merged) 10jenkins-bot: Add tlsExtraSANs config to namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/765502 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:39:43] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:17] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P21490 and previous config saved to /var/cache/conftool/dbconfig/20220224-154016-kormat.json [15:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:09] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:46] !log restarting apache on people.w.o, planet.w.o, releases* to pick up expat update [15:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:20] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:24] (03CR) 10Ayounsi: [C: 03+1] "Change LGTM, and PCC makes it safer to merge!" [puppet] - 10https://gerrit.wikimedia.org/r/765544 (owner: 10Ssingh) [15:47:47] !log restarting apache on otrs1001/ticket.wikimedia.org [15:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:19] (03PS1) 10Majavah: Hook up cloudmetrics prometheus to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/765561 (https://phabricator.wikimedia.org/T302493) [15:52:09] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:19] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:50] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33980/console" [puppet] - 10https://gerrit.wikimedia.org/r/765561 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [15:54:48] (03PS1) 10Muehlenhoff: Add Cumin alias to match core-test role [puppet] - 10https://gerrit.wikimedia.org/r/765562 [15:54:51] 10SRE, 10vm-requests: New VMs for ML staging cluster in eqiad - https://phabricator.wikimedia.org/T302503 (10klausman) [15:54:56] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:05] (03PS2) 10Majavah: Hook up cloudmetrics prometheus to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/765561 (https://phabricator.wikimedia.org/T302493) [15:55:11] 10SRE, 10vm-requests: New VMs for ML staging cluster in codfw - https://phabricator.wikimedia.org/T302503 (10klausman) [15:55:21] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 
(T300774)', diff saved to https://phabricator.wikimedia.org/P21491 and previous config saved to /var/cache/conftool/dbconfig/20220224-155521-kormat.json [15:55:23] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:55:24] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:27] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:06] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [15:56:07] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [15:56:08] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [15:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:19] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [15:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:57] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [15:56:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [15:57:00] !log kormat@cumin1001 START - Cookbook 
sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:08] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33981/console" [puppet] - 10https://gerrit.wikimedia.org/r/765561 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [15:57:08] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T300774)', diff saved to https://phabricator.wikimedia.org/P21492 and previous config saved to /var/cache/conftool/dbconfig/20220224-155708-kormat.json [15:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:45] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300774)', diff saved to https://phabricator.wikimedia.org/P21493 and previous config saved to /var/cache/conftool/dbconfig/20220224-155944-kormat.json [15:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:54] (03CR) 10David Caro: [C: 03+1] "Just one nit, LGTM, though Filippo might want to review before merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/765561 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [16:00:07] (03PS4) 10Jbond: wmflib: new firmware facts [puppet] - 10https://gerrit.wikimedia.org/r/765534 [16:00:15] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/765534 (owner: 10Jbond) [16:01:13] (03PS5) 10Jbond: wmflib: new firmware facts [puppet] - 10https://gerrit.wikimedia.org/r/765534 [16:04:36] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:11] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline for external_labels but other than that looks good" [puppet] - 10https://gerrit.wikimedia.org/r/765561 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [16:08:12] 10SRE, 10vm-requests: New control plane VMs for ML staging cluster in codfw - https://phabricator.wikimedia.org/T302504 (10klausman) [16:09:55] (03CR) 10Cwhite: WIP add blackbox-exporter filter config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765476 (owner: 10Filippo Giunchedi) [16:10:48] (03PS3) 10Majavah: Hook up cloudmetrics prometheus to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/765561 (https://phabricator.wikimedia.org/T302493) [16:11:12] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/765534 (owner: 10Jbond) [16:11:48] (03CR) 10Majavah: Hook up cloudmetrics prometheus to alertmanager (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/765561 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [16:12:44] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33982/console" [puppet] - 10https://gerrit.wikimedia.org/r/765561 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [16:13:52] (03PS1) 10JMeybohm: Add static-bugzilla.wikimedia.org gatewayHost to miscweb 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/765564 (https://phabricator.wikimedia.org/T290966) [16:14:09] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:28] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:49] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P21494 and previous config saved to /var/cache/conftool/dbconfig/20220224-161449-kormat.json [16:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:21] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:37] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:21] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/765561 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [16:19:14] (03CR) 10JMeybohm: [C: 03+2] Add static-bugzilla.wikimedia.org gatewayHost to miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/765564 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [16:19:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2079.codfw.wmnet with OS bullseye [16:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2079.codfw.wmnet with OS bullseye [16:20:34] (03CR) 10Filippo Giunchedi: WIP add blackbox-exporter filter config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765476 (owner: 10Filippo Giunchedi) [16:20:56] (03CR) 10Volans: "Last minute question inline" [puppet] - 10https://gerrit.wikimedia.org/r/765534 (owner: 10Jbond) [16:21:29] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:56] (03Merged) 10jenkins-bot: Add static-bugzilla.wikimedia.org gatewayHost to miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/765564 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [16:23:16] (03CR) 10David Caro: [C: 03+2] Hook up cloudmetrics prometheus to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/765561 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [16:24:01] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:28] (03PS1) 10Giuseppe Lavagetto: deployment_server::kubernetes: add docker 
firewall profile [puppet] - 10https://gerrit.wikimedia.org/r/765565 [16:24:31] (03PS4) 10Hnowlan: maps: remove tilerator logic from planet_sync [puppet] - 10https://gerrit.wikimedia.org/r/759894 (owner: 10MSantos) [16:26:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:34] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10Cmjohnson) [16:27:11] (03CR) 10Jbond: [C: 03+2] wmflib: new firmware facts [puppet] - 10https://gerrit.wikimedia.org/r/765534 (owner: 10Jbond) [16:27:23] !log deploy new firmware fact [16:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:30] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment_server::kubernetes: add docker firewall profile [puppet] - 10https://gerrit.wikimedia.org/r/765565 (owner: 10Giuseppe Lavagetto) [16:29:54] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P21495 and previous config saved to /var/cache/conftool/dbconfig/20220224-162953-kormat.json [16:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:39] 10SRE: "Add your birthday" required for logging in to NOC - https://phabricator.wikimedia.org/T302508 (10mpopov) [16:34:09] 10SRE, 10SRE-Access-Requests: "Add your birthday" required for logging in to NOC - https://phabricator.wikimedia.org/T302508 (10Volans) [16:34:15] 10SRE, 10SRE-Access-Requests: "Add your birthday" required for logging in to NOC - https://phabricator.wikimedia.org/T302508 (10MatthewVernon) I managed to log in OK (via a verification code emailed to noc@)... 
[16:34:35] (03PS1) 10Jbond: O:puppetboard: Add graph firmware facts [puppet] - 10https://gerrit.wikimedia.org/r/765566 [16:34:40] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:08] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/765566 (owner: 10Jbond) [16:35:42] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (10RobH) Opened inbound ticket 00765408 to track down both of these shipments that arrived last week. [16:36:03] 10SRE, 10SRE-Access-Requests: "Add your birthday" required for logging in to NOC - https://phabricator.wikimedia.org/T302508 (10RhinosF1) @mpopov: where in the world are you trying to login from? Same to @MatthewVernon [16:36:43] 10SRE, 10SRE-Access-Requests: "Add your birthday" required for logging in to NOC - https://phabricator.wikimedia.org/T302508 (10mpopov) 05Open→03Resolved a:03mpopov I cancelled out and tried to log in again and this time it did not ask me for the birthday and let me proceed. WEIRD. 
[16:37:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2079.codfw.wmnet with reason: host reimage [16:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:15] (03CR) 10David Caro: [C: 03+2] r_lang: remove unused biocLite.R [puppet] - 10https://gerrit.wikimedia.org/r/764318 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:37:27] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10mark) [16:37:59] (03PS1) 10JMeybohm: Revert "miscweb: bump staging to 2022-02-11-214428-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/765382 [16:38:04] (03PS1) 10Majavah: P:wmcs::prometheus: deploy alert rule from ops/alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/765567 [16:38:13] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:33] (03PS2) 10JMeybohm: Revert "miscweb: bump staging to 2022-02-11-214428-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/765382 [16:38:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33985/console" [puppet] - 10https://gerrit.wikimedia.org/r/765566 (owner: 10Jbond) [16:39:01] (03PS2) 10Majavah: P:wmcs::prometheus: deploy alert rule from ops/alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) [16:40:07] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33986/console" [puppet] - 10https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [16:40:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2079.codfw.wmnet with reason: 
host reimage [16:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:17] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:32] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:puppetboard: Add graph firmware facts [puppet] - 10https://gerrit.wikimedia.org/r/765566 (owner: 10Jbond) [16:42:41] (03PS15) 10AGueyte: Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [16:43:18] (03CR) 10JMeybohm: [C: 03+2] Revert "miscweb: bump staging to 2022-02-11-214428-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/765382 (owner: 10JMeybohm) [16:43:25] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:57] (03PS1) 10Cathal Mooney: Change CR policy for creating aggregate Anycast routes [homer/public] - 10https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) [16:44:10] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:59] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300774)', diff saved to https://phabricator.wikimedia.org/P21496 and previous config saved to /var/cache/conftool/dbconfig/20220224-164458-kormat.json [16:45:00] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [16:45:02] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [16:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:05] T300774: Drop fr_img_* 
columns - https://phabricator.wikimedia.org/T300774 [16:45:06] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T300774)', diff saved to https://phabricator.wikimedia.org/P21497 and previous config saved to /var/cache/conftool/dbconfig/20220224-164506-kormat.json [16:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:02] (03CR) 10JHathaway: wmflib: new firmware facts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/765534 (owner: 10Jbond) [16:46:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:46:31] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:55] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:47:11] (03PS2) 10Cathal Mooney: Change CR policy for creating aggregate Anycast routes [homer/public] - 10https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) [16:47:17] (03Merged) 10jenkins-bot: Revert "miscweb: bump staging to 2022-02-11-214428-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/765382 (owner: 10JMeybohm) [16:47:46] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300774)', diff saved to https://phabricator.wikimedia.org/P21498 and previous config saved to /var/cache/conftool/dbconfig/20220224-164745-kormat.json [16:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:57] (03PS3) 10Cathal Mooney: Change 
CR policy for creating aggregate Anycast routes [homer/public] - 10https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) [16:48:52] (03CR) 10Ssingh: [V: 03+1 C: 03+2] bird: set interface for link-local BGP IPv6 sessions [puppet] - 10https://gerrit.wikimedia.org/r/765544 (owner: 10Ssingh) [16:50:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10RobH) [16:50:12] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:30] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10RobH) a:05RobH→03Jclark-ctr >>! In T301177#7734357, @LSobanski wrote: > Fine by me. Thanks, updated th... [16:50:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2079.codfw.wmnet with OS bullseye [16:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2079.codfw.wmnet with OS bullseye comple... [16:51:29] (03PS1) 10Vgutierrez: Drop Namecheap/Comodo CAA records for policy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/765569 [16:59:17] (03CR) 10BBlack: [C: 03+1] "LGTM, thank you!" 
[dns] - 10https://gerrit.wikimedia.org/r/765569 (owner: 10Vgutierrez) [16:59:26] (03CR) 10Vgutierrez: [C: 03+2] Drop Namecheap/Comodo CAA records for policy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/765569 (owner: 10Vgutierrez) [16:59:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2080.codfw.wmnet with OS bullseye [16:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2080.codfw.wmnet with OS bullseye [17:00:05] jbond and rzl: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220224T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:48] nothing to do ✅ [17:02:50] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P21499 and previous config saved to /var/cache/conftool/dbconfig/20220224-170250-kormat.json [17:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:07] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:06:22] (03PS1) 10Giuseppe Lavagetto: deployment_server::kubernetes: use the correct docker firewall class [puppet] - 10https://gerrit.wikimedia.org/r/765571 [17:07:11] (03PS2) 10Krinkle: wmf-config: Use __DIR__ instead of re-using an unintended global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761963 (https://phabricator.wikimedia.org/T45956) [17:08:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment_server::kubernetes: use the correct docker firewall class [puppet] - 10https://gerrit.wikimedia.org/r/765571 (owner: 10Giuseppe Lavagetto) [17:08:12] (03PS10) 10JHathaway: exim: add the ability to silently drop senders [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) [17:08:57] * Krinkle staging on mwdebug1002 [17:11:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10Volans) [17:11:17] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:11:20] (03CR) 10JHathaway: [C: 03+2] exim: add the ability to silently drop senders [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [17:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:34] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:11:39] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:43] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:59] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2080.codfw.wmnet with reason: host reimage [17:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:54] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Few small suggestions in line, slight improvements to the shell commands I'd tinkered with yesterday." [cookbooks] - 10https://gerrit.wikimedia.org/r/765309 (owner: 10Muehlenhoff) [17:17:55] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P21500 and previous config saved to /var/cache/conftool/dbconfig/20220224-171755-kormat.json [17:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2080.codfw.wmnet with reason: host reimage [17:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Jclark-ctr) cloudcephosd1025 E4 U21 cloudcephosd1026 E4 U22 cloudcephosd1027 E4 U23 cloudcephosd1028 E4 U24 cloudcephosd1029 E4 U25 cloud... [17:22:57] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts elastic[1039,1043].eqiad.wmnet [17:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:21] (03CR) 10Dzahn: "Yes, please do. 
I could indeed not deploy this and did not know yet why that is." [deployment-charts] - 10https://gerrit.wikimedia.org/r/765382 (owner: 10JMeybohm) [17:28:24] (03PS1) 10JMeybohm: trafficserver: change miscweb backend to k8s-ingress-wikikube [puppet] - 10https://gerrit.wikimedia.org/r/765572 (https://phabricator.wikimedia.org/T290966) [17:30:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2080.codfw.wmnet with OS bullseye [17:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2080.codfw.wmnet with OS bullseye comple... [17:32:07] !log krinkle@deploy1002 Synchronized wmf-config/: Ia61fea4d0dcf86d51547d3132093a336ab3f2e9f (duration: 00m 52s) [17:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:00] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300774)', diff saved to https://phabricator.wikimedia.org/P21501 and previous config saved to /var/cache/conftool/dbconfig/20220224-173259-kormat.json [17:33:01] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [17:33:03] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [17:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:08] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [17:33:08] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T300774)', diff saved to https://phabricator.wikimedia.org/P21502 and previous config saved to 
/var/cache/conftool/dbconfig/20220224-173307-kormat.json [17:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:34] (03CR) 10Jbond: wmflib: new firmware facts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/765534 (owner: 10Jbond) [17:34:56] (03PS1) 10Ebernhardson: mjolnir: Add python3-swiftclient for debugging [puppet] - 10https://gerrit.wikimedia.org/r/765573 [17:35:48] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T300774)', diff saved to https://phabricator.wikimedia.org/P21503 and previous config saved to /var/cache/conftool/dbconfig/20220224-173548-kormat.json [17:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:37] !log `truncate -s 1g /var/log/auth.log` on krb1001 to free space on the root partition [17:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:57] (03CR) 10Krinkle: [C: 03+2] wmf-config: Use __DIR__ instead of re-using an unintended global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761963 (https://phabricator.wikimedia.org/T45956) (owner: 10Krinkle) [17:39:36] (03Merged) 10jenkins-bot: wmf-config: Use __DIR__ instead of re-using an unintended global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761963 (https://phabricator.wikimedia.org/T45956) (owner: 10Krinkle) [17:39:58] 10ops-eqiad, 10decommission-hardware: decommission elastic10[32-47].eqiad.wmnet - https://phabricator.wikimedia.org/T302517 (10RKemper) p:05Triage→03High [17:40:36] !log `truncate -s 1g /var/log/auth.log.1` on krb1001 to free space on the root partition [17:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:42] (03CR) 10JHathaway: wmflib: new firmware facts (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/765534 (owner: 10Jbond) [17:41:32] (03PS1) 10Jbond: firmware fact: drop firmware_bios [puppet] - 10https://gerrit.wikimedia.org/r/765574 [17:42:38] (03CR) 10jerkins-bot: [V: 04-1] firmware fact: drop firmware_bios [puppet] - 10https://gerrit.wikimedia.org/r/765574 (owner: 10Jbond) [17:43:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2081.codfw.wmnet with OS bullseye [17:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2081.codfw.wmnet with OS bullseye [17:43:25] (03CR) 10Dzahn: [C: 03+1] "checked the new cert has the static-bugzilla SAN on it:" [puppet] - 10https://gerrit.wikimedia.org/r/765572 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [17:43:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:10] (03PS1) 10Ryan Kemper: elastic: officially decom 10[32-47] [puppet] - 10https://gerrit.wikimedia.org/r/765575 (https://phabricator.wikimedia.org/T294805) [17:44:15] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts elastic[1039,1043].eqiad.wmnet [17:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:44:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:45:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host elastic2082.mgmt.codfw.wmnet with reboot policy FORCED [17:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:45] (03PS2) 10Jbond: firmware fact: drop firmware_bios [puppet] - 10https://gerrit.wikimedia.org/r/765574 [17:48:41] (03PS2) 10Kevin Bazira: ml-services: add itwiki, jawiki & kowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/765501 (https://phabricator.wikimedia.org/T301415) [17:48:56] (03CR) 10Jbond: firmware fact: drop firmware_bios (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765574 (owner: 10Jbond) [17:50:51] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:50:53] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P21504 and previous config saved to /var/cache/conftool/dbconfig/20220224-175052-kormat.json [17:50:56] (03CR) 10Elukey: ml-services: add itwiki, jawiki & kowiki editquality isvcs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/765501 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [17:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:11] (03CR) 10Bking: [C: 03+1] elastic: officially decom 10[32-47] [puppet] - 10https://gerrit.wikimedia.org/r/765575 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [17:54:13] (03PS1) 
10Ebernhardson: cirrus: Reduce write isolation to only cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765577 (https://phabricator.wikimedia.org/T295705) [17:54:33] (03CR) 10Elukey: [C: 03+2] ml-services: add itwiki, jawiki & kowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/765501 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [17:54:58] (03CR) 10jerkins-bot: [V: 04-1] cirrus: Reduce write isolation to only cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765577 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [17:56:50] (03CR) 10JHathaway: [C: 03+1] firmware fact: drop firmware_bios (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765574 (owner: 10Jbond) [17:58:41] (03CR) 10Hnowlan: [C: 03+1] "one question, otherwise lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/759894 (owner: 10MSantos) [17:59:33] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [17:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:13] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [18:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2081.codfw.wmnet with reason: host reimage [18:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:37] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[18:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:13] (03CR) 10Ebernhardson: "Dependent patch has been merged, shipped in version 0.3.104. spot checking wcqs instances shows we have 104 has been deployed and instance" [puppet] - 10https://gerrit.wikimedia.org/r/762527 (owner: 10Ebernhardson) [18:02:25] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [18:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:40] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission elastic10[32-47].eqiad.wmnet - https://phabricator.wikimedia.org/T302517 (10RKemper) [18:02:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2082.mgmt.codfw.wmnet with reboot policy FORCED [18:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2081.codfw.wmnet with reason: host reimage [18:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host elastic2083.mgmt.codfw.wmnet with reboot policy FORCED [18:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:21] (03CR) 10Jbond: firmware fact: drop firmware_bios (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765574 (owner: 10Jbond) [18:05:44] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission elastic10[32-47].eqiad.wmnet - https://phabricator.wikimedia.org/T302517 (10RKemper) [18:05:56] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb={CREATE,PATCH} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:05:58] !log 
kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P21506 and previous config saved to /var/cache/conftool/dbconfig/20220224-180557-kormat.json [18:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:58] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:09:13] (03CR) 10Jbond: wmflib: new firmware facts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765534 (owner: 10Jbond) [18:09:18] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission elastic10[32-47].eqiad.wmnet - https://phabricator.wikimedia.org/T302517 (10RKemper) [18:09:23] 10SRE, 10LDAP-Access-Requests: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (10KFrancis) @Dzahn and @Ammarpad I am confirming the NDA has been signed. 
Please proceed with the access request [18:10:43] (03CR) 10Ryan Kemper: [C: 03+2] elastic: officially decom 10[32-47] [puppet] - 10https://gerrit.wikimedia.org/r/765575 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [18:13:04] (03CR) 10Jbond: "i plan to leave this until i'm back from vacation but others are more than welcome to progress in my absence if needed" [puppet] - 10https://gerrit.wikimedia.org/r/765574 (owner: 10Jbond) [18:13:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2081.codfw.wmnet with OS bullseye [18:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:55] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2081.codfw.wmnet with OS bullseye comple... [18:18:19] 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10Patch-For-Review, 10Sustainability (Incident Followup): Only generate maxlag from pooled query service servers. 
- https://phabricator.wikimedia.org/T238751 (10RKemper) [18:19:35] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10observability, 10serviceops: Upgrade Kafka to 2.x - https://phabricator.wikimedia.org/T300102 (10EChetty) [18:20:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2083.mgmt.codfw.wmnet with reboot policy FORCED [18:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:03] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T300774)', diff saved to https://phabricator.wikimedia.org/P21508 and previous config saved to /var/cache/conftool/dbconfig/20220224-182102-kormat.json [18:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:08] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [18:21:33] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) @colewhite lemme know what you prefer to test :) [18:21:47] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2082.codfw.wmnet with OS bullseye [18:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:46] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2082.codfw.wmnet with OS bullseye [18:23:28] (03CR) 10Andrew Bogott: [C: 04-1] "I think we should try to be consistent about 'domain' vs 'zone' terminology." 
[puppet] - 10https://gerrit.wikimedia.org/r/762871 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [18:23:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10Papaul) [18:27:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host elastic2084.mgmt.codfw.wmnet with reboot policy FORCED [18:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:47] (03PS5) 10Herron: prometheus: sketch out proxied prometheus web with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [18:28:07] (03CR) 10jerkins-bot: [V: 04-1] prometheus: sketch out proxied prometheus web with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [18:29:08] (03PS6) 10Herron: prometheus: sketch out proxied prometheus web with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [18:29:10] (03CR) 10Bking: [C: 03+1] mjolnir: Add python3-swiftclient for debugging [puppet] - 10https://gerrit.wikimedia.org/r/765573 (owner: 10Ebernhardson) [18:29:36] (03CR) 10Ryan Kemper: [C: 03+2] mjolnir: Add python3-swiftclient for debugging [puppet] - 10https://gerrit.wikimedia.org/r/765573 (owner: 10Ebernhardson) [18:29:40] (03CR) 10jerkins-bot: [V: 04-1] prometheus: sketch out proxied prometheus web with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [18:31:58] (03PS1) 10Ebernhardson: mjolnir: Remove support for < stretch [puppet] - 10https://gerrit.wikimedia.org/r/765580 [18:32:17] (03CR) 10jerkins-bot: [V: 04-1] mjolnir: Remove support for < stretch [puppet] - 10https://gerrit.wikimedia.org/r/765580 (owner: 10Ebernhardson) [18:33:04] (03PS2) 10Ryan Kemper: mjolnir: Remove support for < stretch [puppet] - 
10https://gerrit.wikimedia.org/r/765580 (owner: 10Ebernhardson) [18:36:39] (03PS3) 10Ryan Kemper: mjolnir: Remove support for < stretch [puppet] - 10https://gerrit.wikimedia.org/r/765580 (owner: 10Ebernhardson) [18:36:51] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/765580 (owner: 10Ebernhardson) [18:37:04] (03PS5) 10Majavah: dynamicproxy: manage dns in the api [puppet] - 10https://gerrit.wikimedia.org/r/762871 (https://phabricator.wikimedia.org/T295246) [18:38:08] (03CR) 10Majavah: dynamicproxy: manage dns in the api (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/762871 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [18:39:42] (03PS1) 10Volans: wmf-netbox: fix UnboundLocalError [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/765581 [18:39:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2082.codfw.wmnet with reason: host reimage [18:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:27] (03PS4) 10Ryan Kemper: mjolnir: Remove support for < stretch [puppet] - 10https://gerrit.wikimedia.org/r/765580 (owner: 10Ebernhardson) [18:40:37] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/765580 (owner: 10Ebernhardson) [18:41:49] (03CR) 10Andrew Bogott: dynamicproxy: manage dns in the api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/762871 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [18:42:58] (03PS3) 10Ryan Kemper: cirrus: Alert when the rate of pages fixed by Saneitizer is too high [puppet] - 10https://gerrit.wikimedia.org/r/763573 (https://phabricator.wikimedia.org/T295365) (owner: 10Ebernhardson) [18:43:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2082.codfw.wmnet with reason: host reimage [18:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:30] (03CR) 
10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/763573 (https://phabricator.wikimedia.org/T295365) (owner: 10Ebernhardson) [18:43:40] (03PS6) 10Majavah: dynamicproxy: manage dns in the api [puppet] - 10https://gerrit.wikimedia.org/r/762871 (https://phabricator.wikimedia.org/T295246) [18:43:44] (03CR) 10Dzahn: [C: 04-1] "As others have said, it seems the best way forward to put your check command into a timer/service (trying to avoid even calling it cron an" [puppet] - 10https://gerrit.wikimedia.org/r/764464 (owner: 10Andrew Bogott) [18:43:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2084.mgmt.codfw.wmnet with reboot policy FORCED [18:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:15] (03CR) 10Majavah: dynamicproxy: manage dns in the api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/762871 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [18:45:04] (03CR) 10Cathal Mooney: [C: 03+1] "Suspect we never hit this as parent its always come first, but fix is correct thanks!" 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/765581 (owner: 10Volans) [18:45:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host elastic2085.mgmt.codfw.wmnet with reboot policy FORCED [18:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:24] (03CR) 10Volans: [V: 03+2 C: 03+2] wmf-netbox: fix UnboundLocalError [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/765581 (owner: 10Volans) [18:46:29] (03PS4) 10Ryan Kemper: cirrus: Alert when the rate of pages fixed by Saneitizer is too high [puppet] - 10https://gerrit.wikimedia.org/r/763573 (https://phabricator.wikimedia.org/T295365) (owner: 10Ebernhardson) [18:46:36] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/763573 (https://phabricator.wikimedia.org/T295365) (owner: 10Ebernhardson) [18:46:49] (03CR) 10Ryan Kemper: [C: 03+2] mjolnir: Remove support for < stretch [puppet] - 10https://gerrit.wikimedia.org/r/765580 (owner: 10Ebernhardson) [18:47:10] (03CR) 10Andrew Bogott: [C: 03+2] dynamicproxy: manage dns in the api [puppet] - 10https://gerrit.wikimedia.org/r/762871 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [18:51:16] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:51:41] (03PS1) 10Majavah: dynamicproxy: fix method call [puppet] - 10https://gerrit.wikimedia.org/r/765584 [18:51:47] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2085.mgmt.codfw.wmnet with reboot policy FORCED [18:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host elastic2085.mgmt.codfw.wmnet with reboot policy FORCED [18:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:21] (03PS2) 10Ebernhardson: 
mjolnir: Restore prometheus_port parameter [puppet] - 10https://gerrit.wikimedia.org/r/764872 (https://phabricator.wikimedia.org/T301873) [18:52:54] (03PS5) 10Ryan Kemper: cirrus: Alert when the rate of pages fixed by Saneitizer is too high [puppet] - 10https://gerrit.wikimedia.org/r/763573 (https://phabricator.wikimedia.org/T295365) (owner: 10Ebernhardson) [18:53:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2082.codfw.wmnet with OS bullseye [18:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:34] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2082.codfw.wmnet with OS bullseye comple... [18:53:55] (03CR) 10Ryan Kemper: [C: 03+2] cirrus: Alert when the rate of pages fixed by Saneitizer is too high [puppet] - 10https://gerrit.wikimedia.org/r/763573 (https://phabricator.wikimedia.org/T295365) (owner: 10Ebernhardson) [18:54:33] (03CR) 10Dzahn: [C: 03+2] "tested 3 new queries and they all worked and were fast" [puppet] - 10https://gerrit.wikimedia.org/r/765245 (https://phabricator.wikimedia.org/T302385) (owner: 10Aklapper) [18:55:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2085.mgmt.codfw.wmnet with reboot policy FORCED [18:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2083.codfw.wmnet with OS bullseye [18:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2083.codfw.wmnet with OS bullseye [18:58:42] (03CR) 10Ryan Kemper: "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resour" [puppet] - 10https://gerrit.wikimedia.org/r/763573 (https://phabricator.wikimedia.org/T295365) (owner: 10Ebernhardson) [19:00:05] dduvall and hashar: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220224T1900). [19:00:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2084.codfw.wmnet with OS bullseye [19:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:22] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2084.codfw.wmnet with OS bullseye [19:00:26] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:45] (03PS1) 10Ryan Kemper: cirrus: add docu link for saneitizer alert [puppet] - 10https://gerrit.wikimedia.org/r/765585 (https://phabricator.wikimedia.org/T295365) [19:03:11] hashar: o/ rolling shortly [19:03:31] (03PS2) 10Ryan Kemper: cirrus: add docu link for saneitizer alert [puppet] - 10https://gerrit.wikimedia.org/r/765585 (https://phabricator.wikimedia.org/T295365) [19:03:34] transient failure for Check systemd state on netbox1001, it should recover [19:03:57] (03PS3) 10Ryan Kemper: cirrus: add docu link for saneitizer alert [puppet] - 10https://gerrit.wikimedia.org/r/765585 (https://phabricator.wikimedia.org/T295365) 
[19:04:10] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] cirrus: add docu link for saneitizer alert [puppet] - 10https://gerrit.wikimedia.org/r/765585 (https://phabricator.wikimedia.org/T295365) (owner: 10Ryan Kemper) [19:04:20] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:27] (03PS1) 10Dduvall: all wikis to 1.38.0-wmf.23 refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765586 [19:04:29] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.38.0-wmf.23 refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765586 (owner: 10Dduvall) [19:05:16] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.23 refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765586 (owner: 10Dduvall) [19:05:20] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission elastic10[32-47].eqiad.wmnet - https://phabricator.wikimedia.org/T302517 (10RKemper) [19:05:41] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission elastic10[32-47].eqiad.wmnet - https://phabricator.wikimedia.org/T302517 (10RKemper) Homer issues resolved, this decom ticket is ready to be worked. Note elastic1039 and elastic1043 likely need their disks manually wiped. 
[19:06:32] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.23 refs T300199 [19:06:35] (03PS1) 10Ryan Kemper: Revert "cirrus: add docu link for saneitizer alert" [puppet] - 10https://gerrit.wikimedia.org/r/765383 [19:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:10] T300199: 1.38.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T300199 [19:07:28] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "cirrus: add docu link for saneitizer alert" [puppet] - 10https://gerrit.wikimedia.org/r/765383 (owner: 10Ryan Kemper) [19:07:48] (03PS1) 10Ryan Kemper: Revert "cirrus: Alert when the rate of pages fixed by Saneitizer is too high" [puppet] - 10https://gerrit.wikimedia.org/r/765384 [19:08:06] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "cirrus: Alert when the rate of pages fixed by Saneitizer is too high" [puppet] - 10https://gerrit.wikimedia.org/r/765384 (owner: 10Ryan Kemper) [19:09:33] hmm. `Table 'wikishared.change_tag' doesn't exist`. that's a little worrying [19:10:27] only one error, however [19:11:50] (03PS1) 10Majavah: Revert "dynamicproxy: manage dns in the api" [puppet] - 10https://gerrit.wikimedia.org/r/765587 [19:16:43] (03PS1) 10Ryan Kemper: cirrus: Alert when # pages Saneitizer fixes high [puppet] - 10https://gerrit.wikimedia.org/r/765588 (https://phabricator.wikimedia.org/T295365) [19:20:06] (03PS2) 10Ryan Kemper: cirrus: Alert when # pages Saneitizer fixes high [puppet] - 10https://gerrit.wikimedia.org/r/765588 (https://phabricator.wikimedia.org/T295365) [19:21:58] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33989/console" [puppet] - 10https://gerrit.wikimedia.org/r/765588 (https://phabricator.wikimedia.org/T295365) (owner: 10Ryan Kemper) [19:22:33] dduvall: great. I am not there tonight though. 
[19:23:08] things have been quiet apparently, nobody raised any concern to me. [19:23:14] (03CR) 10Andrew Bogott: [C: 03+2] Revert "dynamicproxy: manage dns in the api" [puppet] - 10https://gerrit.wikimedia.org/r/765587 (owner: 10Majavah) [19:23:59] (03PS1) 10Dzahn: admin: add ammarpad to ldap_only admins (nda) [puppet] - 10https://gerrit.wikimedia.org/r/765589 (https://phabricator.wikimedia.org/T302250) [19:24:32] (03CR) 10Bking: [C: 03+1] cirrus: Alert when # pages Saneitizer fixes high [puppet] - 10https://gerrit.wikimedia.org/r/765588 (https://phabricator.wikimedia.org/T295365) (owner: 10Ryan Kemper) [19:24:46] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] cirrus: Alert when # pages Saneitizer fixes high [puppet] - 10https://gerrit.wikimedia.org/r/765588 (https://phabricator.wikimedia.org/T295365) (owner: 10Ryan Kemper) [19:25:40] (03CR) 10Dzahn: "@mvernon also see [mwmaint1002:~] $ ldapsearch -x mail=ammarpad*" [puppet] - 10https://gerrit.wikimedia.org/r/765589 (https://phabricator.wikimedia.org/T302250) (owner: 10Dzahn) [19:26:32] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (10Dzahn) Thank you @KFrancis as always. Uploaded code change for clinic duty person and reviewers to pick up. [19:27:57] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [19:30:24] (03PS3) 10Ryan Kemper: mjolnir: Restore prometheus_port parameter [puppet] - 10https://gerrit.wikimedia.org/r/764872 (https://phabricator.wikimedia.org/T301873) (owner: 10Ebernhardson) [19:30:33] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/764872 (https://phabricator.wikimedia.org/T301873) (owner: 10Ebernhardson) [19:30:56] hashar: no problem. 
yeah, looks good from here as well [19:31:34] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33990/console" [puppet] - 10https://gerrit.wikimedia.org/r/764872 (https://phabricator.wikimedia.org/T301873) (owner: 10Ebernhardson) [19:31:55] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+1] mjolnir: Restore prometheus_port parameter [puppet] - 10https://gerrit.wikimedia.org/r/764872 (https://phabricator.wikimedia.org/T301873) (owner: 10Ebernhardson) [19:32:01] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] mjolnir: Restore prometheus_port parameter [puppet] - 10https://gerrit.wikimedia.org/r/764872 (https://phabricator.wikimedia.org/T301873) (owner: 10Ebernhardson) [19:37:06] (03PS16) 10AGueyte: Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [19:42:32] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:44:00] (03PS6) 10Ryan Kemper: query_service: Simplify jvm arg handling [puppet] - 10https://gerrit.wikimedia.org/r/761080 (https://phabricator.wikimedia.org/T302526) (owner: 10Ebernhardson) [19:46:36] !log T302526 Disabling puppet across entire query service (wdqs & wcqs) fleet for merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/761080: `ryankemper@cumin1001:~$ sudo -E cumin 'w*qs*' 'disable-puppet "query_service: Simply jvm arg handling - T302526"'` [19:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:44] T302526: query_service: Simply jvm arg handling - https://phabricator.wikimedia.org/T302526 [19:48:27] !log T302526 Running puppet on wdqs canary: `ryankemper@wdqs1003:~$ sudo enable-puppet "query_service: Simply jvm arg handling - T302526" && sudo run-puppet-agent` 
[19:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:50] !log T302526 (Forgot to merge patch first, take two) [19:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:07] (03CR) 10Ryan Kemper: [C: 03+2] query_service: Simplify jvm arg handling [puppet] - 10https://gerrit.wikimedia.org/r/761080 (https://phabricator.wikimedia.org/T302526) (owner: 10Ebernhardson) [19:55:23] !log T302526 Depooled canary `wdqs1003`, ran puppet agent, and restarted `wdqs-blazegraph`. Tests look good, proceeding to rest of wdqs fleet [19:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:29] T302526: query_service: Simply jvm arg handling - https://phabricator.wikimedia.org/T302526 [19:56:58] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:57:25] !log T302526 `ryankemper@cumin1001:~$ sudo -E cumin -b 6 'wdqs*' 'enable-puppet "query_service: Simply jvm arg handling - T302526"; sudo run-puppet-agent'` in tmux `deploy_window` [19:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:36] (JobUnavailable) resolved: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [20:02:45] !log T302526 Depooled `wcqs1001`, ran puppet agent, and restarted `wcqs-blazegraph`. 
Service came up healthy, proceeding to rest of wcqs fleet [20:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:52] T302526: query_service: Simply jvm arg handling - https://phabricator.wikimedia.org/T302526 [20:04:00] !log T302526 `ryankemper@cumin1001:~$ sudo -E cumin -b 3 'wcqs*' 'enable-puppet "query_service: Simply jvm arg handling - T302526"; sudo run-puppet-agent'` in tmux `wcqs` [20:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:13] (03PS2) 10Ryan Kemper: wcqs: Provide access token secret to blazegraph logging [puppet] - 10https://gerrit.wikimedia.org/r/762527 (owner: 10Ebernhardson) [20:05:44] (03PS3) 10Ryan Kemper: wcqs: Provide access token secret to blazegraph logging [puppet] - 10https://gerrit.wikimedia.org/r/762527 (owner: 10Ebernhardson) [20:08:21] (03CR) 10Ryan Kemper: [C: 03+2] wcqs: Provide access token secret to blazegraph logging [puppet] - 10https://gerrit.wikimedia.org/r/762527 (owner: 10Ebernhardson) [20:10:13] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2083.codfw.wmnet with OS bullseye [20:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2083.codfw.wmnet with OS bullseye execut... 
[20:12:56] (03PS1) 10Majavah: dynamicproxy: manage DNS records [puppet] - 10https://gerrit.wikimedia.org/r/765591 [20:14:33] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2084.codfw.wmnet with OS bullseye [20:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2084.codfw.wmnet with OS bullseye execut... [20:14:40] (03CR) 10jerkins-bot: [V: 04-1] dynamicproxy: manage DNS records [puppet] - 10https://gerrit.wikimedia.org/r/765591 (owner: 10Majavah) [20:15:13] (03PS2) 10Majavah: dynamicproxy: manage DNS records [puppet] - 10https://gerrit.wikimedia.org/r/765591 [20:17:53] (03Abandoned) 10Majavah: dynamicproxy: fix method call [puppet] - 10https://gerrit.wikimedia.org/r/765584 (owner: 10Majavah) [20:18:05] (03CR) 10Andrew Bogott: [C: 03+2] dynamicproxy: manage DNS records [puppet] - 10https://gerrit.wikimedia.org/r/765591 (owner: 10Majavah) [20:27:16] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:28] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:38:16] Hmm.. I wonder if I broke something. [20:40:50] dancy: what you actually done? Just pushed the train? 
[20:47:38] (03PS1) 10Andrew Bogott: cinder backups: remove wikipathways and wikilink [puppet] - 10https://gerrit.wikimedia.org/r/765599 [20:48:11] (03PS1) 10Papaul: Add elastic208[3-6] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/765600 (https://phabricator.wikimedia.org/T299608) [20:49:10] (03CR) 10Papaul: [C: 03+2] Add elastic208[3-6] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/765600 (https://phabricator.wikimedia.org/T299608) (owner: 10Papaul) [20:51:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2084.codfw.wmnet with OS bullseye [20:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2084.codfw.wmn... [20:53:27] (03CR) 10Andrew Bogott: [C: 03+2] cinder backups: remove wikipathways and wikilink [puppet] - 10https://gerrit.wikimedia.org/r/765599 (owner: 10Andrew Bogott) [20:57:39] AndyRussG: hi, are you planning to try fixing the qqx thing in this window? 
i'm around in case i can help [20:58:13] Let me push a revert revert revert ready [20:58:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2084.codfw.wmnet with reason: host reimage [20:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:29] RhinosF1: i made on, on master [20:58:33] i made one* [20:58:36] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [20:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:51] MatmaRex: we already have a revert merged in master [20:58:57] Just not to the deployment branch [20:59:03] (PS1) RhinosF1: Revert "Revert "Revert "Show message fallback keys when using &uselang=qqx""" [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/765626 [20:59:04] MatmaRex: RhinosF1 hiii! thanks so much [20:59:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2083.codfw.wmnet with OS bullseye [20:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:23] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work), Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2083.codfw.wmn... [20:59:24] MatmaRex: RhinosF1 if you think it's ok to try again now, that'd be fantastic [20:59:30] MatmaRex: I just pushed one to deployment branch [20:59:33] RhinosF1: you have a revert on master, and a revert + re-apply on wmf.23 [20:59:40] yeah, okay [20:59:46] so this is just a revert again [20:59:52] (apologies I just got back from getting my kids from school, and the other team member who knows about CentralNotice is out today) [20:59:53] MatmaRex: this a revert again ye [21:00:05] brennen: May I have your attention please! UTC late backport and config training.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220224T2100) [21:00:05] ebernhardson: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:13] AndyRussG: yeah, it's fine if you weren't planning to, i just wanted to ask [21:00:17] here [21:00:20] RhinosF1 MatmaRex I am fully available to test now, yes it'd be great to do so [21:00:31] I just have no idea where the error came from [21:00:37] MatmaRex + andyrussG also has a patch brennen but no calendar [21:01:04] AndyRussG: neither do I but I dont know central notice well enough to guess mid window [21:01:11] ebernhardson: o/ [21:01:15] Hopefully it doesn't happen now [21:01:17] and though I understand how the revert should fix the bug, I'm not the dev who initially submitted the revert (he's the one who's out today) nor am I the one who +2'ed [21:01:23] o/ [21:01:24] RhinosF1: pretty sure it's not a centralnotice issue [21:01:30] at least not whatever appeared in the deployment tests [21:01:41] how about if we try pushing to a debug host and hammer on it a bit? 
[21:01:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2084.codfw.wmnet with reason: host reimage [21:01:53] AndyRussG: probably some stupid cache what caused the deployment issues this morning [21:01:54] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 03m 18s) [21:02:02] ahhh ok [21:02:09] brennen or thcipriani should be able to do it now [21:02:15] We have a window that just opened [21:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:30] cool yeah that'd be great, thx so much once again RhinosF1 and thx in advance brennen thcipriani [21:02:57] mepps: are you watching today again? [21:03:11] AndyRussG: hopefully I didn't scare the trainees this morning [21:03:26] (CI broke as well for unrelated reasons) [21:04:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2083.codfw.wmnet with reason: host reimage [21:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:21] RhinosF1: we are all hanging out in deployment training :D [21:04:32] thcipriani: nice :) [21:05:00] ah heheh trial by fire [21:05:07] !log removing 4 files for legal compliance [21:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:17] If they want to follow on, https://en.wikipedia.org/wiki/Wikipedia?banner=B2122_0131_itIT_dsk_p2_sm_twin1&force=1&country=US is the url we used to test this morning our patch [21:05:44] thcipriani: \o [21:05:49] RhinosF1: btw is that the URL that caused the error specifically, or was it some sort of generic test?
[21:05:57] AndyRussG: that url [21:05:58] hey ebernhardson welcome [21:06:04] ok got it thx [21:06:06] I tried both you gave me [21:06:09] Same error [21:06:11] But that first [21:06:21] (PS2) Ebernhardson: cirrus: Reduce write isolation to only cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/765577 (https://phabricator.wikimedia.org/T295705) [21:06:21] Tried on meta too instead of enwp [21:06:27] hmmm okok [21:06:49] thcipriani: do we want to +2 the backport first because it'll take like 15 minutes to merge [21:06:53] should do the same thing regardless of what wiki it's on, since the background request for the banner content always goes to meta anyway [21:07:00] (PS3) Ebernhardson: cirrus: Reduce write isolation to only cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/765577 (https://phabricator.wikimedia.org/T295705) [21:07:16] yeah I guessed that but thought I'd try [21:07:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2083.codfw.wmnet with reason: host reimage [21:07:28] right [21:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:44] RhinosF1: AndyRussG sorry, catching up on backscroll -- where's the backport? I don't see it on the calendar [21:07:46] also if it was only that URL, and the error is from the in-banner JS, then leaving it deployed should be harmless [21:07:56] thcipriani: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/765626 [21:08:03] ah, thanks [21:08:04] thcipriani: not yet there, I can add it in tho [21:08:10] AndyRussG: that'd be great <3 [21:08:26] RhinosF1 just saw your ping from earlier--hi!
yeah i'm in the training call :) [21:08:33] and hi AndyRussG :) [21:08:42] AndyRussG: https://en.wikipedia.org/wiki/Wikipedia?banner=B2122_0119_enWW_dsk_p1_lg_template&country=US is the other url I tried [21:08:49] mepps: hope you enjoy it [21:09:05] hey mepps :) [21:09:39] AndyRussG: could I get you to +1 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/765626 for me? [21:10:27] thcipriani: master has already been merged [21:10:40] ah, great, I didn't see the cherry-pick line [21:10:50] (CR) Thcipriani: [C: +2] Revert "Revert "Revert "Show message fallback keys when using &uselang=qqx""" [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/765626 (owner: RhinosF1) [21:10:54] (CR) Brennen Bearnes: [C: +2] cirrus: Reduce write isolation to only cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/765577 (https://phabricator.wikimedia.org/T295705) (owner: Ebernhardson) [21:11:03] RhinosF1: pretty sure that error occurred for you, because you have the language set to British English [21:11:21] MatmaRex: oh, I love the quirks that causes sometimes [21:11:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2084.codfw.wmnet with OS bullseye [21:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:31] AndyRussG: apparently the code in that banner can't display the numbers in British English, which isn't great, but probably not a big deal [21:11:32] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2084.codfw.wmnet with OS bullseye comple...
[21:11:38] (Merged) jenkins-bot: cirrus: Reduce write isolation to only cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/765577 (https://phabricator.wikimedia.org/T295705) (owner: Ebernhardson) [21:12:01] MatmaRex RhinosF1 oh that'd explain a lot! [21:12:16] MatmaRex: RhinosF1 let's add uselang=en to the URL then [21:12:22] brennen: nothing i can really test on this one, the config var is only referenced from job runners [21:12:27] me + Reedy are like the only people who do British English [21:12:45] Chad does too [21:12:50] in shell [21:13:05] SRE-OnFire (FY2021/2022-Q2): 2021-11-10 cirrussearch commonsfile outage - https://phabricator.wikimedia.org/T299967 (herron) Open→Resolved Scorecard has been filled in based on the info in the incident report [21:13:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2085.codfw.wmnet with OS bullseye [21:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:21] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2085.codfw.wmnet with OS bullseye [21:13:27] SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (herron) [21:13:31] ebernhardson: ack. will deploy shortly.
[21:13:53] MatmaRex: we should probably file a task or report somewhere for the British English issues [21:14:01] nothing wrong with using British English, but… when you see an error complaining about "Incorrect locale information provided", and it's coming from a URL with "en-gb" in it, then think about it ;) [21:14:33] RhinosF1: it's a bug in that specific banner [21:14:50] RhinosF1 MatmaRex I recall hearing some discussion about it but I don't remember what Fundraising component it was, but I can bring it up with the team [21:15:00] AndyRussG: great! [21:15:03] (well, and maybe anything else that shares the code, but i think it's not in the repos) [21:15:05] oh yeah it was for banner content indeed [21:15:09] yeah we'll dig in [21:15:36] * RhinosF1 is not very knowledgeable on central notice [21:15:40] the banners themselves are not coded up by fr-tech, and most work on them is not tracked in Phab [21:16:03] fr-tech just handles CN, which is just for choosing which banner to show and injecting, and the banner admin system [21:16:31] SRE, SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (SCherukuwada) I'm working on this, please stay tuned on this ticket; ETA tomorrow. [21:16:51] AndyRussG: you (or whoever writes those banners) might have similar issues in other languages. like pt-br or zh-tw etc. [21:17:33] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:11] review every place you use mw.centralNotice.data.uselang, and make sure that the code using it can handle dashes [21:18:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2083.codfw.wmnet with OS bullseye [21:18:24] MatmaRex: yeah hmmm now I'm thinking it might actually have been a caching issue...
yes we do have language variant-specific banners... usually issues are caught quickly by the banner team [21:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:25] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2083.codfw.wmnet with OS bullseye comple... [21:19:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2086.codfw.wmnet with OS bullseye [21:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:16] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2086.codfw.wmnet with OS bullseye [21:19:33] !log removing 1 file for legal compliance [21:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:03] (Merged) jenkins-bot: Revert "Revert "Revert "Show message fallback keys when using &uselang=qqx""" [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/765626 (owner: RhinosF1) [21:26:17] AndyRussG, MatmaRex, brennen: ^ [21:26:19] (K I added the revert revert revert to the calendar calendar calendar) [21:26:30] cool!
:) [21:26:39] \o/ [21:26:55] \o/ [21:27:23] !log phabricator - disabling git repo rGEDS (Elasticdash) - only one commit from 2015 - T296022 [21:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:31] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [21:27:41] https://en.wikipedia.org/wiki/Wikipedia?banner=B2122_0131_itIT_dsk_p2_sm_twin1&force=1&country=US is the testing url for those that don't have it [21:28:18] RhinosF1: Sorry for the late response. Earlier today I built the mediawiki container images in a new way, so I was wondering if that alert might have been related. [21:28:31] Need to find some logs [21:28:35] dancy: ah! [21:29:07] !log brennen@deploy1002 Synchronized wmf-config/CirrusSearch-production.php: Config: [[gerrit:765577|cirrus: Reduce write isolation to only cloudelastic (T295705)]] (duration: 00m 55s) [21:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:13] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [21:29:41] RhinosF1: let's make it https://en.wikipedia.org/wiki/Wikipedia?banner=B2122_0131_itIT_dsk_p2_sm_twin1&force=1&country=US&uselang=en k? [21:30:02] AndyRussG: no problem. 
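The advice above to check every use of mw.centralNotice.data.uselang against dashed codes such as en-gb, pt-br or zh-tw can be sketched roughly as follows. This is only an illustration: the banner's real code does not appear in this log, the helper name formatForLang is hypothetical, and it assumes the banner formats numbers with the standard Intl.NumberFormat API, which throws a RangeError for a language tag it cannot parse.

```typescript
// Hypothetical helper, not the actual banner code: format a number for the
// language code a banner receives (e.g. from mw.centralNotice.data.uselang).
function formatForLang(value: number, uselang: string): string {
  // Base language is whatever precedes the first dash or underscore,
  // so "en-gb" and "en_gb" both reduce to "en".
  const base = uselang.split(/[-_]/)[0];
  // Try the full code first, then the base language, then plain English.
  // The final "en" fallback is an assumption made for this sketch.
  const candidates = [uselang, base, "en"];
  for (const tag of candidates) {
    try {
      return new Intl.NumberFormat(tag).format(value);
    } catch (err) {
      // Intl throws RangeError for tags it cannot parse; rethrow anything else.
      if (!(err instanceof RangeError)) {
        throw err;
      }
    }
  }
  // Unreachable in practice because "en" is always accepted; kept for type safety.
  return String(value);
}

console.log(formatForLang(1234.5, "en-gb"));
console.log(formatForLang(1234.5, "en_gb"));
```

Trying the full code before the base language keeps variant-specific formatting where the runtime supports it, while a malformed code degrades to a readable number instead of an error in the banner.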
[21:30:32] It should be on the debug server for you to check in a moment [21:30:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2085.codfw.wmnet with reason: host reimage [21:30:35] you can test both en and en-gb, it will probably be fine for en, and you'll see the same error for en-gb [21:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:47] but it's a pre-existing issue [21:31:04] there's trainees so releng will be walking them through everything [21:31:09] jouncebot now [21:31:10] For the next 0 hour(s) and 28 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220224T2100) [21:31:55] RhinosF1, AndyRussG: been on other screens walking folks through deploy [21:32:14] brennen: no problem [21:32:15] patch is synched to mwdebug? should i pick up once tested? [21:32:32] brennen: which one? [21:32:59] K yeah also ready to test on a debug server [21:33:30] looking at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/765626 [21:34:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2085.codfw.wmnet with reason: host reimage [21:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:34] if it breaks and then later we need to add two more "revert"s to the commit message it might be worth it [21:34:47] lol [21:34:47] brennen: ye? It's just been merged by Jenkins. Which debug server is it at? Can you let us know when it's there? 
[21:34:59] RhinosF1: syncing to mwdebug1002 shortly, one second [21:35:11] dancy: I wonder what the longest string of reverts was [21:35:46] RhinosF1 brennen was just saying in the call he thought there'd been 7-8 [21:35:59] mepps: hehe [21:36:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2086.codfw.wmnet with reason: host reimage [21:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:41] RhinosF1, AndyRussG: on mwdebug1002 [21:37:54] MatmaRex: ^ [21:38:42] brennen: lgtm!!! :) :) [21:38:53] cool, synching [21:39:00] I'm not monitoring any logs to see the site isn't broken tho eh brennen [21:39:16] as long as the error went [21:39:21] yes? [21:39:21] We should be ok [21:39:32] MatmaRex: patch is at debug now [21:39:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2086.codfw.wmnet with reason: host reimage [21:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:49] !log brennen@deploy1002 Synchronized php-1.38.0-wmf.23/includes: Backport: [[gerrit:765626|Revert "Revert "Revert "Show message fallback keys when using &uselang=qqx"""]] (duration: 00m 57s) [21:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:26] !log phabricator - disabled git repo "frig" - outdated fundraising stuff, checked with fr-tech, not needed T296022 [21:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:32] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [21:42:39] ok, seems like we're done. 
[21:42:46] AndyRussG: should be live now [21:42:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2085.codfw.wmnet with OS bullseye [21:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:53] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2085.codfw.wmnet with OS bullseye comple... [21:42:55] mepps: hope you enjoyed the fun I created [21:43:06] brennen: Can I take the deploy server now? [21:43:09] !log removing 1 file for legal compliance [21:43:12] dancy: all yours [21:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:14] thx [21:43:18] AndyRussG, MatmaRex, brennen: thanks for your help [21:43:32] and I hope our trainees had some fun [21:43:45] !log dancy@deploy1002 Started scap: testing scap container image building [21:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:58] RhinosF1: fantasmic thanks so much eh!!!!! [21:44:06] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:09] yeah looks great on live too, just tested with other banners that had the issues, all fine [21:44:15] RhinosF1 :) [21:44:34] AndyRussG: perfect! [21:44:53] mepps: please shout up if you lot have any Qs [21:45:04] I around for another like half an hour [21:45:30] Test completed.
[21:45:43] !log end of UTC late backport & config window [21:45:43] that's much appreciated RhinosF1, still absorbing at the moment [21:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:51] No problem [21:46:12] thx so much again brennen MatmaRex RhinosF1 [21:46:22] np [21:46:24] any time [21:46:27] :) :) [21:50:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2086.codfw.wmnet with OS bullseye [21:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:11] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2086.codfw.wmnet with OS bullseye comple... [21:53:08] (CR) Dzahn: [C: +2] "This is how to get content for bug 40023:" [puppet] - https://gerrit.wikimedia.org/r/765572 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm) [21:53:44] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (Papaul) [21:54:36] (CR) Dzahn: "also worked from cp1079" [puppet] - https://gerrit.wikimedia.org/r/765572 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm) [21:54:59] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (Papaul) Open→Resolved @Gehel @RKemper all yours [21:57:51] (PS1) Ahmon Dancy: scap.cfg.erb: Add container image build settings [puppet] - https://gerrit.wikimedia.org/r/765624 (https://phabricator.wikimedia.org/T297673) [21:58:44] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0)
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:02:30] SRE, Patch-For-Review, Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (Dzahn) deployed @JMeybohm's change https://gerrit.wikimedia.org/r/c/operations/puppet/+/765572 Now this is behind the new istio ingress. I can see fresh traffic here: {F3496... [22:05:05] !log phabricator - disabled git repo - labs-tools-harvesting-data-refinery/repository/master/ [22:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:05] !log static-bugzilla.wikimedia.org - kubernetes - deployed gerrit:765572 - first prod service behind a k8s ingress (T290966) [22:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:10] T290966: Implement POC for istio ingress - https://phabricator.wikimedia.org/T290966 [22:07:58] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: cloudcontrol1003, elastic2086, cloudcontrol1005, elastic2083, deneb, cloudcontrol1004 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [22:27:41] (CR) Dzahn: "on deneb on every puppet run: Notice: /Stage[main]/Package_builder/Package[pkg-js-tools]/ensure: created (corrective)" [puppet] - https://gerrit.wikimedia.org/r/765250 (owner: Jbond) [22:28:36] (CR) Ladsgroup: "I understand it sounds stupid but isn't warning lower than error?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/764359 (https://phabricator.wikimedia.org/T281451) (owner: Krinkle) [22:30:47] (CR) Dzahn: "https://phabricator.wikimedia.org/T302460" [puppet] - https://gerrit.wikimedia.org/r/765250 (owner: Jbond) [22:30:53] (CR) Aaron Schulz: [C: +1] "I assume that "increase logging" means "increase the scope/volume of logging"" [mediawiki-config] - https://gerrit.wikimedia.org/r/764359 (https://phabricator.wikimedia.org/T281451) (owner: Krinkle) [22:33:56] (PS1) Jbond: Revert "C:package_builder: install tools to build node packages" [puppet] - https://gerrit.wikimedia.org/r/765629 [22:34:16] (CR) jerkins-bot: [V: -1] Revert "C:package_builder: install tools to build node packages" [puppet] - https://gerrit.wikimedia.org/r/765629 (owner: Jbond) [22:36:06] jbond: we did not expect you to be awake at this time but I think people in -releng appreciate this revert:) [22:36:32] no probs will need a new patch but will just take a sec [22:37:49] (PS1) Jbond: C:package_builder: revert 765250 [puppet] - https://gerrit.wikimedia.org/r/765648 [22:38:15] mutante: can you review [22:39:27] jbond: is the node-babel7 part related? [22:39:43] or just not installing pkg-js-tools ..that part seems definitely +1 [22:40:33] mutante: yes i introduced both in the same patch and the cn both go [22:40:35] (CR) Dzahn: [C: +1] C:package_builder: revert 765250 [puppet] - https://gerrit.wikimedia.org/r/765648 (owner: Jbond) [22:40:38] (CR) Jbond: [C: +2] C:package_builder: revert 765250 [puppet] - https://gerrit.wikimedia.org/r/765648 (owner: Jbond) [22:40:45] +1, ty [22:43:02] confirmed on deneb. the "change on every run" is gone. we should soon see recovery of the "widespread" failures alert. but that is only caused by one more host because at all times we are close to the threshold, due to unrelated broken hosts [22:43:19] dancy: good in cloud? [22:43:27] testing...
[22:44:40] Two puppet runs in a row on both machines.. lintian remains. Thanks ! [22:45:51] also no problem on build2001 (bullseye) fwiw [22:46:31] jbond: the worst part was clearly "Breaks: funny-manpages (<< 1.3-5.1)," just kidding. good night!:) thanks [22:46:51] that's "man baby" or something :) [22:47:33] lol [22:47:49] (Juniper alarm active) firing: Alert for device lsw3-codfw.mgmt.codfw.wmnet - Juniper alarm active - https://alerts.wikimedia.org [22:51:07] (CR) Dzahn: "can see traffic and logs, screenshots at https://phabricator.wikimedia.org/T290966#7736435" [puppet] - https://gerrit.wikimedia.org/r/765572 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm) [23:01:31] SRE, Wiki Loves Monuments 2022, Wikimedia-Mailing-lists, User-Ladsgroup: Request for creation: WLM-Network Mailing List - https://phabricator.wikimedia.org/T302510 (Ladsgroup) a:Ladsgroup Public or private mailing list? With or without archive? [23:07:47] (PS1) Ebernhardson: query_service: Repair oauth activation [puppet] - https://gerrit.wikimedia.org/r/765652 (https://phabricator.wikimedia.org/T302526) [23:09:08] (CR) jerkins-bot: [V: -1] query_service: Repair oauth activation [puppet] - https://gerrit.wikimedia.org/r/765652 (https://phabricator.wikimedia.org/T302526) (owner: Ebernhardson) [23:11:05] (PS2) Ebernhardson: query_service: Repair oauth activation [puppet] - https://gerrit.wikimedia.org/r/765652 (https://phabricator.wikimedia.org/T302526) [23:14:37] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/765652 (https://phabricator.wikimedia.org/T302526) (owner: Ebernhardson) [23:17:56] SRE, Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (bking) At least two (and sometimes three) instances of ES are using the same rundir, `/var/run/elasticsearch` . Because systemd works in parallel, a single failed instance...
[23:18:43] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 66 probes of 663 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:25:07] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 61 probes of 663 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:25:33] (CR) Ryan Kemper: [C: +1] query_service: Repair oauth activation [puppet] - https://gerrit.wikimedia.org/r/765652 (https://phabricator.wikimedia.org/T302526) (owner: Ebernhardson) [23:25:38] (CR) Ryan Kemper: [C: +2] query_service: Repair oauth activation [puppet] - https://gerrit.wikimedia.org/r/765652 (https://phabricator.wikimedia.org/T302526) (owner: Ebernhardson) [23:35:28] !log T302526 Deployed https://gerrit.wikimedia.org/r/765652 and ran puppet across wcqs* [23:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:37] T302526: query_service: Simply jvm arg handling - https://phabricator.wikimedia.org/T302526 [23:50:49] (CR) Cwhite: [C: +1] prometheus: ditch automatic icmp probes for service::catalog [puppet] - https://gerrit.wikimedia.org/r/765548 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [23:59:52] SRE, Infrastructure-Foundations, Mail, Znuny, vrts: Not receiving VRT notification emails - https://phabricator.wikimedia.org/T302139 (Dzahn) I gave people on #wikmedia-vrt the summary of the incident report basically. Since some were wondering why they got mails at once etc. There is a doc...