[00:19:37] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:23:17] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:33:59] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [00:43:31] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:10:11] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:13:57] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:35:09] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:40:55] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.106 second response 
time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:48:39] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:50:35] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:44:05] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:45:11] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:45:19] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 1835 MB (6% inode=95%): /tmp 1835 MB (6% inode=95%): /var/tmp 1835 MB (6% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [02:46:00] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:48:01] PROBLEM - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.codfw.wmnet - HTTPS IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:49:51] RECOVERY - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.codfw.wmnet - HTTPS IPv4 on kartotherian.svc.codfw.wmnet is OK: OK - Certificate kartotherian.discovery.wmnet will expire on Wed 13 
Dec 2023 11:06:02 AM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:03:47] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:05:43] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:06:15] RECOVERY - Disk space on elastic1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [03:24:29] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [03:25:17] !log investigating PHD failure [03:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:23] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [03:26:36] !log restarted phd on phab1001 [03:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:27:37] SRE, Infrastructure-Foundations, CAS-SSO, Patch-For-Review, User-jbond: Cookbook for centralised logouts and session status queries - https://phabricator.wikimedia.org/T283242 (Legoktm) Should services like Gerrit, Mailman, etc. be added to this?
[03:30:49] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:32:41] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:54:30] (CR) Legoktm: [C: +1] "Oops, and thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/704995 (https://phabricator.wikimedia.org/T286797) (owner: Urbanecm) [05:49:23] PROBLEM - SSH on mw1297.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:08:31] (CR) ArielGlenn: [C: +1] "Fine by me." [puppet] - https://gerrit.wikimedia.org/r/705096 (https://phabricator.wikimedia.org/T201491) (owner: DannyS712) [06:16:35] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:18:29] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:50:13] RECOVERY - SSH on mw1297.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:52:28] (PS2) Elukey: profile::kubernetes::master: add comments and improve hiera lookups [puppet] - https://gerrit.wikimedia.org/r/704831
(https://phabricator.wikimedia.org/T285927) [06:52:30] (PS1) Elukey: profile::kubernetes::master: add panel numbers to grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/705338 [06:55:01] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:58:06] (PS1) Filippo Giunchedi: hieradata: add role/public_endpoint for o11y services [puppet] - https://gerrit.wikimedia.org/r/705342 [06:58:08] (PS1) Filippo Giunchedi: hieradata: add o11y services to service::catalog [puppet] - https://gerrit.wikimedia.org/r/705343 [07:02:35] (Abandoned) Filippo Giunchedi: hieradata: add public o11y services to service::catalog [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676391 (owner: Filippo Giunchedi) [07:04:13] SRE, Analytics-Radar, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) @razzi @herron do you think that we can setup a quick meeting to discuss the next...
[07:06:51] (PS3) Elukey: profile::kubernetes::master: add comments and improve hiera lookups [puppet] - https://gerrit.wikimedia.org/r/704831 (https://phabricator.wikimedia.org/T285927) [07:06:53] (PS2) Elukey: profile::kubernetes::master: add panel numbers to grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/705338 [07:11:29] !log roll restart kafka mirror maker on kafka-main200* hosts - stuck after Friday's events/incident [07:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:56] mmmm no it didn't really help, weird [07:24:47] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org [07:38:03] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:44:47] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org [07:55:49] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:03:50] (PS1) Vgutierrez: admin_state: Depool codfw text [dns] - https://gerrit.wikimedia.org/r/705348 (https://phabricator.wikimedia.org/T286787) [08:06:46] (PS2) KartikMistry: Configure the Event Platform backend to accept events in the content_translation_event stream [mediawiki-config] - https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) [08:08:48] (PS3) KartikMistry: Configure the Event Platform backend to accept events in the content_translation_event stream [mediawiki-config] - https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) [08:09:40] (CR) Ayounsi: [C: +1] admin_state: Depool codfw text [dns] - https://gerrit.wikimedia.org/r/705348 (https://phabricator.wikimedia.org/T286787) (owner: Vgutierrez) [08:12:13] (CR) Giuseppe Lavagetto: [C:
+1] admin_state: Depool codfw text [dns] - https://gerrit.wikimedia.org/r/705348 (https://phabricator.wikimedia.org/T286787) (owner: Vgutierrez) [08:13:25] (CR) Vgutierrez: [C: +2] admin_state: Depool codfw text [dns] - https://gerrit.wikimedia.org/r/705348 (https://phabricator.wikimedia.org/T286787) (owner: Vgutierrez) [08:13:35] (PS1) Filippo Giunchedi: switchdc: remove thanos from excluded services [cookbooks] - https://gerrit.wikimedia.org/r/705349 (https://phabricator.wikimedia.org/T285273) [08:15:30] !log depool codfw text traffic [08:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:00] (PS4) ArielGlenn: add dumps-roots to the dumpsdata roles so people in that group get access [puppet] - https://gerrit.wikimedia.org/r/705006 (https://phabricator.wikimedia.org/T277629) [08:16:55] (CR) ArielGlenn: [C: +2] add dumps-roots to the dumpsdata roles so people in that group get access [puppet] - https://gerrit.wikimedia.org/r/705006 (https://phabricator.wikimedia.org/T277629) (owner: ArielGlenn) [08:17:12] (CR) jerkins-bot: [V: -1] switchdc: remove thanos from excluded services [cookbooks] - https://gerrit.wikimedia.org/r/705349 (https://phabricator.wikimedia.org/T285273) (owner: Filippo Giunchedi) [08:18:29] (CR) DCausse: "This listener was recently added for WDQS in https://gerrit.wikimedia.org/r/c/operations/puppet/+/676329 so I doubt it's being used for an" [puppet] - https://gerrit.wikimedia.org/r/704960 (https://phabricator.wikimedia.org/T265526) (owner: DCausse) [08:19:16] (PS2) Filippo Giunchedi: switchdc: remove thanos from excluded services [cookbooks] - https://gerrit.wikimedia.org/r/705349 (https://phabricator.wikimedia.org/T285273) [08:22:43] (CR) Filippo Giunchedi: "Idea LGTM, you might want to add the same to thanos-swift since the configuration is quite similar.
I can't meaningfully comment on whethe" [puppet] - https://gerrit.wikimedia.org/r/704960 (https://phabricator.wikimedia.org/T265526) (owner: DCausse) [08:23:45] (CR) Filippo Giunchedi: [C: +2] webperf: ingest navtiming & coal logs in Logstash [puppet] - https://gerrit.wikimedia.org/r/705030 (https://phabricator.wikimedia.org/T285897) (owner: Dave Pifke) [08:25:34] (CR) Filippo Giunchedi: [C: +1] "I don't feel like I can meaningfully comment, but seems sensible from a quick look" [puppet] - https://gerrit.wikimedia.org/r/705018 (owner: Cwhite) [08:26:32] (Abandoned) Filippo Giunchedi: profile: fix swift symlink for WMCS LVs [puppet] - https://gerrit.wikimedia.org/r/516791 (owner: Filippo Giunchedi) [08:27:13] (Abandoned) Filippo Giunchedi: swift: use implicit /dev/swift prefix for swift devices [puppet] - https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) (owner: Filippo Giunchedi) [08:27:28] ops-codfw, DC-Ops, Traffic: lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (Vgutierrez) [08:28:16] ops-codfw, DC-Ops, Traffic: lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (ayounsi) Other rows need to be audited as well.
[08:30:02] ops-codfw, DC-Ops, Traffic, Sustainability (Incident Followup): lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (Majavah) [08:30:04] (Abandoned) Filippo Giunchedi: swift: run swift-drive-audit staggered once a day [puppet] - https://gerrit.wikimedia.org/r/270970 (https://phabricator.wikimedia.org/T126574) (owner: Filippo Giunchedi) [08:37:21] (PS1) Filippo Giunchedi: profile: fix rsyslog lookup table json syntax [puppet] - https://gerrit.wikimedia.org/r/705350 (https://phabricator.wikimedia.org/T285897) [08:40:58] (CR) JMeybohm: [C: +1] profile::kubernetes::master: add panel numbers to grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/705338 (owner: Elukey) [08:43:48] ops-codfw, DC-Ops, Traffic, Sustainability (Incident Followup): lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (Vgutierrez) >>! In T286879#7220243, @ayounsi wrote: > Other rows need to be audited as well. You're right, I've created T28...
[08:44:44] (CR) Filippo Giunchedi: [C: +2] profile: fix rsyslog lookup table json syntax [puppet] - https://gerrit.wikimedia.org/r/705350 (https://phabricator.wikimedia.org/T285897) (owner: Filippo Giunchedi) [08:46:17] SRE, DC-Ops, Traffic, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (RhinosF1) [08:47:32] SRE, ops-codfw, DC-Ops, Traffic, Sustainability (Incident Followup): lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (Vgutierrez) [08:47:35] SRE, DC-Ops, Traffic, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (Vgutierrez) [09:02:11] (CR) Santhosh: [C: -1] "Need review from Neil" (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: KartikMistry) [09:13:30] (PS1) Filippo Giunchedi: rake: add generic json syntax check to CI [puppet] - https://gerrit.wikimedia.org/r/705351 (https://phabricator.wikimedia.org/T286882) [09:13:32] (PS1) Filippo Giunchedi: rake: replace conftool_schema with generic json syntax [puppet] - https://gerrit.wikimedia.org/r/705352 (https://phabricator.wikimedia.org/T286882) [09:28:50] SRE, Wikimedia-Mailing-lists, translatewiki.net, Patch-For-Review: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (abi_) [09:30:25] !log bounce prometheus@k8s* on prometheus2004 due to cache not refreshing alert [09:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:44] SRE, Wikimedia-Mailing-lists, translatewiki.net, Language-Team (Language-2021-July-September), Patch-For-Review: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (abi_) a:abi_ [09:31:55] PROBLEM - Prometheus k8s cache not
updating on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [09:32:43] SRE, Wikimedia-Mailing-lists, translatewiki.net, Language-Team (Language-2021-July-September), Patch-For-Review: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (abi_) The patch has been deployed, and we can now translate the project here: https://transl... [09:35:11] (PS2) Volans: rake: add generic json syntax check to CI [puppet] - https://gerrit.wikimedia.org/r/705351 (https://phabricator.wikimedia.org/T286882) (owner: Filippo Giunchedi) [09:35:23] (CR) Volans: "PS2 is to test a failure scenario" [puppet] - https://gerrit.wikimedia.org/r/705351 (https://phabricator.wikimedia.org/T286882) (owner: Filippo Giunchedi) [09:35:37] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:35:39] (CR) jerkins-bot: [V: -1] rake: add generic json syntax check to CI [puppet] - https://gerrit.wikimedia.org/r/705351 (https://phabricator.wikimedia.org/T286882) (owner: Filippo Giunchedi) [09:37:51] PROBLEM - Prometheus prometheus2004/k8s restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [09:38:23] PROBLEM - Prometheus prometheus2004/k8s-mlserve restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s-mlserve [09:39:51] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:39:51] (PS3) Volans: rake: add generic json syntax check to CI [puppet] - https://gerrit.wikimedia.org/r/705351 (https://phabricator.wikimedia.org/T286882) (owner: Filippo Giunchedi) [09:40:02] (CR) Volans: "and now a success one" [puppet] - https://gerrit.wikimedia.org/r/705351 (https://phabricator.wikimedia.org/T286882) (owner: Filippo Giunchedi) [09:41:37] RECOVERY - Prometheus k8s cache not updating on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [09:41:44] (PS3) Cathal Mooney: Added optional ability to enable uRPF filtering on arbitary CR ints [homer/public] - https://gerrit.wikimedia.org/r/702446 (https://phabricator.wikimedia.org/T285461) [09:42:21] (PS4) Volans: rake: add generic json syntax check to CI [puppet] - https://gerrit.wikimedia.org/r/705351 (https://phabricator.wikimedia.org/T286882) (owner: Filippo Giunchedi) [09:42:29] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:43:37] !log Running homer against cr2-eqdfw to change descr and move interface ae0, which connects to Facebook, into the external-links group. [09:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:20] (CR) Volans: [C: +1] "The change does what it says, see PS2 and PS3 for a failure/success scenario. One comment inline."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/705351 (https://phabricator.wikimedia.org/T286882) (owner: Filippo Giunchedi) [09:45:38] (CR) MSantos: [C: +1] Maps: filter out non-administrative boundaries on OSM import [puppet] - https://gerrit.wikimedia.org/r/704784 (owner: Jgiannelos) [09:48:45] (PS4) Jelto: prometheus::ops add jobs and ferm rule to scrape gitlab metrics [puppet] - https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) [09:50:24] (PS1) Btullis: Add the kstart package to all kerberos clients [puppet] - https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) [09:52:17] !log imported megacli for bullseye-wikimedia T282272 T275873 [09:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:25] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 [09:52:25] T282272: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 [09:57:57] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:59:26] (CR) Elukey: Add the kstart package to all kerberos clients (2 comments) [puppet] - https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: Btullis) [09:59:51] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:01:01] RECOVERY - Prometheus prometheus2004/k8s restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [10:01:33] RECOVERY - Prometheus prometheus2004/k8s-mlserve restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s-mlserve [10:14:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host copernicium.wikimedia.org [10:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:25] (CR) Btullis: Add the kstart package to all kerberos clients (2 comments) [puppet] - https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: Btullis) [10:19:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host copernicium.wikimedia.org [10:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:41] (PS1) Volans: Fix group assignement in CAS-SSO support [software/netbox] - https://gerrit.wikimedia.org/r/705358 [10:25:43] (CR) Majavah: "I'm testing this on the metricsinfra puppetmaster, appears to work fine:" [puppet] - https://gerrit.wikimedia.org/r/705184 (owner: Majavah) [10:29:24] SRE, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (MoritzMuehlenhoff) [10:30:05] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210719T1030).
[10:40:07] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/704889 (owner: Muehlenhoff) [10:48:59] (CR) Jbond: [C: +1] "this lgtm, however i wonder if its overkill, as currently we can do, and if i understand everything works as expected." [puppet] - https://gerrit.wikimedia.org/r/704890 (owner: Muehlenhoff) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European mid-day backport windowYour patch may or may not be deployed at the sole discretion of the deployer . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210719T1100). [11:00:05] Martaannaj, michaelcochez, and DannyS712: A patch you scheduled for European mid-day backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:08] o/ [11:00:13] I can deploy the beta config change [11:00:23] (well, “deploy” ;) ) [11:00:59] Martaannaj, michaelcochez: are you here? [11:01:19] did the bot message break in the second part? [11:01:31] yes! [11:01:35] I am here. [11:01:38] I feel like some text got eaten [11:01:39] Asartea: not really break, someone just added a sentence to the Wikitech template and the bot copies it [11:01:50] ah okay [11:01:54] I think on Wikitech there’s a
that the bot strips out [11:02:23] (PS6) Lucas Werkmeister (WMDE): Add config for updated PropertySuggester beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098) (owner: Martaannaj) [11:02:31] (CR) Lucas Werkmeister (WMDE): [C: +2] Add config for updated PropertySuggester beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098) (owner: Martaannaj) [11:03:13] (Merged) jenkins-bot: Add config for updated PropertySuggester beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098) (owner: Martaannaj) [11:04:17] (PS1) Vgutierrez: acme_chief: Avoid hitting authdns2001 [puppet] - https://gerrit.wikimedia.org/r/705359 [11:05:12] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:703205|Add config for updated PropertySuggester beta cluster (T285098)]] (beta-only) (duration: 00m 57s) [11:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:18] T285098: Production A/B test deployment - Improved Property Suggester/Recommender - https://phabricator.wikimedia.org/T285098 [11:06:03] SRE, Infrastructure-Foundations: Setup new mirror server - https://phabricator.wikimedia.org/T286898 (MoritzMuehlenhoff) [11:06:17] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:06:19] SRE, Infrastructure-Foundations: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (MoritzMuehlenhoff) p:Triage→Medium [11:06:44] (CR) Vgutierrez: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/30244/"
[puppet] - https://gerrit.wikimedia.org/r/705359 (owner: Vgutierrez) [11:07:08] Martaannaj: beta update is currently running at https://integration.wikimedia.org/ci/job/beta-scap-sync-world/13711/console if I’m not mistaken [11:08:08] (CR) Vgutierrez: [C: +2] acme_chief: Avoid hitting authdns2001 [puppet] - https://gerrit.wikimedia.org/r/705359 (owner: Vgutierrez) [11:08:10] should be deployed now, I think [11:09:09] Yes, looks that way. I can see the requests incoming to the cloud service [11:09:33] nice [11:09:34] (CR) Jbond: "See inline comments, the unified diff (from git commit -v) and the comments are only present in the editor, they get striped from .git/COM" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/704861 (owner: RLazarus) [11:09:43] (CR) Jbond: [C: -1] puppetmaster: Stop commits to the private repo with empty messages [puppet] - https://gerrit.wikimedia.org/r/704861 (owner: RLazarus) [11:10:48] then I guess we should proceed to the other config change in the window [11:11:07] which frankly to me just looks like a waste of time, but I suppose leaving it unmerged isn’t very useful either [11:11:09] DannyS712: are you around?
[11:12:22] (CR) Jbond: dragonfly: Don't run pki::get_cert in ensure=absent case (1 comment) [puppet] - https://gerrit.wikimedia.org/r/704921 (https://phabricator.wikimedia.org/T286054) (owner: JMeybohm) [11:15:59] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 50 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:18:01] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:21:55] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:27:16] (CR) Jbond: [C: +1] "lgtm" [software/netbox] - https://gerrit.wikimedia.org/r/705358 (owner: Volans) [11:31:30] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/705351 (https://phabricator.wikimedia.org/T286882) (owner: Filippo Giunchedi) [11:31:36] !log EU backport+config window done [11:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:16] (CR) Jbond: [C: +1] "LGTM, might be worth adding one of the conftool maintainers (or is that volans?)
so they are aware" [puppet] - https://gerrit.wikimedia.org/r/705352 (https://phabricator.wikimedia.org/T286882) (owner: Filippo Giunchedi) [11:34:25] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:35:15] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 108 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:36:21] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:36:37] ^ no idea what caused that kartotherian issue - service is depooled and very quiet in codfw afaict [11:37:51] (CR) JMeybohm: dragonfly: Don't run pki::get_cert in ensure=absent case (1 comment) [puppet] - https://gerrit.wikimedia.org/r/704921 (https://phabricator.wikimedia.org/T286054) (owner: JMeybohm) [11:39:05] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:40:17] (CR) JMeybohm: dragonfly: Don't run pki::get_cert in ensure=absent case (1 comment) [puppet] - https://gerrit.wikimedia.org/r/704921 (https://phabricator.wikimedia.org/T286054) (owner: JMeybohm) [11:40:30] !log installing bluez security updates [11:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:31] PROBLEM - SSH on
mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:54:35] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 46 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:55:05] (03CR) 10Jbond: [C: 03+1] "lgtm (curious to see the use case)" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/704343 (owner: 10Volans) [11:56:59] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thanks all!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705351 (https://phabricator.wikimedia.org/T286882) (owner: 10Filippo Giunchedi) [11:58:18] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/705352 (https://phabricator.wikimedia.org/T286882) (owner: 10Filippo Giunchedi) [11:59:33] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:00:06] (03PS4) 10KartikMistry: Add stream configuration for ContentTrnaslation events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) [12:00:16] (03CR) 10KartikMistry: Add stream configuration for ContentTrnaslation events (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: 10KartikMistry) [12:00:51] (03CR) 10Jbond: dragonfly: Don't run pki::get_cert in ensure=absent case (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704921 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [12:01:05] (03CR) 10jerkins-bot: [V: 04-1] Add stream configuration for ContentTrnaslation events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 
(https://phabricator.wikimedia.org/T281982) (owner: 10KartikMistry) [12:02:52] (03PS5) 10KartikMistry: Add stream configuration for ContentTrnaslation events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) [12:04:04] (03PS6) 10Jbond: debian::autostart: function to prevent services autostarting on install [puppet] - 10https://gerrit.wikimedia.org/r/701538 [12:05:47] (03CR) 10Jbond: "I made a small change but this lgtm now" [puppet] - 10https://gerrit.wikimedia.org/r/701538 (owner: 10Jbond) [12:12:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/701538 (owner: 10Jbond) [12:15:10] (03CR) 10Jbond: [C: 03+1] "LGTM and now i see the example 😊" [software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [12:15:51] (03CR) 10Jelto: [V: 03+1] "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [12:18:43] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Auf logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904 (10MoritzMuehlenhoff) [12:19:31] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (10MoritzMuehlenhoff) [12:20:54] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Add logout.d script for lists.wikimedia.org - https://phabricator.wikimedia.org/T286906 (10MoritzMuehlenhoff) [12:21:15] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Cookbook for centralised logouts and session status queries - https://phabricator.wikimedia.org/T283242 (10MoritzMuehlenhoff) >>! In T283242#7220035, @Legoktm wrote: > Should services like Gerrit, Mailman, etc. be added to... 
[12:21:35] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:21:54] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Add logout.d script for lists.wikimedia.org - https://phabricator.wikimedia.org/T286906 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:22:07] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Auf logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:29:20] 10SRE, 10SRE Observability, 10Patch-For-Review: Validate rsyslog json (possibly others) syntax in puppet CI - https://phabricator.wikimedia.org/T286882 (10fgiunchedi) [12:30:14] 10SRE, 10SRE Observability, 10Patch-For-Review: Validate json files syntax in puppet CI - https://phabricator.wikimedia.org/T286882 (10fgiunchedi) [12:30:55] 10SRE, 10Infrastructure-Foundations, 10Phabricator, 10CAS-SSO, 10User-jbond: Auf logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904 (10Majavah) Phabricator doesn't seem to offer this functionality to anyone else than the user itself. Also I don't think there is a way to get this... [12:33:50] (03CR) 10Jgiannelos: "Do you think merging it is going to break the current OSM sync? If not lets merge and see if/when we need to run a new import." [puppet] - 10https://gerrit.wikimedia.org/r/704784 (owner: 10Jgiannelos) [12:34:22] 10SRE, 10Infrastructure-Foundations, 10Phabricator, 10CAS-SSO, 10User-jbond: Auf logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904 (10Majavah) >>! In T286904#7221038, @Majavah wrote: > Phabricator doesn't seem to offer this functionality to anyone else than the user itself. Tur... 
[12:34:36] (03CR) 10MSantos: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/704784 (owner: 10Jgiannelos) [12:35:31] (03CR) 10Jgiannelos: "Ok, so lets wait to merge when its time to run a new import." [puppet] - 10https://gerrit.wikimedia.org/r/704784 (owner: 10Jgiannelos) [12:40:38] 10SRE, 10Infrastructure-Foundations, 10Phabricator, 10CAS-SSO, 10User-jbond: Add logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904 (10RhinosF1) [12:42:22] (03CR) 10Jbond: [C: 03+2] debian::autostart: function to prevent services autostarting on install [puppet] - 10https://gerrit.wikimedia.org/r/701538 (owner: 10Jbond) [12:42:41] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:47:12] (03PS4) 10Jbond: P:tlsproxy::instance: update to use debian::autostart('nginx', false) [puppet] - 10https://gerrit.wikimedia.org/r/701539 [12:47:22] (03PS4) 10Jbond: C:trafficserver: use debian::autostart to prevent auto service start [puppet] - 10https://gerrit.wikimedia.org/r/701545 [12:47:31] (03PS5) 10Jbond: systemd::mask: refactor systemd::mask [puppet] - 10https://gerrit.wikimedia.org/r/701546 [12:49:27] (03PS5) 10Jbond: systemd::umask: drop systemd::umask [puppet] - 10https://gerrit.wikimedia.org/r/701547 [12:50:25] (03PS2) 10Muehlenhoff: Enable debian::autostart on sretest* for some tests [puppet] - 10https://gerrit.wikimedia.org/r/704795 [12:51:57] (03PS2) 10Btullis: Add the kstart package to a kerberos client test [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) [12:52:41] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/704795 (owner: 10Muehlenhoff) [12:54:19] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:01:05] !log dzahn@cumin1001 
conftool action : set/weight=1; selector: name=mw1414.eqiad.wmnet,service=canary [13:01:07] !log sukhe@cumin1001 START - Cookbook sre.hosts.decommission for hosts malmok.wikimedia.org [13:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:26] !log dzahn@cumin1001 conftool action : set/weight=1; selector: name=mw1415.eqiad.wmnet,service=canary [13:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:08] (03PS2) 10Ssingh: acme_chief: remove malmok's SNI and host from Wikidough certs [puppet] - 10https://gerrit.wikimedia.org/r/704125 (https://phabricator.wikimedia.org/T286480) [13:03:30] (03PS3) 10Btullis: Add the kstart package to a kerberos client test [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) [13:03:56] (03PS1) 10Ssingh: site: remove decommissioned host malmok.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/705373 (https://phabricator.wikimedia.org/T286480) [13:03:59] (03PS4) 10Btullis: Add the kstart package to a kerberos client test [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) [13:04:42] jouncebot: now [13:04:42] No deployments scheduled for the next 3 hour(s) and 55 minute(s) [13:04:52] * urbanecm messing up with mwdebug2001 to test something [13:06:16] (03PS4) 10Btullis: Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) [13:06:38] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw127[3-5].eqiad.wmnet [13:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:50] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1272.eqiad.wmnet [13:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:08:56] !log jayme@cumin1001 START - Cookbook sre.dns.netbox [13:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:35] !log sukhe@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts malmok.wikimedia.org [13:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:42] 10SRE, 10Traffic, 10Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1001 for hosts: `malmok.wikimedia.org` - malmok.wikimedia.org (**PASS**) - Downtimed host on Icinga - Found Gan... [13:09:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1270.eqiad.wmnet [13:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:13] * urbanecm done messing up with mwdebug2001 [13:12:16] !log jayme@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:55] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:13:52] (03PS1) 10Ssingh: Remove malmok.wikimedia.org from anycast_neighbors in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/705374 (https://phabricator.wikimedia.org/T286480) [13:14:49] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:20:17] **Failed to run the sre.dns.netbox cookbook**: Cumin execution failed (exit_code=2) [13:20:41] jayme: for you as well, it seems [13:21:34] !log 
dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw1270.eqiad.wmnet [13:21:36] mutante: for me :) [13:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:43] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1270.eqiad.wmnet` - m... [13:22:24] sukhe: that makes 3 of us I guess :) [13:23:18] sukhe: so is the fail just because we were running it at the same time or something else? [13:23:23] :) I was looking how to skip authdns2001 manually and to run it again [13:23:34] ah, of course [13:23:41] that's down , ack [13:24:58] hmm --skip-authdns-update ? [13:26:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1272.eqiad.wmnet [13:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:22] (03PS1) 10Vgutierrez: pybal: Consider forcely pooled hosts on check_pybal_ipvs_diff [puppet] - 10https://gerrit.wikimedia.org/r/705375 (https://phabricator.wikimedia.org/T286913) [13:29:50] (03CR) 10jerkins-bot: [V: 04-1] pybal: Consider forcely pooled hosts on check_pybal_ipvs_diff [puppet] - 10https://gerrit.wikimedia.org/r/705375 (https://phabricator.wikimedia.org/T286913) (owner: 10Vgutierrez) [13:29:56] mutante: ah right! so have you run it already? 
[13:30:04] (03PS1) 10Btullis: Add a CNAME for analytics-presto.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) [13:31:05] !log sukhe@cumin1001 START - Cookbook sre.dns.netbox [13:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:16] (03CR) 10RLazarus: "> Patch Set 1:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/704861 (owner: 10RLazarus) [13:31:48] (03PS2) 10Vgutierrez: pybal: Consider forcely pooled hosts on check_pybal_ipvs_diff [puppet] - 10https://gerrit.wikimedia.org/r/705375 (https://phabricator.wikimedia.org/T286913) [13:32:15] (03CR) 10jerkins-bot: [V: 04-1] pybal: Consider forcely pooled hosts on check_pybal_ipvs_diff [puppet] - 10https://gerrit.wikimedia.org/r/705375 (https://phabricator.wikimedia.org/T286913) (owner: 10Vgutierrez) [13:32:34] !log sukhe@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [13:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:29] 10SRE, 10DNS: Track actions to perform before repooling authdns2001 - https://phabricator.wikimedia.org/T286914 (10Volans) p:05Triage→03High [13:33:30] (03PS3) 10Vgutierrez: pybal: Consider forcely pooled hosts on check_pybal_ipvs_diff [puppet] - 10https://gerrit.wikimedia.org/r/705375 (https://phabricator.wikimedia.org/T286913) [13:33:45] (03PS1) 10Volans: sre.dns.netbox: skip authdns2001 because offline [cookbooks] - 10https://gerrit.wikimedia.org/r/705378 (https://phabricator.wikimedia.org/T286914) [13:35:31] 10SRE, 10DNS, 10Patch-For-Review: Track actions to perform before repooling authdns2001 - https://phabricator.wikimedia.org/T286914 (10Volans) Before repooling authdns2001, once it's back online, we need to: * revert https://gerrit.wikimedia.org/r/705378 * follow https://wikitech.wikimedia.org/wiki/DNS/Netb... 
[13:37:18] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: skip authdns2001 because offline [cookbooks] - 10https://gerrit.wikimedia.org/r/705378 (https://phabricator.wikimedia.org/T286914) (owner: 10Volans) [13:38:05] (03PS4) 10Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) [13:39:52] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10bacula, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) One interesting effect is that, since the datacenter... [13:40:18] (03Merged) 10jenkins-bot: sre.dns.netbox: skip authdns2001 because offline [cookbooks] - 10https://gerrit.wikimedia.org/r/705378 (https://phabricator.wikimedia.org/T286914) (owner: 10Volans) [13:40:23] (03CR) 10Giuseppe Lavagetto: "> Patch Set 1: Code-Review+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/703848 (owner: 10Giuseppe Lavagetto) [13:41:33] !log volans@cumin2002 START - Cookbook sre.dns.netbox [13:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:48] sukhe: no, but I see a fix is coming :) [13:42:05] sukhe, mutante, jayme, effie ^^^ I'm forcing a run to sync the existing changes to all reachable dns hosts [13:42:18] thanks volans ! [13:42:47] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw1272.eqiad.wmnet [13:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:57] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1272.eqiad.wmnet` - m... 
[13:43:08] mutante: last change was the decom of mw1272.eqiad.wmnet, is that correct? [13:43:23] volans: that is correct, mw1270, then mw1272 [13:43:50] will do: mw1270,mw1272,mw1273,mw1274,mw1275 [13:44:26] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:34] all done, you can now run it safely as usual. Btw mutante you can decom multiple hosts at the same time if you want, it's also quicker as it runs only once dns and homer [13:46:23] (03PS5) 10Btullis: Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) [13:46:44] thanks volans! [13:47:04] thank you [13:47:12] !log sukhe@cumin1001 START - Cookbook sre.hosts.decommission for hosts malmok.wikimedia.org [13:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:34] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1273-1275].eqiad.wmnet [13:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:32] * volans really needs to prioritize the locking of cookbooks, not all together as the dns part will allow only one to commit if at the same time :) [13:53:51] volans: again, your crystal ball! [13:54:11] mutante: ^ ok to remove mw[1273-1275].eqiad.wmnet? [13:54:15] normally I prefer one host at a time, also because of the sanity checks if they are still found [13:54:29] sukhe: how did you win that race? 
yes [13:55:11] !log sukhe@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts malmok.wikimedia.org [13:55:12] sukhe: I think I was first anyways :p [13:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:19] 10SRE, 10Traffic, 10Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1001 for hosts: `malmok.wikimedia.org` - malmok.wikimedia.org (**FAIL**) - **Failed downtime host on Icinga (like... [13:55:19] haha [13:55:20] great [13:55:24] (03PS14) 10Elukey: Add support for knative serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) [13:55:26] (03PS7) 10Elukey: WIP - Add kubeflow's kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [13:55:27] hmm [13:55:36] the downtime thing is something else [13:55:58] !log sukhe@cumin1001 START - Cookbook sre.hosts.decommission for hosts malmok.wikimedia.org [13:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:35] (03CR) 10jerkins-bot: [V: 04-1] WIP - Add kubeflow's kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [13:57:02] (03CR) 10Ssingh: [C: 03+2] site: remove decommissioned host malmok.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/705373 (https://phabricator.wikimedia.org/T286480) (owner: 10Ssingh) [13:57:10] 128 Warnings but no errors [13:57:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[1273-1275].eqiad.wmnet [13:57:27] PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The following units failed: dumps-rsyncer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[13:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:34] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1273-1275].eqiad.wmn... [13:58:13] (03CR) 10Ssingh: [C: 03+2] acme_chief: remove malmok's SNI and host from Wikidough certs [puppet] - 10https://gerrit.wikimedia.org/r/704125 (https://phabricator.wikimedia.org/T286480) (owner: 10Ssingh) [13:58:30] (03PS1) 10Vgutierrez: lvs: Set depool_threshold to .66 for upload & text [puppet] - 10https://gerrit.wikimedia.org/r/705381 (https://phabricator.wikimedia.org/T274888) [13:58:55] thanks volans! [13:58:57] vgutierrez: with receipts! [14:01:36] !log sukhe@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts malmok.wikimedia.org [14:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:44] 10SRE, 10Traffic, 10Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1001 for hosts: `malmok.wikimedia.org` - malmok.wikimedia.org (**FAIL**) - **Failed downtime host on Icinga (like... 
[14:04:56] (03CR) 10Dzahn: [C: 03+2] site/conftool: decom mw1270,mw1272,mw1273,mw1274,mw1275 [puppet] - 10https://gerrit.wikimedia.org/r/704966 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [14:05:01] (03PS3) 10Dzahn: site/conftool: decom mw1270,mw1272,mw1273,mw1274,mw1275 [puppet] - 10https://gerrit.wikimedia.org/r/704966 (https://phabricator.wikimedia.org/T280203) [14:07:03] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/704966 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [14:08:18] (03PS2) 10Jelto: site: assign role gitlab to gitlab2001 [puppet] - 10https://gerrit.wikimedia.org/r/704801 (https://phabricator.wikimedia.org/T285870) [14:09:24] (03CR) 10Ssingh: [C: 03+2] lvs: Set depool_threshold to .66 for upload & text [puppet] - 10https://gerrit.wikimedia.org/r/705381 (https://phabricator.wikimedia.org/T274888) (owner: 10Vgutierrez) [14:10:57] (03CR) 10Elukey: Add a CNAME for analytics-presto.eqiad.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [14:14:48] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [14:16:15] (03CR) 10Jelto: [C: 03+2] site: assign role gitlab to gitlab2001 [puppet] - 10https://gerrit.wikimedia.org/r/704801 (https://phabricator.wikimedia.org/T285870) (owner: 10Jelto) [14:16:19] (03CR) 10Elukey: "Left a couple of comments, can you also run the puppet compiler on some nodes to see NO-OPs vs changes?" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:19:16] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [14:20:40] (03PS1) 10JMeybohm: dragonfly::dfdaemon: Fix typo in dfdaemon config template [puppet] - 10https://gerrit.wikimedia.org/r/705382 (https://phabricator.wikimedia.org/T286054) [14:23:01] (03PS2) 10JMeybohm: dragonfly::dfdaemon: Fix typo in dfdaemon config template [puppet] - 10https://gerrit.wikimedia.org/r/705382 (https://phabricator.wikimedia.org/T286054) [14:23:09] !log rolling restart of ulsfo pybal instances [14:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:31] (03CR) 10Btullis: Add the kstart package to a kerberos client test (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:24:06] (03PS1) 10ArielGlenn: swap roles of dumpsdata1001, 1003 in prep for T286065 [puppet] - 10https://gerrit.wikimedia.org/r/705384 [14:25:26] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:25:39] 10SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (10Andrew) cloudcontrol* hosts are all upgraded to 1:10.3.29-0+deb10u1 [14:26:48] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [14:28:26] (03CR) 10JMeybohm: [C: 03+2] dragonfly::dfdaemon: Fix typo in dfdaemon config template [puppet] - 10https://gerrit.wikimedia.org/r/705382 (https://phabricator.wikimedia.org/T286054) (owner: 
10JMeybohm) [14:28:48] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:29:44] (03PS2) 10ArielGlenn: swap roles of dumpsdata1001, 1003 in prep for T286065 [puppet] - 10https://gerrit.wikimedia.org/r/705384 [14:29:48] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:33:00] !log rolling restart of eqsin pybal instances [14:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:10] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [14:34:14] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:35:12] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) > investigate switching ganati cluster certificates to cfssl As far as i can tell the only thing that uses RAPI are netbox and the nrpe check.... 
[14:35:14] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:36:25] these ml people [14:36:33] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [14:36:35] (03CR) 10Herron: [C: 03+1] "cursory check lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/705018 (owner: 10Cwhite) [14:36:39] (03PS1) 10Dzahn: site/conftool: add mw1434,mw1435,mw1436 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/705385 (https://phabricator.wikimedia.org/T279309) [14:36:46] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:37:06] (03CR) 10jerkins-bot: [V: 04-1] site/conftool: add mw1434,mw1435,mw1436 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/705385 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [14:38:52] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [14:39:07] (03PS2) 10Dzahn: site/conftool: add mw1434,mw1435,mw1436 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/705385 (https://phabricator.wikimedia.org/T279309) [14:40:11] (03CR) 10Filippo Giunchedi: Fix NavtimingStaleBeacon false alarms, add test (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/702477 (owner: 10Dave Pifke) [14:40:25] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [14:41:20] (03PS3) 10ArielGlenn: swap roles of dumpsdata1001, 1003 in prep for T286065 [puppet] - 
10https://gerrit.wikimedia.org/r/705384 [14:42:15] (03PS5) 10Btullis: Add the kstart package to a kerberos client test [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) [14:42:27] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 47 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:42:27] !log rolling restart of codfw pybal instances [14:42:32] (03CR) 10Dzahn: [C: 03+2] site/conftool: add mw1434,mw1435,mw1436 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/705385 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [14:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1434-1436].eqiad.wmnet with reason: new host [14:43:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1434-1436].eqiad.wmnet with reason: new host [14:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:29] (03CR) 10Herron: [C: 03+1] hieradata: add role/public_endpoint for o11y services [puppet] - 10https://gerrit.wikimedia.org/r/705342 (owner: 10Filippo Giunchedi) [14:44:48] (03PS4) 10ArielGlenn: swap roles of dumpsdata1001, 1003 in prep for T286065 [puppet] - 10https://gerrit.wikimedia.org/r/705384 [14:45:34] (03CR) 10Herron: [C: 03+1] hieradata: add o11y services to service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/705343 (owner: 10Filippo Giunchedi) [14:45:44] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:45:50] (03CR) 10ArielGlenn: [C: 03+2] swap roles of dumpsdata1001, 1003 in prep for T286065 [puppet] - 10https://gerrit.wikimedia.org/r/705384 (owner: 10ArielGlenn) [14:50:51] 10SRE, 10DNS, 10Traffic: Track actions to perform before repooling authdns2001 - https://phabricator.wikimedia.org/T286914 (10Vgutierrez) [14:52:22] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (10hashar) We can disable an account over ssh with `gerrit set-account --inactive` or via the REST API https://gerrit.wikimedia.org/r/Documentation/rest-api-accounts.html... [14:52:31] 10SRE, 10Gerrit, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (10hashar) [14:53:21] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10herron) sure, sounds good to me! [14:53:23] (03CR) 10Btullis: "Ran an integration test as suggested." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:53:31] (03CR) 10Muehlenhoff: "Looks good, one suggestion inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:56:09] (03PS1) 10Effie Mouzeli: profile::trafficserver: include mwdebug.discovery.wmnet in X-Wikimedia-Debug [puppet] - 10https://gerrit.wikimedia.org/r/705406 (https://phabricator.wikimedia.org/T286491) [14:59:04] !log rolling restart of eqiad pybal instances [14:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:51] (03CR) 10Effie Mouzeli: [C: 03+1] pybal: Consider forcely pooled hosts on check_pybal_ipvs_diff [puppet] - 10https://gerrit.wikimedia.org/r/705375 (https://phabricator.wikimedia.org/T286913) (owner: 10Vgutierrez) [15:00:13] (03CR) 10Muehlenhoff: [C: 03+1] "This was acked in today's Infrastructure Foundations SRE meeting, but all new access groups need an approval line; i.e. the manager who's " [puppet] - 10https://gerrit.wikimedia.org/r/702452 (https://phabricator.wikimedia.org/T285899) (owner: 10Eevans) [15:00:27] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:00:45] (03CR) 10Filippo Giunchedi: "> Thanks for the review!" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [15:01:23] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add role/public_endpoint for o11y services [puppet] - 10https://gerrit.wikimedia.org/r/705342 (owner: 10Filippo Giunchedi) [15:01:28] (03PS2) 10Filippo Giunchedi: hieradata: add role/public_endpoint for o11y services [puppet] - 10https://gerrit.wikimedia.org/r/705342 [15:01:33] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:01:36] (03CR) 10RLazarus: icinga: Add type hints to icinga-status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704875 (owner: 10RLazarus) [15:03:45] (03CR) 10Muehlenhoff: [C: 03+2] Enable debian::autostart on sretest* for some tests [puppet] - 10https://gerrit.wikimedia.org/r/704795 (owner: 10Muehlenhoff) [15:04:24] (03CR) 10Jbond: [C: 03+1] "> Hmm, that's not what I found when I was testing this. 
I just tried it with a commit-msg hook that just cats the file, and it looks like " (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/704861 (owner: 10RLazarus) [15:05:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/705374 (https://phabricator.wikimedia.org/T286480) (owner: 10Ssingh) [15:05:31] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:05:45] (03CR) 10Ssingh: [C: 03+2] pybal: Consider forcely pooled hosts on check_pybal_ipvs_diff [puppet] - 10https://gerrit.wikimedia.org/r/705375 (https://phabricator.wikimedia.org/T286913) (owner: 10Vgutierrez) [15:05:53] (03CR) 10Giuseppe Lavagetto: [C: 03+1] thanos-swift envoy listener: rewrite HTTP host header [puppet] - 10https://gerrit.wikimedia.org/r/704960 (https://phabricator.wikimedia.org/T265526) (owner: 10DCausse) [15:06:06] (03CR) 10Ssingh: [C: 03+2] Remove malmok.wikimedia.org from anycast_neighbors in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/705374 (https://phabricator.wikimedia.org/T286480) (owner: 10Ssingh) [15:06:07] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:07:12] (03Merged) 10jenkins-bot: Remove malmok.wikimedia.org from anycast_neighbors in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/705374 (https://phabricator.wikimedia.org/T286480) (owner: 10Ssingh) [15:10:26] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:10:57] !log +100G to prometheus/ops in codfw [15:11:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:54] !log ran homer for Gerrit 705374: Remove malmok.wikimedia.org from anycast_neighbors in codfw [15:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:33] (03CR) 10Muehlenhoff: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/704890 (owner: 10Muehlenhoff) [15:20:07] (03CR) 10Dave Pifke: Fix NavtimingStaleBeacon false alarms, add test (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/702477 (owner: 10Dave Pifke) [15:20:21] (03PS3) 10Dave Pifke: Fix NavtimingStaleBeacon false alarms, add test [alerts] - 10https://gerrit.wikimedia.org/r/702477 [15:23:04] (03CR) 10Herron: "This change is ready for review." [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/699260 (owner: 10Herron) [15:24:43] (03CR) 10Btullis: "Updated compiler run: https://puppet-compiler.wmflabs.org/compiler1001/30249/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [15:26:14] (03CR) 10Filippo Giunchedi: Fix NavtimingStaleBeacon false alarms, add test (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/702477 (owner: 10Dave Pifke) [15:32:45] (03CR) 10Dave Pifke: Fix NavtimingStaleBeacon false alarms, add test (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/702477 (owner: 10Dave Pifke) [15:37:17] (03PS1) 10Filippo Giunchedi: README.md: document minimum Prometheus version [alerts] - 10https://gerrit.wikimedia.org/r/705412 [15:38:20] (03CR) 10Filippo Giunchedi: [C: 03+2] README.md: document minimum Prometheus version [alerts] - 10https://gerrit.wikimedia.org/r/705412 (owner: 10Filippo Giunchedi) [15:38:50] (03PS2) 10Krinkle: logos/manage.py: Set user-agent on all requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704995 (https://phabricator.wikimedia.org/T286797) (owner: 10Urbanecm) [15:39:27] (03CR) 10Filippo Giunchedi: Fix NavtimingStaleBeacon false alarms, add 
test (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/702477 (owner: 10Dave Pifke) [15:40:37] (03PS3) 10Krinkle: logos/manage.py: Set user-agent on all requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704995 (https://phabricator.wikimedia.org/T286797) (owner: 10Urbanecm) [15:40:39] (03CR) 10Effie Mouzeli: [C: 03+2] thanos-swift envoy listener: rewrite HTTP host header [puppet] - 10https://gerrit.wikimedia.org/r/704960 (https://phabricator.wikimedia.org/T265526) (owner: 10DCausse) [15:40:42] (03CR) 10Krinkle: [C: 03+2] logos/manage.py: Set user-agent on all requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704995 (https://phabricator.wikimedia.org/T286797) (owner: 10Urbanecm) [15:40:55] thanks Krinkle [15:41:54] (03Merged) 10jenkins-bot: logos/manage.py: Set user-agent on all requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704995 (https://phabricator.wikimedia.org/T286797) (owner: 10Urbanecm) [15:43:10] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [15:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:39] (03PS4) 10Krinkle: CommonSettings: Restore wgCSPFalsePositiveUrls for intuition.toolforge.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701094 (https://phabricator.wikimedia.org/T207900) [15:45:46] (03CR) 10Krinkle: [C: 03+2] CommonSettings: Restore wgCSPFalsePositiveUrls for intuition.toolforge.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701094 (https://phabricator.wikimedia.org/T207900) (owner: 10Krinkle) [15:46:29] (03Merged) 10jenkins-bot: CommonSettings: Restore wgCSPFalsePositiveUrls for intuition.toolforge.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701094 (https://phabricator.wikimedia.org/T207900) (owner: 10Krinkle) [15:49:12] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[15:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:35] (03PS2) 10Krinkle: InitialiseSettings: Add toolforge.org to wgNoFollowDomainExceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701003 (https://phabricator.wikimedia.org/T285364) [15:52:44] (03CR) 10Krinkle: [C: 03+2] InitialiseSettings: Add toolforge.org to wgNoFollowDomainExceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701003 (https://phabricator.wikimedia.org/T285364) (owner: 10Krinkle) [15:53:05] !log krinkle@deploy1002 Synchronized wmf-config/CommonSettings.php: I069c7b53 (duration: 00m 58s) [15:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:30] (03Merged) 10jenkins-bot: InitialiseSettings: Add toolforge.org to wgNoFollowDomainExceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701003 (https://phabricator.wikimedia.org/T285364) (owner: 10Krinkle) [15:55:25] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:57:20] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [15:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:57] (03CR) 10Krinkle: "Verified via https://www.mediawiki.org/w/index.php?title=Project%3ASandbox&type=revision&diff=4710837&oldid=4710836. 
rel=nofollow went awa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701003 (https://phabricator.wikimedia.org/T285364) (owner: 10Krinkle) [15:58:36] (03PS4) 10Krinkle: InitialiseSettings: Change wgEntitySchemaShExSimpleUrl to toolforge.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701093 (https://phabricator.wikimedia.org/T285364) [15:59:22] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I2bdfbd258e (duration: 00m 57s) [15:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:29] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:02:12] (03CR) 10Krinkle: [C: 03+2] InitialiseSettings: Change wgEntitySchemaShExSimpleUrl to toolforge.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701093 (https://phabricator.wikimedia.org/T285364) (owner: 10Krinkle) [16:02:36] 10SRE, 10Traffic: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (10Vgutierrez) [16:03:09] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [16:03:15] (03Merged) 10jenkins-bot: InitialiseSettings: Change wgEntitySchemaShExSimpleUrl to toolforge.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701093 (https://phabricator.wikimedia.org/T285364) (owner: 10Krinkle) [16:04:58] (03CR) 10Krinkle: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701093 (https://phabricator.wikimedia.org/T285364) (owner: 10Krinkle) [16:05:10] addshore: I've tested it for you, but async FYI ^ [16:05:49] awesome, thanks! 
[16:08:17] (03CR) 10Krinkle: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [16:08:24] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1434.eqiad.wmnet [16:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:35] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw143[5-6].eqiad.wmnet [16:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:44] 10SRE, 10Performance-Team, 10serviceops, 10MW-1.36-notes, and 3 others: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10Krinkle) [16:11:39] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw143[4-6].eqiad.wmnet [16:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:15] !log mw1434, mw1435, mw1436 - new API appservers in production, pooled first time [16:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:04] (03PS9) 10Juan90264: Use Wikimania's logo in a new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) [16:14:35] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:15:15] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 310be45f7 (duration: 00m 57s) [16:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:25] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:20] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30250/console" [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [16:17:56] (03CR) 10RLazarus: [V: 03+1 C: 03+2] arclamp: Migrate crons to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [16:18:38] !log dancy@deploy1002 Started deploy [gerrit/gerrit@4f29981]: Gerrit to 3.2.11 on gerrit2001 [16:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:48] !log dancy@deploy1002 Finished deploy [gerrit/gerrit@4f29981]: Gerrit to 3.2.11 on gerrit2001 (duration: 00m 10s) [16:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:36] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1284.eqiad.wmnet [16:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:47] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw128[6-7].eqiad.wmnet [16:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:20] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=logstash2021.codfw.wmnet [16:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:11] !log depooled logstash2021 for dcops maintenance work [16:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:19] !log upgrading gerrit replica on gerrit2001 and restarting [16:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:54] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[16:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:51] PROBLEM - Host logstash2021 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:12] (03PS1) 10Arturo Borrero Gonzalez: cloud: dumps NFS: failback dumps NFS to labstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/705417 (https://phabricator.wikimedia.org/T286600) [16:24:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2021.codfw.wmnet with reason: maintenace [16:24:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2021.codfw.wmnet with reason: maintenace [16:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:51] ACKNOWLEDGEMENT - Host logstash2021 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn maintenance work ongoing [16:26:49] PROBLEM - Host logstash2021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:28:33] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:28:39] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The following units failed: gerrit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:56] volans: should I expect to be able to downtime mgmt hosts by cookbook? 
or just servers [16:29:45] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [16:29:50] (currently it says "No hosts provided" when I give it a mgmt name) [16:30:28] I am ignoring the gerrit Icinga alerts because I happen to have read a notice from Hashar that they are working on it. (scheduled downtime would be nice) [16:30:29] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:30:35] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:47] mutante: I should have acked it for maintenance sorry [16:31:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2001.wikimedia.org with reason: maintenance [16:31:37] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2001.wikimedia.org with reason: maintenance [16:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:42] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [16:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1001.wikimedia.org with reason: maintenance [16:31:55] !log dzahn@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1001.wikimedia.org with reason: maintenance [16:31:56] hashar: np, done! [16:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:59] for an hour [16:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:35] (03CR) 10Bstorm: "My first reaction is that this is a great idea! I'll give it a review today." [puppet] - 10https://gerrit.wikimedia.org/r/705184 (owner: 10Majavah) [16:33:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1284.eqiad.wmnet [16:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:55] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:34:58] mutante: it's possible in spicerack, I don't recall if the cookbook was updated for that or not, checking [16:35:23] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation={listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:35:48] volans: it seems to not recognize hosts with "mgmt". 
just reporting though [16:37:11] (or maybe it could be an option --with-mgmt to include that with the main server just thinking out loud) [16:38:52] RECOVERY - Host logstash2021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms [16:38:56] (03PS9) 10Juan90264: Adding square logo and wordmark for Wikimania [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) [16:39:15] (03PS8) 10Juan90264: Adding and use square wordmark for trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) [16:39:30] (03PS4) 10Juan90264: Adding and use wordmark in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) [16:39:38] (03PS4) 10Juan90264: Adding square wordmark for ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704171 (https://phabricator.wikimedia.org/T281591) [16:39:43] (03PS5) 10Juan90264: Use the ptwikinews wordmark in new vector and mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) [16:39:54] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[16:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:04] !log update asw-a2-codfw serial number - T286787 [16:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:11] T286787: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 [16:40:12] (03CR) 10jerkins-bot: [V: 04-1] Adding and use square wordmark for trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [16:40:14] !log dancy@deploy1002 Started deploy [gerrit/gerrit@4f29981]: Gerrit to 3.2.11 on gerrit1001 [16:40:16] !log Upgrading gerrit1001 with dancy & brennen [16:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:22] !log dancy@deploy1002 Finished deploy [gerrit/gerrit@4f29981]: Gerrit to 3.2.11 on gerrit1001 (duration: 00m 08s) [16:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:55] (03CR) 10jerkins-bot: [V: 04-1] Use the ptwikinews wordmark in new vector and mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [16:45:15] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:46:18] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1284.eqiad.wmnet [16:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:26] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1284.eqiad.wmnet` - 
m... [16:46:59] mutante: patch incoming [16:47:05] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:50] volans: :)) [16:47:53] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:55] PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:59] there really was no rush, but cool:) [16:48:25] ^ guessing those are gerrit upgrade fallout [16:48:31] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:49] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:58] getting some “155 of 455 shards failed” errors in logstash when searching, is my search bad or is something going on in logstash? 
[16:50:20] (searching “url:/wiki/Special:EntityData/” in wikidata) [16:50:37] (s/in wikidata/with server.keyword:www.wikidata.org/ – it’s in logstash ^^) [16:50:49] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:51:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1286-1287].eqiad.wmnet [16:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:28] thcipriani: yeah, having some gerrit upgrade issues [16:51:32] ok, url:"/wiki/Special:EntityData/*" produces no errors so I guess it was my fault for making bad searches [16:51:47] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=logstash2021.codfw.wmnet [16:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:43] Lucas_WMDE: one logstash server was under short maintenance and is now pooled again. 
But it would be a surprise because that was depooled [16:52:58] RECOVERY - Host logstash2021 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms [16:53:18] mutante: I can still reproduce the message by removing the quotation marks, so it’s probably the search string [16:53:30] Lucas_WMDE: ok, ACK [16:54:06] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:54:56] !log gerrit up and running with manual configuration edit to use ipv4 address [16:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:13] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:22] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:01] brennen: the new version would not work with v6? [16:56:32] (03PS1) 10Jbond: P:gerrit: Add logoutd script for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) [16:57:01] 10SRE, 10observability, 10User-MoritzMuehlenhoff, 10Wikimedia-Incident: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10herron) >>! In T253810#6185337, @fgiunchedi wrote: >I've PoC this with check_ipmi_sensor which supports checking SEL > ... >The downside of this approach is... [16:57:29] 10SRE, 10LDAP-Access-Requests: LDAP Access Request for WMDE Employee - Elena Aleynikova - https://phabricator.wikimedia.org/T286776 (10RLazarus) a:03KFrancis Hi @KFrancis -- could you please set Elena up with an NDA, then assign back to me? Thank you! 
[16:58:10] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:15] (03PS2) 10Jbond: P:gerrit: Add logoutd script for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) [16:59:17] (03CR) 10Volans: [C: 03+1] "reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704875 (owner: 10RLazarus) [17:00:05] ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210719T1700). [17:00:10] Gerrit up and running [17:00:38] there is just some annoyance in the gerrit config that tries to setup sshd TWICE on the ipv6 port 29418 [17:00:50] so we have kept puppet disabled while it is figured out [17:01:40] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:01:47] (03PS1) 10Volans: sre.hosts.downtime: downtime any Icinga host [cookbooks] - 10https://gerrit.wikimedia.org/r/705428 [17:01:49] (03PS1) 10Volans: sre.hosts.downtime: convert format() to f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/705429 [17:01:53] mutante: ^^^ (the first one) [17:02:14] (03CR) 10Jbond: [C: 03+1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/704890 (owner: 10Muehlenhoff) [17:02:22] I'm tempted to rename it sre.icinga.downtime at this point, but I might cause havoc, so I might add a symlink later and slowly convince people to use the other one ;) [17:02:30] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:53] (Emergency syslog message) firing: Emergency syslog message 
- https://alerts.wikimedia.org [17:03:00] RECOVERY - Juniper virtual chassis ports on asw-a-codfw is OK: OK: UP: 28 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [17:03:37] volans: cool, looks useful, just for the record, my case was about adding a downtime rather than removing one, not sure if that is covered as well [17:04:08] !log enable cr1-codfw / et-0/0/0 - T286787 [17:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:14] T286787: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 [17:04:34] will spicerack.remote().query(args.query).hosts find a host like logstash2021.mgmt.codfw.wmnet ? [17:05:00] that's to add a downtime, the remove-downtime cookbook already supports that [17:05:15] no, ofc not, you have to specify --force and it will use whatever you pass [17:05:30] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:05:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30254/console" [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) (owner: 10Jbond) [17:05:40] we might later on add a check that the hosts exists in the icinga status.dat file, we already have a parser for that [17:06:10] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:06:38] volans: gotcha! ok, cool. 
micro nitpick: there is string "Donwtime" on line 69 [17:06:53] whoops [17:07:13] copy paste from the remove-downtime, fixing in both [17:07:53] (Emergency syslog message) resolved: Emergency syslog message - https://alerts.wikimedia.org [17:08:48] (03PS2) 10Volans: sre.hosts.downtime: downtime any Icinga host [cookbooks] - 10https://gerrit.wikimedia.org/r/705428 [17:08:50] (03PS2) 10Volans: sre.hosts.downtime: convert format() to f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/705429 [17:08:52] (03PS1) 10Volans: sre.hosts.remove-downtime: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/705430 [17:09:28] (03CR) 10Dzahn: [C: 03+1] "thanks :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/705428 (owner: 10Volans) [17:09:36] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:10:26] !log enable asw-a2-codfw access ports - T286787 [17:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:33] T286787: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 [17:11:03] RECOVERY - Host ms-be2028 is UP: PING OK - Packet loss = 0%, RTA = 31.54 ms [17:11:03] RECOVERY - Host ms-fe2005 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [17:11:12] RECOVERY - Host ms-be2051 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [17:11:26] RECOVERY - Host ms-be2029 is UP: PING OK - Packet loss = 0%, RTA = 31.90 ms [17:11:41] (03PS1) 10Volans: Revert "acme_chief: Avoid hitting authdns2001" [puppet] - 10https://gerrit.wikimedia.org/r/705446 [17:11:47] (03CR) 10Jbond: [V: 03+1] P:gerrit: Add logoutd script for gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) (owner: 10Jbond) [17:11:51] (03PS1) 10Volans: Revert "sre.dns.netbox: skip authdns2001 
because offline" [cookbooks] - 10https://gerrit.wikimedia.org/r/705447 [17:12:58] RECOVERY - Host kafka-logging2001 is UP: PING OK - Packet loss = 0%, RTA = 31.56 ms [17:13:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[1286-1287].eqiad.wmnet [17:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:09] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1286-1287].eqiad.wmn... [17:13:15] mutante: could you halt a second with the decommissioning? [17:13:20] we're reverting the changes for the authdns2001 [17:13:56] RECOVERY - Host ms-be2044 is UP: PING OK - Packet loss = 0%, RTA = 31.56 ms [17:14:07] !log volans@cumin2002 START - Cookbook sre.dns.netbox [17:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:17] volans: I am done since a few seconds ago [17:14:20] (03CR) 10Volans: [C: 03+2] Revert "acme_chief: Avoid hitting authdns2001" [puppet] - 10https://gerrit.wikimedia.org/r/705446 (owner: 10Volans) [17:14:23] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2002 [17:14:23] won't do anymore today [17:14:28] (03CR) 10RLazarus: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/704875 (owner: 10RLazarus) [17:14:30] (03PS2) 10Volans: Revert "acme_chief: Avoid hitting authdns2001" [puppet] - 10https://gerrit.wikimedia.org/r/705446 [17:14:44] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003 [17:14:45] ack thx [17:15:02] RECOVERY - Host thanos-fe2001 is UP: PING OK - Packet loss = 0%, RTA = 31.51 ms [17:15:34] rzl: lmk when you are done with puppet-merge [17:15:36] volans: okay to merge yours? [17:15:40] or go ahead and include with my changes [17:15:41] aha [17:15:41] yes please [17:15:42] doing [17:15:45] thx [17:15:48] RECOVERY - Thanos compact has disappeared from Prometheus discovery on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [17:15:58] and done [17:16:00] (03CR) 10Volans: [C: 03+2] Revert "sre.dns.netbox: skip authdns2001 because offline" [cookbooks] - 10https://gerrit.wikimedia.org/r/705447 (owner: 10Volans) [17:16:24] RECOVERY - Host elastic2038 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [17:16:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:16:46] (03PS1) 10Hashar: gerrit: listen on IPv4 rather than fqdn [puppet] - 10https://gerrit.wikimedia.org/r/705431 (https://phabricator.wikimedia.org/T278990) [17:16:58] RECOVERY - Host elastic2037 is UP: PING OK - Packet loss = 0%, RTA = 32.68 ms [17:17:02] PROBLEM - Elasticsearch HTTPS for production-search-codfw on elastic2037 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Search [17:17:09] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:16] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/705431 (https://phabricator.wikimedia.org/T278990) (owner: 10Hashar) [17:17:30] PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service,wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:44] RECOVERY - Host ms-be2040 is UP: PING OK - Packet loss = 0%, RTA = 34.11 ms [17:18:08] (03CR) 10Hashar: "That should address the issue we have encountered while restarting Gerrit today. Need to review the puppet compiler output." 
[puppet] - 10https://gerrit.wikimedia.org/r/705431 (https://phabricator.wikimedia.org/T278990) (owner: 10Hashar) [17:18:14] RECOVERY - Elasticsearch HTTPS for production-search-codfw on elastic2037 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 763 days) https://wikitech.wikimedia.org/wiki/Search [17:19:01] (03Merged) 10jenkins-bot: Revert "sre.dns.netbox: skip authdns2001 because offline" [cookbooks] - 10https://gerrit.wikimedia.org/r/705447 (owner: 10Volans) [17:19:18] RECOVERY - Host elastic2055 is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms [17:19:24] RECOVERY - Host authdns2001 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [17:19:42] I am off for dinner, have my phone nearby if needed [17:19:54] !log volans@cumin2002 START - Cookbook sre.dns.netbox [17:19:56] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:59] and later this evening I will check the gerrit config fix ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/705431 ) [17:20:32] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:20:40] RECOVERY - OSPF status on mr1-codfw is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:21:20] !log remove ns1 redirect - T286787 [17:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:27] T286787: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 [17:21:38] 10SRE, 10ops-codfw, 10DC-Ops, 10netops, 10Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10Papaul) switch backup online and Netbox update [17:22:03] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 76.83 
ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [17:23:05] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:35] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@978b674]: (no justification provided) [17:23:37] !log running authdns-update to force-update authdns2001 [17:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:57] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@978b674]: (no justification provided) (duration: 00m 21s) [17:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:21] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@978b674]: (no justification provided) [17:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:42] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@978b674]: (no justification provided) (duration: 00m 21s) [17:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:05] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@978b674]: (no justification provided) [17:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:26] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@978b674]: (no justification provided) (duration: 00m 21s) [17:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:59] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@978b674]: (no justification provided) [17:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:15] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@978b674]: (no justification 
provided) (duration: 00m 16s) [17:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:22] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@978b674]: (no justification provided) [17:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:36] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@978b674]: (no justification provided) (duration: 00m 14s) [17:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:33] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/compiler1001/861/gerrit1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/705431 (https://phabricator.wikimedia.org/T278990) (owner: 10Hashar) [17:27:39] (03CR) 10RLazarus: [C: 03+1] elastic: Fix timer to fire continually [puppet] - 10https://gerrit.wikimedia.org/r/704567 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [17:28:00] (03CR) 10Ryan Kemper: [C: 03+2] elastic: Fix timer to fire continually [puppet] - 10https://gerrit.wikimedia.org/r/704567 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [17:28:08] 10SRE, 10DNS, 10Traffic: Track actions to perform before repooling authdns2001 - https://phabricator.wikimedia.org/T286914 (10Volans) [17:28:29] 10SRE, 10DNS, 10Traffic: Track actions to perform before repooling authdns2001 - https://phabricator.wikimedia.org/T286914 (10Volans) 05Open→03Resolved a:03Volans All done, resolving for now. 
[17:28:55] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2038 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:29:13] (03CR) 10Volans: [C: 03+2] sre.hosts.remove-downtime: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/705430 (owner: 10Volans) [17:30:58] !log running puppet on elastic2038 after network was restored [17:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:05] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:34:27] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:34:33] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.03193 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [17:39:26] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:45] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [17:40:54] (03CR) 10RLazarus: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/704861 (owner: 10RLazarus) [17:41:43] !log [Elastic] Restarted elasticsearch services on `elastic2038`; afterwards restarted prometheus exporters; no units failed any longer [17:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:18] !log [Elastic] Noted `Jul 16 18:31:20 elastic2038 elasticsearch[957]: 2021-07-16 18:31:20,657 main ERROR Unknown GELF server hostname:udp:logstash.svc.eqiad.wmnet` in elasticsearch service logs (unit had been running for 2 days) thus the restart of the elasticsearch service [17:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:48] ryankemper: do you still have a puppet-merge going? no rush, I know you have a lot going on right now [17:45:06] (03CR) 10Hashar: "> [sshd]" [puppet] - 10https://gerrit.wikimedia.org/r/705431 (https://phabricator.wikimedia.org/T278990) (owner: 10Hashar) [17:45:16] rzl: yeah I was waiting to hear back from jbond but it looks like his change is a no-op anyway [17:45:27] rzl: I'll merge both now [17:45:32] ahh got it [17:45:34] thanks! [17:45:46] and yeah my assumption is he's off for the evening [17:46:09] ah of course, totally forgot he was eu :) [17:46:13] rzl: okay merged all 3 of ours [17:46:21] 👍 [17:47:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:49:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:51:47] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@82e5f94]: (no justification provided) [17:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:04] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@82e5f94]: (no justification provided) (duration: 00m 16s) [17:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:24] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@82e5f94]: (no justification provided) [17:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:40] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@82e5f94]: (no justification provided) (duration: 00m 15s) [17:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:03] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@82e5f94]: (no justification provided) [17:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:25] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@82e5f94]: (no justification provided) (duration: 00m 21s) [17:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:35] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@82e5f94]: (no justification provided) [17:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:56] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@82e5f94]: (no justification provided) (duration: 00m 22s) [17:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:14] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@82e5f94]: (no justification provided) [17:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:36] !log mbsantos@deploy1002 
Finished deploy [tilerator/deploy@82e5f94]: (no justification provided) (duration: 00m 22s) [17:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:55] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10cmooney) Basic plan for bringing this online should be: # Add device to Netbox as cloudsw2-c8-eqiad, I guess below existing cloudsw1? # Allocate SCS and MGMT SW ports in Netbox... [17:59:02] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2038 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:59:47] (03PS1) 10Dduvall: pipeline: Perform mergeMessageFileList and rebuildLocalisationCache separately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705437 [18:00:05] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210719T1800) [18:00:05] apergos: A patch you scheduled for Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:37] To Whom It May Concern: I'm here, those two patches need to go in so the dumps are not broken tonight and for the start of tomorrow's run [18:00:53] I can't self-serve because at this very minute my team is in a virtual offsite [18:01:34] I have already tested the patch combination for dumps on a live wiki with the problem being fixed [18:01:55] (03CR) 10Ahmon Dancy: [C: 03+2] pipeline: Perform mergeMessageFileList and rebuildLocalisationCache separately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705437 (owner: 10Dduvall) [18:02:58] (03Merged) 10jenkins-bot: pipeline: Perform mergeMessageFileList and rebuildLocalisationCache separately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705437 (owner: 10Dduvall) [18:03:34] apergos: I can do the backports since I have one going out. [18:04:06] dancy: that would be excellent. I am pingable if anything is needed [18:04:10] ok [18:05:39] (03CR) 10Brennen Bearnes: [C: 03+1] "Looks like the right thing, assuming `ipv4` is the correct value." [puppet] - 10https://gerrit.wikimedia.org/r/705431 (https://phabricator.wikimedia.org/T278990) (owner: 10Hashar) [18:06:50] !log dancy@deploy1002 Synchronized .pipeline: Config: [[gerrit:705437|pipeline: Perform mergeMessageFileList and rebuildLocalisationCache separately]] (duration: 00m 56s) [18:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:00] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:18] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Janina Abrams - https://phabricator.wikimedia.org/T286927 (10RLazarus) a:03RLazarus Hi Janina, welcome to the Foundation! I can get you set up. Thanks for signing L3 -- in... 
[18:07:18] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:07:24] (03PS1) 10Ahmon Dancy: Add sanity check to newRevisionFromRowAndSlots. [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/705448 (https://phabricator.wikimedia.org/T286877) [18:09:14] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:56] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:16] !log Running homer to re-enable asw-a2-codfw xe-2/0/45 port [lvs2007] [18:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:27] nice. thanks, dancy [18:15:34] (03CR) 10Ahmon Dancy: [C: 03+1] gerrit: listen on IPv4 rather than fqdn [puppet] - 10https://gerrit.wikimedia.org/r/705431 (https://phabricator.wikimedia.org/T278990) (owner: 10Hashar) [18:15:40] (03CR) 10Eevans: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/702452 (https://phabricator.wikimedia.org/T285899) (owner: 10Eevans) [18:17:10] RECOVERY - Host lvs2007 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [18:18:42] PROBLEM - PyBal connections to etcd on lvs2007 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [18:18:42] PROBLEM - pybal on lvs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:18:43] PROBLEM - PyBal backends health check on lvs2007 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal 
[18:19:03] ^^ that's expected [18:19:31] !log T264053 Deploying fix for timer issue: `ryankemper@cumin1001:~$ sudo cumin -b 36 'P{elastic*}' 'sudo systemctl stop elasticsearch-disable-readahead.timer && sudo systemctl disable elasticsearch-disable-readahead.timer && rm -fv /etc/systemd/system/elasticsearch-disable-readahead.timer && rm -fv /usr/lib/systemd/system/elasticsearch-disable-readahead.timer && sudo run-puppet-agent'` [18:19:31] (03PS1) 10Dzahn: site/conftool/DHCP: decom mw1284,mw1286,mw1287 [puppet] - 10https://gerrit.wikimedia.org/r/705439 (https://phabricator.wikimedia.org/T280203) [18:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:37] T264053: Unsustainable increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053 [18:20:29] !log enabling pybal on lvs2007 - T286921 [18:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:36] RECOVERY - pybal on lvs2007 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:20:36] RECOVERY - PyBal backends health check on lvs2007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:20:37] T286921: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 [18:20:56] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:20:59] (03CR) 10Ahmon Dancy: [C: 03+1] "PCC results look good:" [puppet] - 10https://gerrit.wikimedia.org/r/705431 (https://phabricator.wikimedia.org/T278990) (owner: 10Hashar) [18:21:30] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 72, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:21:42] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 99, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:22] 10SRE, 10Traffic: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (10Vgutierrez) [18:22:32] RECOVERY - PyBal connections to etcd on lvs2007 is OK: OK: 12 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [18:22:57] !log disable puppet & stop pybal on lvs2010 - T286921 [18:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:25] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/30256/" [puppet] - 10https://gerrit.wikimedia.org/r/705439 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [18:24:06] (03CR) 10Thcipriani: [C: 03+1] gerrit: listen on IPv4 rather than fqdn [puppet] - 10https://gerrit.wikimedia.org/r/705431 (https://phabricator.wikimedia.org/T278990) (owner: 10Hashar) [18:25:22] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Janina Abrams - https://phabricator.wikimedia.org/T286927 (10RLazarus) Oops, one more thing: @Ottomata can you approve for analytics-privatedata-users please? I'll include K... 
[18:27:07] !log T264053 Deploying fix for timer issue on cloudelastic: `ryankemper@cumin1001:~$ sudo cumin -b 6 'P{cloudelastic*}' 'sudo systemctl stop elasticsearch-disable-readahead.timer && sudo systemctl disable elasticsearch-disable-readahead.timer && rm -fv /etc/systemd/system/elasticsearch-disable-readahead.timer && rm -fv /usr/lib/systemd/system/elasticsearch-disable-readahead.timer && sudo run-puppet-agent'` [18:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:12] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:27:13] T264053: Unsustainable increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053 [18:27:25] ^^ that's me stopping pybal on lvs2010 [18:27:26] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:27:35] !log Running homer to re-enable port xe-2/0/44 on asw2-a2-codfw (lvs2010) [18:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:43] !log T264053 Deploying fix for timer issue on relforge: `ryankemper@cumin1001:~$ sudo cumin -b 2 'P{relforge*}' 'sudo systemctl stop elasticsearch-disable-readahead.timer && sudo systemctl disable elasticsearch-disable-readahead.timer && rm -fv /etc/systemd/system/elasticsearch-disable-readahead.timer && rm -fv /usr/lib/systemd/system/elasticsearch-disable-readahead.timer && sudo run-puppet-agent'` [18:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:10] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:32] PROBLEM - pybal on lvs2010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args 
/usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:29:16] (03CR) 10Dzahn: [C: 03+2] gerrit: listen on IPv4 rather than fqdn [puppet] - 10https://gerrit.wikimedia.org/r/705431 (https://phabricator.wikimedia.org/T278990) (owner: 10Hashar) [18:29:39] 10SRE, 10Traffic: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (10Vgutierrez) [18:29:40] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [18:30:06] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:05] (03PS1) 10Andrew Bogott: Move wikireplica service off of clouddb1019/1020 [puppet] - 10https://gerrit.wikimedia.org/r/705441 (https://phabricator.wikimedia.org/T286598) [18:31:24] ryankemper: can I merge your change [18:31:28] rzl: can I merge your change [18:31:44] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:31:53] mutante: oh, I thought it was done, yes please [18:32:09] thanks, I was just puzzling out why I wasn't seeing it in prod :D [18:32:13] ryankemper: please go ahead and do the "multiple", you got the lock now :) [18:32:19] rzl: ack, thanks [18:32:38] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:32:53] (03CR) 10Ahmon Dancy: [C: 03+2] Add sanity check to newRevisionFromRowAndSlots. 
[core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/705448 (https://phabricator.wikimedia.org/T286877) (owner: 10Ahmon Dancy) [18:34:56] RECOVERY - configured eth on lvs2010 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:35:34] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:35:46] !log running puppet and restarting pybal on lvs2010 - T286921 [18:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:53] T286921: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 [18:36:46] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 72, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:36:52] (03CR) 10Bstorm: "You shouldn't need to fail over all sections. You just need the sections that 19 and 20 use." 
[puppet] - 10https://gerrit.wikimedia.org/r/705441 (https://phabricator.wikimedia.org/T286598) (owner: 10Andrew Bogott) [18:37:00] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 99, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:37:04] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:38:06] RECOVERY - pybal on lvs2010 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:38:54] 10SRE, 10Traffic: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (10Vgutierrez) [18:38:56] !log re-enabling puppet on gerrit1001] [18:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:24] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 78 connections established with conf2004.codfw.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [18:39:26] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:40:24] !log stop pybal on lvs2009 - T286921 [18:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:03] (03CR) 10Andrew Bogott: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/705441 (https://phabricator.wikimedia.org/T286598) (owner: 10Andrew Bogott) [18:41:56] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:43:08] ACKNOWLEDGEMENT - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Host not pooled. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:44:03] dancy: when were my patches to go? 
[18:44:19] they need to go soon-ish so that this evening's adds changes run is not broken [18:44:32] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation={list,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:44:46] Still working on it. https://gerrit.wikimedia.org/r/c/mediawiki/core/+/705448 is the first cherry pick. Waiting for gate-and-submit to complete. [18:44:54] PROBLEM - pybal on lvs2009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:44:58] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:45:02] ok whew (and thanks) [18:45:03] and gate-and-submit failed for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/705346 [18:45:09] er? [18:45:10] I'll run a re-check [18:45:20] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:46:02] 21:16:54 548 | ERROR | [x] Whitespace found at end of line 21:16:54 | | (Squiz.WhiteSpace.SuperfluousWhitespace.EndLine) [18:46:03] ffs [18:46:07] haha [18:46:14] sorry.. whitespace checkers.. [18:46:18] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:46:19] can you just shove it through anyways? 
[18:46:28] ok [18:46:30] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:46:31] !log gerrit1001: restarting gerrit [18:46:33] ^^ that's me stopping pybal on lvs2009 [18:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:42] I can leave that tab open and submit a whitespace frop later (tomorrow :-P) [18:46:51] !log Running homer to re-enable port xe-2/0/43 on asw2-a2-codfw (lvs2009) - T286921 [18:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:58] T286921: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 [18:47:05] actually, if I merge the change, will it cause test failures for other changes? [18:47:06] apergos: won't that break CI [18:47:10] ^ [18:47:14] Which will be unbreak anyway [18:47:16] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:47:25] I don't know [18:47:30] It probably will. [18:47:32] apergos: I'd say yes [18:47:34] it's already been +2 [18:47:43] so what's the right procedure? [18:47:54] Fix it [18:47:57] Hmm. yeah.. how did it get through the first layers of testing and only get discovered during gate-and-submit.. weird. [18:47:58] I mean it's a separate commit and we can't rewrite history [18:48:29] I can send a whitespace patch immediately and that can.. 
wait 40 minutes to clear jenkins and etc [18:48:42] whatever you think is best, and sorry to have put you in this position [18:48:49] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=58) https://wikitech.wikimedia.org/wiki/PyBal [18:48:54] 10SRE, 10Traffic: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (10Vgutierrez) [18:49:05] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/705441 (https://phabricator.wikimedia.org/T286598) (owner: 10Andrew Bogott) [18:49:08] apergos: How much time do we have until the next run of your programs? [18:49:26] they will start at [18:49:56] in 2 hours [18:50:23] ok, I think we can get this cleaned up by then. [18:50:38] ok, let me know what I should be doing here since I created this mess [18:52:01] (03CR) 10jerkins-bot: [V: 04-1] Add sanity check to newRevisionFromRowAndSlots. [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/705448 (https://phabricator.wikimedia.org/T286877) (owner: 10Ahmon Dancy) [18:52:40] RECOVERY - configured eth on lvs2009 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:53:12] !log running puppet and restarting pybal on lvs2009 - T286921 [18:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:18] T286921: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 [18:53:22] Cripes. gate-and-submit failed for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/705448 too (different reason). 
[18:53:29] (03CR) 10Ahmon Dancy: [C: 03+2] "recheck" [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/705448 (https://phabricator.wikimedia.org/T286877) (owner: 10Ahmon Dancy) [18:53:54] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 72, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:54:03] that one was on jenkins I'm pretty sure [18:54:06] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 99, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:54:24] RECOVERY - pybal on lvs2009 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:54:39] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 58 connections established with conf2004.codfw.wmnet:4001 (min=58) https://wikitech.wikimedia.org/wiki/PyBal [18:55:03] ACKNOWLEDGEMENT - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Host not pooled. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:55:22] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:55:37] 10SRE, 10Traffic, 10WikimediaDebug, 10Performance-Team (Radar): Allow ATS to route traffic to mwdebug deployment on kubernetes - https://phabricator.wikimedia.org/T286482 (10dpifke) The debug extension now fetches the list of backends from noc.wikimedia.org, so this hopefully shouldn't require any changes... 
[18:55:43] ACKNOWLEDGEMENT - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan host not pooled https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:56:22] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:56:28] (03PS2) 10Andrew Bogott: Move wikireplica service off of clouddb1019/1020 [puppet] - 10https://gerrit.wikimedia.org/r/705441 (https://phabricator.wikimedia.org/T286598) [18:56:49] 10SRE, 10Traffic: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (10Vgutierrez) [18:56:55] (03CR) 10jerkins-bot: [V: 04-1] Move wikireplica service off of clouddb1019/1020 [puppet] - 10https://gerrit.wikimedia.org/r/705441 (https://phabricator.wikimedia.org/T286598) (owner: 10Andrew Bogott) [18:56:58] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:57:48] (03PS1) 10Vgutierrez: Revert "admin_state: Depool codfw text" [dns] - 10https://gerrit.wikimedia.org/r/705466 (https://phabricator.wikimedia.org/T286921) [18:59:05] (03PS3) 10Andrew Bogott: Move wikireplica service off of clouddb1019/1020 [puppet] - 10https://gerrit.wikimedia.org/r/705441 (https://phabricator.wikimedia.org/T286598) [19:08:01] 10ops-eqiad, 10decommission-hardware: decommission payments1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286942 (10Jgreen) [19:09:23] 10ops-eqiad, 10decommission-hardware: decommission payments1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286943 (10Jgreen) [19:10:18] 10ops-eqiad, 10decommission-hardware: decommission payments1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286944 (10Jgreen) [19:12:25] (03Merged) 10jenkins-bot: Add sanity check to 
newRevisionFromRowAndSlots. [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/705448 (https://phabricator.wikimedia.org/T286877) (owner: 10Ahmon Dancy) [19:14:03] !log dancy@deploy1002 Synchronized php-1.37.0-wmf.14/includes/Revision/RevisionStore.php: Backport: [[gerrit:705448|Add sanity check to newRevisionFromRowAndSlots. (T286877)]] (duration: 00m 57s) [19:14:07] apergos: One down, one to go! [19:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:10] T286877: newRevisionSlots argument check breaks stub dumps for svwiki - https://phabricator.wikimedia.org/T286877 [19:14:23] \o/ [19:15:17] 10ops-eqiad, 10decommission-hardware: decommission payments1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286942 (10Jgreen) [19:15:19] 10ops-eqiad, 10decommission-hardware: decommission payments1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286943 (10Jgreen) [19:15:21] 10ops-eqiad, 10decommission-hardware: decommission payments1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286944 (10Jgreen) [19:19:07] (03PS1) 10Jgreen: Remove payments100[1-4].frack.eqiad.wmnet A/PTR records [dns] - 10https://gerrit.wikimedia.org/r/705486 (https://phabricator.wikimedia.org/T286044) [19:20:15] (03CR) 10Jgreen: [C: 03+2] Remove payments100[1-4].frack.eqiad.wmnet A/PTR records [dns] - 10https://gerrit.wikimedia.org/r/705486 (https://phabricator.wikimedia.org/T286044) (owner: 10Jgreen) [19:21:33] !log authdns-update to remove payments100[1-4].frack.eqiad.wmnet [19:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:00] 10ops-eqiad, 10decommission-hardware: decommission payments1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286945 (10Zabe) [19:29:07] 10SRE, 10LDAP-Access-Requests: LDAP Access Request for WMDE Employee - Elena Aleynikova - https://phabricator.wikimedia.org/T286776 (10KFrancis) @RLazarus Sure, no problem. 
Would you please send me Elena's email address? [19:32:48] 10SRE, 10LDAP-Access-Requests: LDAP Access Request for WMDE Employee - Elena Aleynikova - https://phabricator.wikimedia.org/T286776 (10RLazarus) (Just sent it privately.) [19:32:59] (03PS4) 10Andrew Bogott: Move wikireplica service off of clouddb1019/1020 [puppet] - 10https://gerrit.wikimedia.org/r/705441 (https://phabricator.wikimedia.org/T286598) [19:34:36] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:39:47] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10dduvall) >>! In T285232#7215756, @dduvall wrote: > Helm supports hooks. What if we define pre-install hook and a k... [19:42:50] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:46:00] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:49:13] (03PS1) 10Ahmon Dancy: prevent PageIdentity checks in RevisionStore from breaking xml dumps [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/705467 (https://phabricator.wikimedia.org/T286877) [19:49:23] (03CR) 10Ahmon Dancy: [C: 03+2] prevent PageIdentity checks in RevisionStore from breaking xml dumps [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/705467 (https://phabricator.wikimedia.org/T286877) (owner: 10Ahmon Dancy) [19:50:12] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS 
changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:00:05] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210719T2000). [20:06:14] (03Merged) 10jenkins-bot: prevent PageIdentity checks in RevisionStore from breaking xml dumps [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/705467 (https://phabricator.wikimedia.org/T286877) (owner: 10Ahmon Dancy) [20:08:31] !log dancy@deploy1002 Synchronized php-1.37.0-wmf.14/includes/export/WikiExporter.php: Backport: [[gerrit:705467|prevent PageIdentity checks in RevisionStore from breaking xml dumps (T286877)]] (duration: 00m 58s) [20:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:39] T286877: newRevisionSlots argument check breaks stub dumps for svwiki - https://phabricator.wikimedia.org/T286877 [20:08:57] apergos: Both of your commits have been deployed [20:09:15] thank you!! how did you get around the whitespace issue? [20:09:36] I uploaded a patchset w/ the trailing whitespace removed. [20:10:16] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/705346/5..6/includes/export/WikiExporter.php [20:10:32] I also reordered the 'use' clause to remove another non-fatal warning. 
[20:13:37] oh the use clause, grrrr [20:13:40] thanks for all that [20:14:52] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 108 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:18:26] !log volans@cumin2002 START - Cookbook sre.dns.netbox [20:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:20] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:23:09] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:32] (03PS2) 10Vgutierrez: Revert "admin_state: Depool codfw text" [dns] - 10https://gerrit.wikimedia.org/r/705466 (https://phabricator.wikimedia.org/T286921) [20:26:19] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10RobH) [20:26:39] (03PS1) 10Zabe: Add patroller group for ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705498 (https://phabricator.wikimedia.org/T285221) [20:27:49] (03CR) 10Vgutierrez: [C: 03+2] Revert "admin_state: Depool codfw text" [dns] - 10https://gerrit.wikimedia.org/r/705466 (https://phabricator.wikimedia.org/T286921) (owner: 10Vgutierrez) [20:28:15] (03PS2) 10Zabe: Add patroller group for ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705498 (https://phabricator.wikimedia.org/T285221) [20:29:27] !log pool text@codfw - T286921 [20:29:29] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 44 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:34] T286921: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 [20:29:47] (03PS1) 10Hashar: [WMF] its-phabricator: Urlencode POST to conduit [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/705499 (https://phabricator.wikimedia.org/T280197) [20:30:10] 10SRE, 10Traffic, 10Patch-For-Review: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (10Vgutierrez) [20:30:42] 10SRE, 10Traffic, 10Patch-For-Review: Actions to restore lvs2009/lvs2010 network configuration - https://phabricator.wikimedia.org/T286921 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [20:31:20] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:34:44] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:36:33] (03CR) 10Bstorm: [C: 03+1] Move wikireplica service off of clouddb1019/1020 [puppet] - 10https://gerrit.wikimedia.org/r/705441 (https://phabricator.wikimedia.org/T286598) (owner: 10Andrew Bogott) [20:37:12] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:37:54] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:38:44] (03CR) 
10Andrew Bogott: [C: 03+2] Move wikireplica service off of clouddb1019/1020 [puppet] - 10https://gerrit.wikimedia.org/r/705441 (https://phabricator.wikimedia.org/T286598) (owner: 10Andrew Bogott) [20:40:04] (03PS1) 10RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) [20:42:41] jouncebot: now [20:42:42] For the next 0 hour(s) and 17 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210719T2000) [20:42:53] anyone mind shipping me a quick security patch? [20:42:57] (03PS1) 10Andrew Bogott: Revert "Move wikireplica service off of clouddb1019/1020" [puppet] - 10https://gerrit.wikimedia.org/r/705501 (https://phabricator.wikimedia.org/T286598) [20:45:48] apparently not, going ahead [20:45:51] (03CR) 10jerkins-bot: [V: 04-1] icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [20:48:11] !log Deploy security patch for T286884 [20:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:33] (03CR) 10Hashar: "The update is a single commit from upstream :]" [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/705499 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [20:49:40] * urbanecm done [20:49:57] (03CR) 10Andrew Bogott: [C: 03+1] "yes please!" 
[puppet] - 10https://gerrit.wikimedia.org/r/704638 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [20:50:08] (03PS1) 10Hashar: gerrit: remove escapeUri [puppet] - 10https://gerrit.wikimedia.org/r/705503 (https://phabricator.wikimedia.org/T280197) [20:50:29] (03CR) 10Bstorm: [C: 03+2] cloud galera: have haproxy shut down sessions when marked [puppet] - 10https://gerrit.wikimedia.org/r/704638 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [20:51:18] (03CR) 10Hashar: [C: 04-1] "We need to deploy the its-phabricator plugin at the same time. It is pending on https://gerrit.wikimedia.org/r/c/operations/software/gerri" [puppet] - 10https://gerrit.wikimedia.org/r/705503 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [20:55:59] (03PS2) 10RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) [20:56:28] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:57:26] (03PS1) 10Bstorm: Revert "cloud galera: have haproxy shut down sessions when marked" [puppet] - 10https://gerrit.wikimedia.org/r/705468 [20:58:00] PROBLEM - glance-api http on cloudcontrol1003 is CRITICAL: connect to address 208.80.154.23 and port 9292: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:59:02] (03CR) 10Bstorm: [C: 03+2] Revert "cloud galera: have haproxy shut down sessions when marked" [puppet] - 10https://gerrit.wikimedia.org/r/705468 (owner: 10Bstorm) [20:59:16] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:59:55] (03CR) 10Cwhite: [C: 
03+1] "Looks good!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/699260 (owner: 10Herron) [21:00:05] Reedy and sbassett: That opportune time is upon us again. Time for a Weekly Security deployment window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210719T2100). [21:00:22] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:00:42] (03CR) 10Cwhite: [C: 03+1] hieradata: add o11y services to service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/705343 (owner: 10Filippo Giunchedi) [21:01:10] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:44] RECOVERY - glance-api http on cloudcontrol1003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 1394 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:02:12] (03CR) 10jerkins-bot: [V: 04-1] icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [21:02:16] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:02:49] jerkins >:( [21:04:04] (03PS3) 10Cwhite: logstash: normalize_level should modify only relevant parts of log obj [puppet] - 10https://gerrit.wikimedia.org/r/705018 [21:06:38] (03PS3) 10RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) 
[21:07:26] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:12:00] (03CR) 10jerkins-bot: [V: 04-1] icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [21:12:47] (03PS4) 10RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) [21:15:21] (03PS1) 10Zabe: Avoid using User::newFrom* methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705505 [21:18:29] (03CR) 10jerkins-bot: [V: 04-1] icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [21:19:04] (03PS5) 10RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) [21:21:09] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:22:18] (03PS2) 10Zabe: Avoid using User::newFrom* methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705505 [21:24:49] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:26:54] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:28:24] (03PS1) 10Bstorm: cloud galera: have haproxy shut down sessions when marked [puppet] - 10https://gerrit.wikimedia.org/r/705507 (https://phabricator.wikimedia.org/T286675) [21:29:57] (03CR) 10Bstorm: "This version has the config in the correct place. This is a server option, not a listen block option." [puppet] - 10https://gerrit.wikimedia.org/r/705507 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [21:32:16] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:33:50] (03PS4) 10Cwhite: logstash: normalize_level should modify only relevant parts of log obj [puppet] - 10https://gerrit.wikimedia.org/r/705018 [21:35:28] (03PS1) 10Ssingh: auditd: initial commit for the auditd module. [puppet] - 10https://gerrit.wikimedia.org/r/705508 [21:36:32] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:38:16] (03CR) 10Cwhite: [C: 03+2] logstash: normalize_level should modify only relevant parts of log obj [puppet] - 10https://gerrit.wikimedia.org/r/705018 (owner: 10Cwhite) [21:43:30] (03CR) 10Andrew Bogott: [C: 03+1] cloud galera: have haproxy shut down sessions when marked [puppet] - 10https://gerrit.wikimedia.org/r/705507 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [21:44:43] (03CR) 10Bstorm: [C: 03+2] cloud galera: have haproxy shut down sessions when marked [puppet] - 10https://gerrit.wikimedia.org/r/705507 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [21:45:32] (03Abandoned) 10Ssingh: auditd: initial commit for the auditd module. 
[puppet] - 10https://gerrit.wikimedia.org/r/705508 (owner: 10Ssingh) [21:48:48] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:52:48] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation={listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:54:18] (03Abandoned) 10Bstorm: maintain-dbusers: rely on the UIDS, not username for all accounts [puppet] - 10https://gerrit.wikimedia.org/r/674151 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [21:55:16] (03CR) 10Bstorm: [C: 03+2] puppetmaster: Collect prometheus metrics about git-sync-upstream [puppet] - 10https://gerrit.wikimedia.org/r/705184 (owner: 10Majavah) [21:55:28] (03CR) 10Bstorm: [C: 03+2] "Looks good after review. Merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/705184 (owner: 10Majavah) [22:01:54] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [22:03:02] 10SRE, 10ops-codfw, 10DC-Ops, 10netops, 10Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10Papaul) Case Number:2021-0719-0629 create with Juniper [22:03:47] 10SRE, 10ops-codfw: mgmt on logstash2021 inaccessible - https://phabricator.wikimedia.org/T286274 (10Papaul) 05Open→03Resolved Reset the IDRAC system back online [22:07:24] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [22:14:16] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [22:18:24] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [22:25:57] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10razzi) Yes please invite me to a meeting @elukey! Thanks for keeping things moving on this... 
[22:29:52] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [22:34:42] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation={listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [22:42:16] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 110 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:45:52] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 45 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:52:36] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [22:55:00] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [23:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210719T2300). 
[23:00:05] zabe: A patch you scheduled for Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:15] o/ [23:11:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr Moving over to @Jclark-ctr to check the network on pc1014. Thanks, Willy [23:12:19] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission payments1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286945 (10wiki_willy) a:03Cmjohnson [23:12:20] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [23:13:03] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission payments1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286944 (10wiki_willy) a:03Cmjohnson [23:13:23] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission payments1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286943 (10wiki_willy) a:03Cmjohnson [23:13:36] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission payments1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286942 (10wiki_willy) a:03Cmjohnson [23:15:23] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10Jclark-ctr) switches racked and partialy cabled. Waiting on 40g dac cables Vendor did not ship cables they where waiting on confirmation on tax status https://phabricator.... 
[23:28:44] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Bstorm) [23:35:42] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation={listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [23:37:14] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [23:52:27] 10SRE, 10ops-codfw, 10DC-Ops, 10netops, 10Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10Papaul) Dear Juniper Networks Customer, A Return to Factory (RTF) RMA has been created. Details of which are provided below. ***** RMA DETAILS ***** RMA Number: R200361...