[00:01:21] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:16:01] (03PS1) 10Cwhite: beta-logs: use output meta parameter to gate outputs [puppet] - 10https://gerrit.wikimedia.org/r/776023 (https://phabricator.wikimedia.org/T305088) [00:16:03] (03PS1) 10Cwhite: beta-logs: use bucket meta parameter to define curation buckets [puppet] - 10https://gerrit.wikimedia.org/r/776024 (https://phabricator.wikimedia.org/T305013) [00:19:55] (03PS1) 10Cwhite: logstash: add $schema field to w3creportingapi tests [puppet] - 10https://gerrit.wikimedia.org/r/776025 [00:45:49] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:40:22] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:22] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:34] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [01:47:40] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:05:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:11:05] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:46:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:09:37] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:11:45] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:12:15] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:12:17] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:14:25] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:49:27] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:45:45] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:05:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:33:55] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:34:29] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:38:23] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:40:29] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:44:25] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:44:53] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:46:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:46:31] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:46:47] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:48:57] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:51:01] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:51:03] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:52:49] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:53:21] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:54:09] !log traffic engineering in drmrs to prevent link saturation [06:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:37] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220401T0700) [07:02:05] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:04:03] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:08:47] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:10:27] jelto: looks like prometheus can't fetch metrics from some gitlab hosts (see JobUnavailable), expected ? [07:15:17] (03PS1) 10Ayounsi: drmrs: offload traffic from Telia transit [homer/public] - 10https://gerrit.wikimedia.org/r/776157 [07:15:31] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:17:43] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:23:21] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10fgiunchedi) >>! In T299462#7821068, @MatthewVernon wrote: > I note that some of the setup checklist tasks for these hosts haven't been done, maybe that's it?... [07:24:27] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:26:41] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:27:27] (03CR) 10Ayounsi: [C: 03+2] "Already pushed." 
[homer/public] - 10https://gerrit.wikimedia.org/r/776157 (owner: 10Ayounsi) [07:28:22] (03Merged) 10jenkins-bot: drmrs: offload traffic from Telia transit [homer/public] - 10https://gerrit.wikimedia.org/r/776157 (owner: 10Ayounsi) [07:28:25] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:30:39] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:32:33] 10SRE, 10Analytics, 10Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10fgiunchedi) (my two cents) agreed option 1. seems preferable, and to clarify my position on T276972: I'm not against it per-se, I am questioning the "dc... [07:37:55] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:44:39] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:51:25] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:52:14] 10SRE, 10Generated Data Platform, 10serviceops, 10Patch-For-Review, 10Service-deployment-requests: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (10JMeybohm) [07:53:39] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:55:37] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:56:01] 10SRE, 10Generated Data Platform, 10serviceops, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10JMeybohm) >>! In T304891#7823127, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#... [07:57:51] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:01:24] 10SRE-OnFire (FY2021/2022-Q2): 2021-10-29 graphite - https://phabricator.wikimedia.org/T295157 (10fgiunchedi) >>! In T295157#7822191, @jcrespo wrote: > Scored, up for a review @fgiunchedi in a week? Context: https://wikitech.wikimedia.org/wiki/Incident_review_ritual Sure SGTM, I'll take part in the ritual on Ap... [08:02:05] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:03:55] 10SRE, 10Generated Data Platform, 10serviceops, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10Joe) After discussion in the meeting yesterday, we concluded that: * We will create a generic... [08:04:19] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:04:45] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:09:13] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:09:21] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:10:51] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [software/swift-ring] - 10https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [08:10:55] (03PS2) 10JMeybohm: Add correct tlsHostnames and extra SAN to datahub cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/773256 (https://phabricator.wikimedia.org/T303049) [08:11:35] (03CR) 10Filippo Giunchedi: "Actually taking +1 back, see inline" [software/swift-ring] - 10https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [08:13:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/775296 (owner: 10Muehlenhoff) [08:13:41] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:14:38] (03PS4) 10JMeybohm: Add *.k8s-staging.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740) [08:15:29] I downtimed the Singtel related BFD/OSPF alerts until tomorrow (announced ETR) [08:16:42] (03CR) 10JMeybohm: [C: 03+2] Add *.k8s-staging.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [08:20:33] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:28:42] (03PS2) 10MVernon: Makefile, docs: Note this is now obsolete [software/swift-ring] - 10https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117) [08:29:34] (03CR) 10MVernon: "Hi," [software/swift-ring] - 10https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [08:31:45] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
Thanks for the fix" [software/swift-ring] - 10https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [08:39:21] (03CR) 10MVernon: [V: 03+2 C: 03+2] Makefile, docs: Note this is now obsolete [software/swift-ring] - 10https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [08:42:35] are beta-only config changes allowed on Friday? (I’d still pull and sync them in production, but it would be a -labs.php file) [08:42:37] !log rolling restart of ncredir instances to catch up on kernel upgrades [08:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:51] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-cluster [08:44:51] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99) [08:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:13] lovely Friday :) [08:48:53] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir1001.eqiad.wmnet [08:48:54] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ncredir1001.eqiad.wmnet [08:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:33] sigh... 
today is not my day [08:49:46] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir1001.eqiad.wmnet [08:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:14] (03PS1) 10JMeybohm: Use *.k8s-staging.discovery.wmnet for staging certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/776162 (https://phabricator.wikimedia.org/T300740) [08:52:16] (03PS1) 10JMeybohm: Use *.k8s-staging.discovery.wmnet for staging Ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/776163 (https://phabricator.wikimedia.org/T300740) [08:53:52] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir1001.eqiad.wmnet [08:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:56] (03PS2) 10JMeybohm: Use *.k8s-staging.discovery.wmnet for staging Ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/776163 (https://phabricator.wikimedia.org/T300740) [08:54:29] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir1002.eqiad.wmnet [08:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:35] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir1002.eqiad.wmnet [08:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:03] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir2001.codfw.wmnet [08:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:13] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ncredir2001.codfw.wmnet [09:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:50] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir2002.codfw.wmnet [09:10:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:33] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir2002.codfw.wmnet [09:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:30] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir3001.esams.wmnet [09:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:10] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir3001.esams.wmnet [09:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:27] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir3002.esams.wmnet [09:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:34] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10Tchanders) >>! In T303398#7821893, @herron wrote: >>>! In T303398#7796432, @jbond wrote: >> Change to stalled until TsepoThoabala return > > When is Tsep... [09:31:00] (03PS1) 10Daniel Kinzler: Always set MW_USE_CONFIG_SCHEMA. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776164 (https://phabricator.wikimedia.org/T305176) [09:32:10] (03CR) 10Kormat: [C: 04-1] auto_schema: Wrap starting replication with finally (031 comment) [software] - 10https://gerrit.wikimedia.org/r/775895 (owner: 10Ladsgroup) [09:34:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Always set MW_USE_CONFIG_SCHEMA. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/776164 (https://phabricator.wikimedia.org/T305176) (owner: 10Daniel Kinzler) [09:35:45] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ncredir3002.esams.wmnet [09:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:19] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir4001.ulsfo.wmnet [09:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:02] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) [09:43:18] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir4001.ulsfo.wmnet [09:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:26] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir4002.ulsfo.wmnet [09:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:17] (03PS1) 10Klausman: Temporarily revert conftool config for ml staging k8s [puppet] - 10https://gerrit.wikimedia.org/r/776168 [09:45:45] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:45:47] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34652/console" [puppet] - 10https://gerrit.wikimedia.org/r/776168 (owner: 10Klausman) [09:45:51] (03PS1) 10Jcrespo: dbbackups: Start backing up orchestrator & rename section db_inventory [puppet] - 10https://gerrit.wikimedia.org/r/776169 (https://phabricator.wikimedia.org/T301315) [09:47:54] !log 
vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir4002.ulsfo.wmnet [09:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:33] (03CR) 10Jcrespo: "This will make the backup check fail, so it needs a monitoring patch followup. But enough to start backing it up and make sure the change " [puppet] - 10https://gerrit.wikimedia.org/r/776169 (https://phabricator.wikimedia.org/T301315) (owner: 10Jcrespo) [09:53:04] (03PS1) 10Jcrespo: dbbackups: Monitor db_inventory rather than zarcillo section [puppet] - 10https://gerrit.wikimedia.org/r/776170 (https://phabricator.wikimedia.org/T301315) [09:55:43] (03PS2) 10Klausman: Temporarily revert conftool config for ml staging k8s [puppet] - 10https://gerrit.wikimedia.org/r/776168 [09:56:17] (03CR) 10Jcrespo: [C: 04-1] "This depends on a refactoring of how we check valid sections for now. Initially I wanted to have the checking and the definition separate," [puppet] - 10https://gerrit.wikimedia.org/r/776170 (https://phabricator.wikimedia.org/T301315) (owner: 10Jcrespo) [09:57:09] (03PS3) 10Klausman: Temporarily revert some config for ml staging k8s [puppet] - 10https://gerrit.wikimedia.org/r/776168 [09:57:25] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir5001.eqsin.wmnet [09:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "Prometheus bits LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/776168 (owner: 10Klausman) [09:59:40] (03PS1) 10Jcrespo: check: Make zarcillo an invalid section, make db_inventory a valid one [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/776171 (https://phabricator.wikimedia.org/T301315) [10:00:01] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1013:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [10:03:40] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir5001.eqsin.wmnet [10:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:19] !log vgutierrez@puppetmaster1001:~$ sudo -i rm /var/run/confd-template/.ml-staging-ctrl*.err [10:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:49] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ml-staging-ctrl on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:05:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:05:25] klausman: ^^ [10:05:32] (the recovery) [10:06:03] Nice [10:06:30] !log vgutierrez@puppetmaster2001:~$ sudo -i rm /var/run/confd-template/.ml-staging-ctrl*.err [10:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:43] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ml-staging-ctrl on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:08:51] (03PS4) 10Klausman: Temporarily revert conftool config for ml staging k8s [puppet] - 10https://gerrit.wikimedia.org/r/776168 [10:09:17] (03PS5) 10Klausman: Temporarily revert some config for ml staging k8s [puppet] - 10https://gerrit.wikimedia.org/r/776168 [10:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:11:15] (03CR) 
10Klausman: [C: 03+2] Temporarily revert some config for ml staging k8s [puppet] - 10https://gerrit.wikimedia.org/r/776168 (owner: 10Klausman) [10:11:51] (03CR) 10Klausman: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34653/console" [puppet] - 10https://gerrit.wikimedia.org/r/776168 (owner: 10Klausman) [10:14:12] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir5002.eqsin.wmnet [10:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:53] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:20:28] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir5002.eqsin.wmnet [10:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:22] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir6001.drmrs.wmnet [10:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:30] (03CR) 10Giuseppe Lavagetto: "This change is ready for review." 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/775904 (owner: 10Volans) [10:29:41] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir6001.drmrs.wmnet [10:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:53] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir6002.drmrs.wmnet [10:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:30:28] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:34:18] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir6002.drmrs.wmnet [10:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:58] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:40:13] (KubernetesRsyslogDown) resolved: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:47:33] !log reboot acme-chief instances to catch up on kernel upgrades [10:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet [10:50:44] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:27] (03CR) 10Kormat: [C: 03+1] check: Make zarcillo an invalid section, make db_inventory a valid one [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/776171 (https://phabricator.wikimedia.org/T301315) (owner: 10Jcrespo) [10:51:39] (03CR) 10Kormat: [C: 03+1] dbbackups: Start backing up orchestrator & rename section db_inventory [puppet] - 10https://gerrit.wikimedia.org/r/776169 (https://phabricator.wikimedia.org/T301315) (owner: 10Jcrespo) [10:54:11] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 69 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:54:13] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1001.eqiad.wmnet [10:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:43] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief2001.codfw.wmnet [10:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:27] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 64 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:01:50] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief2001.codfw.wmnet [11:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [11:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:03] RECOVERY - Check systemd state 
on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [11:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:01] (BlazegraphJvmQuakeWarnGC) firing: (2) Blazegraph instance wdqs1006:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [11:42:17] PROBLEM - puppet last run on deneb is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:44:51] 10SRE, 10Infrastructure-Foundations, 10observability, 10Patch-For-Review, 10User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (10MoritzMuehlenhoff) I had a closer look at the source packages and this is caused by debian/rules file in Buster; it misses to in... 
[11:48:01] (03CR) 10Muehlenhoff: ipmiseld: ensure service enabled and running (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) (owner: 10Herron) [11:48:33] RECOVERY - puppet last run on deneb is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:48:45] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:55:25] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:23:10] (03PS1) 10Majavah: Rename O:ldap::labs to O:ldap::rw [puppet] - 10https://gerrit.wikimedia.org/r/776187 (https://phabricator.wikimedia.org/T295150) [12:26:36] (03PS1) 10Majavah: Update ldap role names [labs/private] - 10https://gerrit.wikimedia.org/r/776188 (https://phabricator.wikimedia.org/T295150) [12:26:37] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) Documentation updated: https://wikitech.wikimedia.org/wiki/Netflow/sflow [12:27:02] (03PS2) 10Majavah: Rename O:ldap::labs to O:ldap::rw [puppet] - 10https://gerrit.wikimedia.org/r/776187 (https://phabricator.wikimedia.org/T295150) [12:28:35] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [12:29:39] (03CR) 10Majavah: "pcc, although failing because it needs the attached labs/private change: https://puppet-compiler.wmflabs.org/pcc-worker1002/34654/" [puppet] - 10https://gerrit.wikimedia.org/r/776187 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [12:30:04] 10SRE, 10Infrastructure-Foundations: Integrate Buster 
10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [12:36:09] (03PS1) 10Majavah: striker: Use ldap-rw hostname for ldap [puppet] - 10https://gerrit.wikimedia.org/r/776189 (https://phabricator.wikimedia.org/T295150) [12:40:01] (BlazegraphJvmQuakeWarnGC) firing: (3) Blazegraph instance wdqs1006:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [12:40:52] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10BTullis) It's clear from the above that we have two distinct use cases that have emerged for the web proxies: | # | Name... [12:45:01] (BlazegraphJvmQuakeWarnGC) firing: (4) Blazegraph instance wdqs1006:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [12:46:13] (03PS1) 10Majavah: P:openldap: remove 'labs' branding [puppet] - 10https://gerrit.wikimedia.org/r/776191 (https://phabricator.wikimedia.org/T295150) [12:48:20] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34656/console" [puppet] - 10https://gerrit.wikimedia.org/r/776191 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [12:49:08] (03PS2) 10Majavah: P:openldap: remove 'labs' branding [puppet] - 10https://gerrit.wikimedia.org/r/776191 (https://phabricator.wikimedia.org/T295150) [12:52:25] !log restarting blazegraph on wdqs1006 and resetting jvmquake warning flag [12:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:01] 
(BlazegraphJvmQuakeWarnGC) firing: (5) Blazegraph instance wdqs1006:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [12:56:48] ^ hm I think this mixes up resolved and firing alerts [13:00:11] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:05:30] !log resetting jvmquake flag on all wdqs hosts [13:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:01] (BlazegraphJvmQuakeWarnGC) resolved: (5) Blazegraph instance wdqs1006:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [13:16:27] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:17:33] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [13:23:22] (03CR) 10Ladsgroup: auto_schema: Wrap starting replication with finally (031 comment) [software] - 10https://gerrit.wikimedia.org/r/775895 (owner: 10Ladsgroup) [13:25:22] (03PS3) 10Ladsgroup: auto_schema: Wrap starting replication with finally [software] - 10https://gerrit.wikimedia.org/r/775895 [13:35:53] (03PS1) 10Jelto: gitlab: pass restore interval to gitlab::restore module [puppet] - 10https://gerrit.wikimedia.org/r/776198 (https://phabricator.wikimedia.org/T285867) [13:39:48] (03PS1) 10Thiemo Kreuz (WMDE): Remove meaningless restriction level "none" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776200 [13:40:36] 
(03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34657/console" [puppet] - 10https://gerrit.wikimedia.org/r/776198 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [13:45:45] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:17:39] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:19:49] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:28:02] 10SRE, 10Analytics-Radar, 10observability: Set up cross DC topic mirroring for Kafka logging clusters - https://phabricator.wikimedia.org/T276972 (10Ottomata) In https://phabricator.wikimedia.org/T304373#7823916 @fgiunchedi wrote > to clarify my position on T276972: I'm not against it per-se, I am questionin... [14:35:58] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:06:11] (03CR) 10Func: [C: 04-1] "You are sorting them? Maybe it should be in a follow-up." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) (owner: 10Winston Sung) [15:07:06] (03CR) 10Func: [C: 04-1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for I9b40319d374143668a2666b42f59a3799d041afc (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) (owner: 10Winston Sung) [15:10:10] (03CR) 10Func: [C: 04-1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for I9b40319d374143668a2666b42f59a3799d041afc (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) (owner: 10Winston Sung) [15:17:57] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:22:29] 10SRE, 10ops-eqiad: Degraded RAID on thanos-be1003 - https://phabricator.wikimedia.org/T304873 (10Cmjohnson) The part has been shipped [15:27:40] (03PS1) 10Ssingh: certspotter: switch to a local CT logs list [puppet] - 10https://gerrit.wikimedia.org/r/776217 (https://phabricator.wikimedia.org/T204993) [15:30:51] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34658/console" [puppet] - 10https://gerrit.wikimedia.org/r/776217 (https://phabricator.wikimedia.org/T204993) (owner: 10Ssingh) [15:44:53] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10MatthewVernon) A couple of Friday-afternoon thoughts, not any kind of policy statement: Swift is somewhat directly available both directly within the WMF network, and via our usual caching layers to the... 
[15:53:25] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:53:33] (03PS1) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 [15:54:06] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (owner: 10Btullis) [15:54:48] (03PS2) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 [16:02:48] (03PS3) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [16:05:30] (03PS12) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) [16:05:36] (03PS13) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) [16:10:01] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1012:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [16:13:40] (03PS4) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [16:14:14] (03CR) 10jerkins-bot: [V: 04-1] Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [16:15:45] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 
(10BBlack) Note - we've made a last-minute change of plans about the timeline of the experiment, and decided to shorten it by one hour. We'll be r... [16:18:55] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:20:26] (03PS1) 10Jelto: gitlab: move backups to /mnt/gitlab-backups [puppet] - 10https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463) [16:21:30] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/776231 [16:21:38] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/776232 [16:22:09] (03PS2) 10Jelto: gitlab: move backups to /mnt/gitlab-backup [puppet] - 10https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463) [16:23:06] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/776233 [16:23:49] (03CR) 10Dzahn: gitlab: move backups to /mnt/gitlab-backup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [16:28:08] (03PS3) 10Jelto: gitlab: move backups to /mnt/gitlab-backup [puppet] - 10https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463) [16:28:59] (03PS5) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [16:29:34] (03CR) 10jerkins-bot: [V: 04-1] Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [16:31:22] (03PS1) 10Ladsgroup: dbtools: Drop unused control-mariadb files [software] - 10https://gerrit.wikimedia.org/r/776235 [16:33:59] (03PS6) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 
(https://phabricator.wikimedia.org/T300246) [16:38:19] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34662/console" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [16:44:53] (03PS1) 10AOkoth: vrts: rename module files and classes [puppet] - 10https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942) [16:47:25] (03CR) 10DannyS712: [C: 03+1] Remove meaningless restriction level "none" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776200 (owner: 10Thiemo Kreuz (WMDE)) [16:48:35] (03PS2) 10AOkoth: vrts: rename module files and classes [puppet] - 10https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942) [16:52:02] (03PS3) 10AOkoth: vrts: rename module files and classes [puppet] - 10https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942) [16:52:39] (03PS1) 10Dzahn: Revert "gitlab: run backup and restore twice daily" [puppet] - 10https://gerrit.wikimedia.org/r/776034 [16:54:15] (03CR) 10Jelto: [C: 03+1] "lgtm until dedicated disks for backup are setup (I18123dca6b5092989787b37be450d5990c80acb3)" [puppet] - 10https://gerrit.wikimedia.org/r/776034 (owner: 10Dzahn) [16:54:31] (03PS4) 10AOkoth: vrts: rename module files and classes [puppet] - 10https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942) [16:55:18] (03CR) 10Dzahn: [C: 03+2] Revert "gitlab: run backup and restore twice daily" [puppet] - 10https://gerrit.wikimedia.org/r/776034 (owner: 10Dzahn) [16:56:10] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1002/34666/" [puppet] - 10https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [17:01:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:03:11] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:05:01] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:21:13] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:40:04] (03CR) 10Dzahn: [C: 03+1] "looks good to me but let's do the module rename next week" [puppet] - 10https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [17:45:45] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:45:47] (03PS1) 10Ladsgroup: dbtools: Port switchover-tmpl to python [software] - 10https://gerrit.wikimedia.org/r/776241 [18:00:28] (03PS2) 10Ssingh: certspotter: switch to a local CT logs list [puppet] - 10https://gerrit.wikimedia.org/r/776217 (https://phabricator.wikimedia.org/T204993) [18:02:54] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34667/console" [puppet] - 10https://gerrit.wikimedia.org/r/776217 (https://phabricator.wikimedia.org/T204993) (owner: 10Ssingh) [18:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:22:23] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:36:13] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:42:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2036.codfw.wmnet [18:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:25] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw141[4-8].wmnet [18:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:03] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw1414.wmnet [19:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:25] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw141[4-8].wmnet [19:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:45] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1414.wmnet [19:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:47] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cp2036.codfw.wmnet [19:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:33] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw141[4-8].eqiad.wmnet [19:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:51] !Log rebooting mw canary appserver eqiad - mw1414, mw1415, mw1416, mw1417 [19:10:27] PROBLEM - Host mw1414 is DOWN: PING CRITICAL - Packet loss = 100% 
[19:10:33] PROBLEM - Host mw1417 is DOWN: PING CRITICAL - Packet loss = 100% [19:11:25] ACKNOWLEDGEMENT - Host mw1414 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reboot [19:11:25] ACKNOWLEDGEMENT - Host mw1417 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reboot [19:11:29] RECOVERY - Host mw1414 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:13:37] RECOVERY - Host mw1417 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:16:12] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=mw141[4-8].eqiad.wmnet [19:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:47] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw144[7-9].eqiad.wmnet [19:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:29] !Log rebooting mw canary API appserver eqiad - mw1447, mw1448, mw1449, mw1450 [19:28:59] PROBLEM - Host mw1448 is DOWN: PING CRITICAL - Packet loss = 100% [19:30:09] RECOVERY - Host mw1448 is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms [19:35:50] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2036.codfw.wmnet,service=ats-be [19:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:54] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2036.codfw.wmnet,service=ats-tls [19:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:01] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2036.codfw.wmnet,service=varnish-fe [19:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:44] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=mw1450.eqiad.wmnet [19:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:52] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: 
dc=eqiad,name=mw144[7-9].eqiad.wmnet [19:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:47] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp102[5-6].eqiad.wmnet [19:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:09] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=parse200[1-2].eqiad.wmnet [19:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:17] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse200[1-2].eqiad.wmnet [19:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:28] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse200[1-2].codfw.wmnet [19:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:19] !Log rebooting parsoid canary appservers - wtp1025, wtp1026, parse2001, parse2002 [19:44:05] PROBLEM - Host wtp1025 is DOWN: PING CRITICAL - Packet loss = 100% [19:44:10] !log rebooting parsoid canary appservers - wtp1025, wtp1026, parse2001, parse2002 [19:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:15] PROBLEM - Host wtp1026 is DOWN: PING CRITICAL - Packet loss = 100% [19:44:47] RECOVERY - Host wtp1025 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:45:31] PROBLEM - Host parse2001 is DOWN: PING CRITICAL - Packet loss = 100% [19:45:41] RECOVERY - Host wtp1026 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [19:45:43] RECOVERY - Host parse2001 is UP: PING WARNING - Packet loss = 80%, RTA = 31.64 ms [19:47:48] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=parse200[1-2].codfw.wmnet [19:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:11] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: 
dc=eqiad,name=wtp102[5-6].eqiad.wmnet [19:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:16] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1012:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [20:18:43] (03PS8) 10Dzahn: geoip::maxmind: rename the update timers, don't use 'legacy' term [puppet] - 10https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) [20:19:15] !log volans@cumin1001 START - Cookbook sre.dns.netbox [20:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:09] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:21] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/34669/" [puppet] - 10https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) (owner: 10Dzahn) [20:27:27] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:27:41] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/34670/gitlab1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/776198 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [20:29:36] (03CR) 10Dzahn: "noop on gitlab1001/2001" [puppet] - 10https://gerrit.wikimedia.org/r/776198 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [20:34:17] (03PS1) 10Zabe: Pin CheckUser actor migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776250 (https://phabricator.wikimedia.org/T233004) [20:35:42] (03PS2) 10Zabe: Pin CheckUser actor migration to 
old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776250 (https://phabricator.wikimedia.org/T233004) [20:59:22] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10Ottomata) Hello! Yes! No policy statements here; we are in the 'feedback / alignment building' phase of talking about Shared Data Platform. :) Data stored in Shared Data Platform is intended to be load... [21:10:47] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:45] (03PS1) 10BBlack: Revert "Depool esams to test drmrs at full EMEA load" [dns] - 10https://gerrit.wikimedia.org/r/776045 [21:17:15] (03PS2) 10BBlack: Revert "Depool esams to test drmrs at full EMEA load" [dns] - 10https://gerrit.wikimedia.org/r/776045 (https://phabricator.wikimedia.org/T304089) [21:31:48] (03PS1) 10Zabe: tests: rename $wmfConfigDir to $configDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776253 (https://phabricator.wikimedia.org/T45956) [21:38:39] (03PS1) 10Zabe: Start writing to $wmgUsingKubernetes the same value as to $wmfUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776254 (https://phabricator.wikimedia.org/T45956) [21:38:41] (03PS1) 10Zabe: Migrate $wmfUsingKubernetes to $wmgUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776255 (https://phabricator.wikimedia.org/T45956) [21:38:43] (03PS1) 10Zabe: Stop writing to $wmfUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776256 (https://phabricator.wikimedia.org/T45956) [21:45:45] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:46:03] 
(03PS1) 10Zabe: Start writing to $wmgUdp2logDest the same value as to $wmfUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776257 (https://phabricator.wikimedia.org/T45956) [21:46:05] (03PS1) 10Zabe: Migrate $wmfUdp2logDest to $wmgUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776258 (https://phabricator.wikimedia.org/T45956) [21:46:07] (03PS1) 10Zabe: Stop writing to $wmfUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776259 (https://phabricator.wikimedia.org/T45956) [21:55:16] (03CR) 10BBlack: [C: 03+2] Revert "Depool esams to test drmrs at full EMEA load" [dns] - 10https://gerrit.wikimedia.org/r/776045 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [21:55:23] (03PS3) 10BBlack: Revert "Depool esams to test drmrs at full EMEA load" [dns] - 10https://gerrit.wikimedia.org/r/776045 (https://phabricator.wikimedia.org/T304089) [22:04:45] !log esams re-pooled - T304089 [22:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:49] T304089: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 [22:09:00] (03PS4) 10Zabe: Add file mover user group for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774841 (https://phabricator.wikimedia.org/T304968) (owner: 10NguoiDungKhongDinhDanh) [22:10:19] (03CR) 10Zabe: [C: 03+1] Add file mover user group for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774841 (https://phabricator.wikimedia.org/T304968) (owner: 10NguoiDungKhongDinhDanh) [22:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:21:39] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - 
https://phabricator.wikimedia.org/T304089 (10BBlack) [22:26:13] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:36:13] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:39:57] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BBlack) Test concluded, and esams is re-pooled. More analysis and planning to follow next week I'm sure, but the basic highlights are: * We we... [22:52:21] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [22:58:19] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:04:22] PROBLEM - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [23:04:31] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [23:06:10] looking [23:06:58] <- here if you need anything [23:07:35] nah, it looks like just T291707 again -- doing the rolling restart in eqiad [23:07:36] T291707: zotero paging / serving 5xxes after CPU spikes - https://phabricator.wikimedia.org/T291707 [23:07:51] ack [23:08:04] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: sync [23:08:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:21] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: sync [23:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:33] RECOVERY - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 197 bytes in 1.021 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [23:08:49] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:09:33] graphs improving like magic :) I will be happier when we get that fixed [23:11:14] well.. i just got back in the door when it sent the resolved. ACK it seemed like that thing you fixed yesterday [23:11:39] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:12:20] yeah, the swagger patch linked in that bug is the real fix, we'll be able to health-check those pods so they restart automatically when they get stuck, instead of paging us [23:12:36] well, the *real* fix would be rewriting zotero so it doesn't get stuck at all :) but the achievable one is that other thing [23:13:20] should land soon hopefully -- in the meantime, at least the rolling restart is quick 🙃 [23:13:57] ACK!, thank you [23:14:14] still looks stable, I'm checking out again -- thanks bblack and mutante for being around [23:16:46] sees https://wikitech.wikimedia.org/wiki/Zotero#Rolling_restart in case we need it again [23:19:15] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10Dzahn) Is it ok if we close this ticket and you just reopen it again once he is back? 
[23:20:46] (03PS1) 10Dzahn: add new language kcg - Tyap_language [dns] - 10https://gerrit.wikimedia.org/r/776266 (https://phabricator.wikimedia.org/T305279) [23:20:56] (03PS2) 10Dzahn: add new language kcg - Tyap_language [dns] - 10https://gerrit.wikimedia.org/r/776266 (https://phabricator.wikimedia.org/T305279) [23:23:13] mutante: you don't need to create the patches, the bot does. And I think now it thinks all patches are made and won't create [23:23:19] *any other ones either [23:23:56] (the script runs every six hours) [23:24:11] hmm.. duly noted for next time [23:24:19] but too late now I guess..and might as well merge it? [23:24:33] sure [23:24:45] ok [23:25:02] (03CR) 10Dzahn: [C: 03+2] add new language kcg - Tyap_language [dns] - 10https://gerrit.wikimedia.org/r/776266 (https://phabricator.wikimedia.org/T305279) (owner: 10Dzahn) [23:25:11] !log DNS - new project language 'kcg'. 'Tyap is a regionally important dialect cluster of Plateau languages in Nigeria's Middle Belt, named after its prestige dialect. It is also known by its Hausa exonym as Katab or Kataf.' T305279 [23:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:14] T305279: Create Wikipedia Tyap - https://phabricator.wikimedia.org/T305279 [23:28:04] Amir1: I think we need something like wikimediausergroups.org but shorter :) [23:28:13] and then put the requested user group wikis in that [23:28:29] Indic and Uzbek [23:29:05] mutante: yup but that needs to be approved by affcom first, that has been waiting on them for two years I think [23:29:20] hmm.. 
nod .ACK, ok [23:31:30] running the script now [23:31:44] let's hope it doesn't skip stuff [23:32:05] I was about to ping you to tell you the bot did it, heh [23:32:29] looks good afaict [23:32:52] T305284 [23:32:53] T305284: Add kcgwiki to wikistats - https://phabricator.wikimedia.org/T305284 [23:32:55] I also see a task for wikistats :) [23:32:58] thanks Amir1 [23:33:26] I think the patch for restbase could also be automated [23:33:34] will add it though once I get the mail from "newprojects" that happens when createwiki.sh runs [23:33:50] thankfully it's smarter about gerrit patches, if you create a subticket or parent, it wont' create any [23:33:56] zabe: PR welcome ;) [23:34:16] I think there were some complexity around placing the url but not too complciated [23:34:38] what I hope is to have the configs on yaml files so we can make the init mw config patch automated [23:34:55] it's ongoing, I'm happy [23:38:03] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:40:11] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:41:24] (03CR) 10Ladsgroup: [C: 03+1] Pin CheckUser actor migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776250 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [23:41:59] (03PS2) 10Ladsgroup: dbtools: Port switchover-tmpl to python [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670)