[00:01:21] <icinga-wm>	 RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:16:01] <wikibugs>	 (PS1) Cwhite: beta-logs: use output meta parameter to gate outputs [puppet] - https://gerrit.wikimedia.org/r/776023 (https://phabricator.wikimedia.org/T305088)
[00:16:03] <wikibugs>	 (PS1) Cwhite: beta-logs: use bucket meta parameter to define curation buckets [puppet] - https://gerrit.wikimedia.org/r/776024 (https://phabricator.wikimedia.org/T305013)
[00:19:55] <wikibugs>	 (PS1) Cwhite: logstash: add $schema field to w3creportingapi tests [puppet] - https://gerrit.wikimedia.org/r/776025
[00:45:49] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:40:22] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:22] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:34] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[01:47:40] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[02:05:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:10:45] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:11:05] <icinga-wm>	 PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:46:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:09:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:11:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:12:15] <icinga-wm>	 RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:12:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:14:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:49:27] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:45:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:05:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:10:45] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:33:55] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:34:29] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:23] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:40:29] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:44:25] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:44:53] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:46:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:46:31] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:46:47] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:48:57] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:51:01] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:51:03] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:52:49] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:53:21] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:54:09] <XioNoX>	 !log traffic engineering in drmrs to prevent link saturation
[06:54:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:37] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220401T0700)
[07:02:05] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:04:03] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:08:47] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:10:27] <godog>	 jelto: looks like prometheus can't fetch metrics from some gitlab hosts (see JobUnavailable), expected ?
[07:15:17] <wikibugs>	 (PS1) Ayounsi: drmrs: offload traffic from Telia transit [homer/public] - https://gerrit.wikimedia.org/r/776157
[07:15:31] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:17:43] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:23:21] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (fgiunchedi) >>! In T299462#7821068, @MatthewVernon wrote: > I note that some of the setup checklist tasks for these hosts haven't been done, maybe that's it?...
[07:24:27] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:26:41] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:27:27] <wikibugs>	 (CR) Ayounsi: [C: +2] "Already pushed." [homer/public] - https://gerrit.wikimedia.org/r/776157 (owner: Ayounsi)
[07:28:22] <wikibugs>	 (Merged) jenkins-bot: drmrs: offload traffic from Telia transit [homer/public] - https://gerrit.wikimedia.org/r/776157 (owner: Ayounsi)
[07:28:25] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:30:39] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:32:33] <wikibugs>	 SRE, Analytics, Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (fgiunchedi) (my two cents) agreed option 1. seems preferable, and to clarify my position on T276972: I'm not against it per-se, I am questioning the "dc...
[07:37:55] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:44:39] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:51:25] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:52:14] <wikibugs>	 SRE, Generated Data Platform, serviceops, Patch-For-Review, Service-deployment-requests: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (JMeybohm)
[07:53:39] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:55:37] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:56:01] <wikibugs>	 SRE, Generated Data Platform, serviceops, Patch-For-Review, Service-deployment-requests: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (JMeybohm) >>! In T304891#7823127, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#...
[07:57:51] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:01:24] <wikibugs>	 SRE-OnFire (FY2021/2022-Q2): 2021-10-29 graphite - https://phabricator.wikimedia.org/T295157 (fgiunchedi) >>! In T295157#7822191, @jcrespo wrote: > Scored, up for a review @fgiunchedi in a week? Context: https://wikitech.wikimedia.org/wiki/Incident_review_ritual  Sure SGTM, I'll take part in the ritual on Ap...
[08:02:05] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:03:55] <wikibugs>	 SRE, Generated Data Platform, serviceops, Patch-For-Review, Service-deployment-requests: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (Joe) After discussion in the meeting yesterday, we concluded that: * We will create a generic...
[08:04:19] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:04:45] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:09:13] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:09:21] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:10:51] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [software/swift-ring] - https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[08:10:55] <wikibugs>	 (PS2) JMeybohm: Add correct tlsHostnames and extra SAN to datahub cert [deployment-charts] - https://gerrit.wikimedia.org/r/773256 (https://phabricator.wikimedia.org/T303049)
[08:11:35] <wikibugs>	 (CR) Filippo Giunchedi: "Actually taking +1 back, see inline" [software/swift-ring] - https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[08:13:11] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/775296 (owner: Muehlenhoff)
[08:13:41] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:14:38] <wikibugs>	 (PS4) JMeybohm: Add *.k8s-staging.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740)
[08:15:29] <XioNoX>	 I downtimed the Singtel related BFD/OSPF alerts until tomorrow (announced ETR)
[08:16:42] <wikibugs>	 (CR) JMeybohm: [C: +2] Add *.k8s-staging.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740) (owner: JMeybohm)
[08:20:33] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:28:42] <wikibugs>	 (PS2) MVernon: Makefile, docs: Note this is now obsolete [software/swift-ring] - https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117)
[08:29:34] <wikibugs>	 (CR) MVernon: "Hi," [software/swift-ring] - https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[08:31:45] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM! Thanks for the fix" [software/swift-ring] - https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[08:39:21] <wikibugs>	 (CR) MVernon: [V: +2 C: +2] Makefile, docs: Note this is now obsolete [software/swift-ring] - https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[08:42:35] <Lucas_WMDE>	 are beta-only config changes allowed on Friday? (I’d still pull and sync them in production, but it would be a -labs.php file)
[08:42:37] <vgutierrez>	 !log rolling restart of ncredir instances to catch up on kernel upgrades
[08:42:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:51] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[08:44:51] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
[08:44:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:13] <vgutierrez>	 lovely Friday :)
[08:48:53] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir1001.eqiad.wmnet
[08:48:54] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ncredir1001.eqiad.wmnet
[08:48:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:33] <vgutierrez>	 sigh... today is not my day
[08:49:46] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir1001.eqiad.wmnet
[08:49:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:14] <wikibugs>	 (PS1) JMeybohm: Use *.k8s-staging.discovery.wmnet for staging certificates [deployment-charts] - https://gerrit.wikimedia.org/r/776162 (https://phabricator.wikimedia.org/T300740)
[08:52:16] <wikibugs>	 (PS1) JMeybohm: Use *.k8s-staging.discovery.wmnet for staging Ingress [deployment-charts] - https://gerrit.wikimedia.org/r/776163 (https://phabricator.wikimedia.org/T300740)
[08:53:52] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir1001.eqiad.wmnet
[08:53:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:56] <wikibugs>	 (PS2) JMeybohm: Use *.k8s-staging.discovery.wmnet for staging Ingress [deployment-charts] - https://gerrit.wikimedia.org/r/776163 (https://phabricator.wikimedia.org/T300740)
[08:54:29] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir1002.eqiad.wmnet
[08:54:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:35] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir1002.eqiad.wmnet
[08:58:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:03] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir2001.codfw.wmnet
[08:59:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:13] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ncredir2001.codfw.wmnet
[09:10:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:50] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir2002.codfw.wmnet
[09:10:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:33] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir2002.codfw.wmnet
[09:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:30] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir3001.esams.wmnet
[09:18:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:10] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir3001.esams.wmnet
[09:24:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:27] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir3002.esams.wmnet
[09:24:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:34] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (Tchanders) >>! In T303398#7821893, @herron wrote: >>>! In T303398#7796432, @jbond wrote: >> Change to stalled until TsepoThoabala return >  > When is Tsep...
[09:31:00] <wikibugs>	 (PS1) Daniel Kinzler: Always set MW_USE_CONFIG_SCHEMA. [mediawiki-config] - https://gerrit.wikimedia.org/r/776164 (https://phabricator.wikimedia.org/T305176)
[09:32:10] <wikibugs>	 (CR) Kormat: [C: -1] auto_schema: Wrap starting replication with finally (1 comment) [software] - https://gerrit.wikimedia.org/r/775895 (owner: Ladsgroup)
[09:34:06] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] Always set MW_USE_CONFIG_SCHEMA. [mediawiki-config] - https://gerrit.wikimedia.org/r/776164 (https://phabricator.wikimedia.org/T305176) (owner: Daniel Kinzler)
[09:35:45] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ncredir3002.esams.wmnet
[09:35:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:19] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir4001.ulsfo.wmnet
[09:37:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:02] <wikibugs>	 SRE, Data-Catalog, Data-Engineering, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis)
[09:43:18] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir4001.ulsfo.wmnet
[09:43:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:26] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir4002.ulsfo.wmnet
[09:43:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:17] <wikibugs>	 (PS1) Klausman: Temporarily revert conftool config for ml staging k8s [puppet] - https://gerrit.wikimedia.org/r/776168
[09:45:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:45:47] <wikibugs>	 (CR) Klausman: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34652/console" [puppet] - https://gerrit.wikimedia.org/r/776168 (owner: Klausman)
[09:45:51] <wikibugs>	 (PS1) Jcrespo: dbbackups: Start backing up orchestrator & rename section db_inventory [puppet] - https://gerrit.wikimedia.org/r/776169 (https://phabricator.wikimedia.org/T301315)
[09:47:54] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir4002.ulsfo.wmnet
[09:47:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:33] <wikibugs>	 (CR) Jcrespo: "This will make the backup check fail, so it needs a monitoring patch followup. But enough to start backing it up and make sure the change " [puppet] - https://gerrit.wikimedia.org/r/776169 (https://phabricator.wikimedia.org/T301315) (owner: Jcrespo)
[09:53:04] <wikibugs>	 (PS1) Jcrespo: dbbackups: Monitor db_inventory rather than zarcillo section [puppet] - https://gerrit.wikimedia.org/r/776170 (https://phabricator.wikimedia.org/T301315)
[09:55:43] <wikibugs>	 (PS2) Klausman: Temporarily revert conftool config for ml staging k8s [puppet] - https://gerrit.wikimedia.org/r/776168
[09:56:17] <wikibugs>	 (CR) Jcrespo: [C: -1] "This depends on a refactoring of how we check valid sections for now. Initially I wanted to have the checking and the definition separate," [puppet] - https://gerrit.wikimedia.org/r/776170 (https://phabricator.wikimedia.org/T301315) (owner: Jcrespo)
[09:57:09] <wikibugs>	 (PS3) Klausman: Temporarily revert some config for ml staging k8s [puppet] - https://gerrit.wikimedia.org/r/776168
[09:57:25] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir5001.eqsin.wmnet
[09:57:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:55] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Prometheus bits LGTM" [puppet] - https://gerrit.wikimedia.org/r/776168 (owner: Klausman)
[09:59:40] <wikibugs>	 (PS1) Jcrespo: check: Make zarcillo an invalid section, make db_inventory a valid one [software/wmfbackups] - https://gerrit.wikimedia.org/r/776171 (https://phabricator.wikimedia.org/T301315)
[10:00:01] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1013:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[10:03:40] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir5001.eqsin.wmnet
[10:03:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:19] <vgutierrez>	 !log vgutierrez@puppetmaster1001:~$ sudo -i rm /var/run/confd-template/.ml-staging-ctrl*.err
[10:04:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:49] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/codfw/ml-staging-ctrl on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[10:05:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:05:25] <vgutierrez>	 klausman: ^^
[10:05:32] <vgutierrez>	 (the recovery)
[10:06:03] <klausman>	 Nice
[10:06:30] <vgutierrez>	 !log vgutierrez@puppetmaster2001:~$ sudo -i rm /var/run/confd-template/.ml-staging-ctrl*.err
[10:06:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:43] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/codfw/ml-staging-ctrl on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[10:08:51] <wikibugs>	 (PS4) Klausman: Temporarily revert conftool config for ml staging k8s [puppet] - https://gerrit.wikimedia.org/r/776168
[10:09:17] <wikibugs>	 (PS5) Klausman: Temporarily revert some config for ml staging k8s [puppet] - https://gerrit.wikimedia.org/r/776168
[10:10:45] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:11:15] <wikibugs>	 (CR) Klausman: [C: +2] Temporarily revert some config for ml staging k8s [puppet] - https://gerrit.wikimedia.org/r/776168 (owner: Klausman)
[10:11:51] <wikibugs>	 (CR) Klausman: [V: +1 C: +2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34653/console" [puppet] - https://gerrit.wikimedia.org/r/776168 (owner: Klausman)
[10:14:12] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir5002.eqsin.wmnet
[10:14:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:53] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:20:28] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir5002.eqsin.wmnet
[10:20:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:22] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir6001.drmrs.wmnet
[10:21:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:30] <wikibugs>	 (CR) Giuseppe Lavagetto: "This change is ready for review." (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/775904 (owner: Volans)
[10:29:41] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir6001.drmrs.wmnet
[10:29:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:53] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir6002.drmrs.wmnet
[10:29:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:30:28] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:34:18] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir6002.drmrs.wmnet
[10:34:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:40:13] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:47:33] <vgutierrez>	 !log reboot acme-chief instances to catch up on kernel upgrades
[10:47:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:42] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet
[10:50:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:27] <wikibugs>	 (CR) Kormat: [C: +1] check: Make zarcillo an invalid section, make db_inventory a valid one [software/wmfbackups] - https://gerrit.wikimedia.org/r/776171 (https://phabricator.wikimedia.org/T301315) (owner: Jcrespo)
[10:51:39] <wikibugs>	 (CR) Kormat: [C: +1] dbbackups: Start backing up orchestrator & rename section db_inventory [puppet] - https://gerrit.wikimedia.org/r/776169 (https://phabricator.wikimedia.org/T301315) (owner: Jcrespo)
[10:54:11] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 69 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:54:13] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1001.eqiad.wmnet
[10:54:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:43] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief2001.codfw.wmnet
[10:55:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:27] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 64 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:01:50] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief2001.codfw.wmnet
[11:01:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet
[11:26:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:03] <icinga-wm>	 RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet
[11:31:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:01] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: (2) Blazegraph instance wdqs1006:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[11:42:17] <icinga-wm>	 PROBLEM - puppet last run on deneb is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:44:51] <wikibugs>	 SRE, Infrastructure-Foundations, observability, Patch-For-Review, User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (MoritzMuehlenhoff) I had a closer look at the source packages and this is caused by debian/rules file in Buster; it misses to in...
[11:48:01] <wikibugs>	 (CR) Muehlenhoff: ipmiseld: ensure service enabled and running (1 comment) [puppet] - https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) (owner: Herron)
[11:48:33] <icinga-wm>	 RECOVERY - puppet last run on deneb is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:48:45] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[11:55:25] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[12:23:10] <wikibugs>	 (PS1) Majavah: Rename O:ldap::labs to O:ldap::rw [puppet] - https://gerrit.wikimedia.org/r/776187 (https://phabricator.wikimedia.org/T295150)
[12:26:36] <wikibugs>	 (PS1) Majavah: Update ldap role names [labs/private] - https://gerrit.wikimedia.org/r/776188 (https://phabricator.wikimedia.org/T295150)
[12:26:37] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) Documentation updated: https://wikitech.wikimedia.org/wiki/Netflow/sflow
[12:27:02] <wikibugs>	 (PS2) Majavah: Rename O:ldap::labs to O:ldap::rw [puppet] - https://gerrit.wikimedia.org/r/776187 (https://phabricator.wikimedia.org/T295150)
[12:28:35] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (MoritzMuehlenhoff)
[12:29:39] <wikibugs>	 (CR) Majavah: "pcc, although failing because it needs the attached labs/private change: https://puppet-compiler.wmflabs.org/pcc-worker1002/34654/" [puppet] - https://gerrit.wikimedia.org/r/776187 (https://phabricator.wikimedia.org/T295150) (owner: Majavah)
[12:30:04] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (MoritzMuehlenhoff)
[12:36:09] <wikibugs>	 (PS1) Majavah: striker: Use ldap-rw hostname for ldap [puppet] - https://gerrit.wikimedia.org/r/776189 (https://phabricator.wikimedia.org/T295150)
[12:40:01] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: (3) Blazegraph instance wdqs1006:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[12:40:52] <wikibugs>	 SRE, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (BTullis) It's clear from the above that we have two distinct use cases that have emerged for the web proxies:  | # | Name...
[12:45:01] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: (4) Blazegraph instance wdqs1006:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[12:46:13] <wikibugs>	 (PS1) Majavah: P:openldap: remove 'labs' branding [puppet] - https://gerrit.wikimedia.org/r/776191 (https://phabricator.wikimedia.org/T295150)
[12:48:20] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34656/console" [puppet] - https://gerrit.wikimedia.org/r/776191 (https://phabricator.wikimedia.org/T295150) (owner: Majavah)
[12:49:08] <wikibugs>	 (PS2) Majavah: P:openldap: remove 'labs' branding [puppet] - https://gerrit.wikimedia.org/r/776191 (https://phabricator.wikimedia.org/T295150)
[12:52:25] <dcausse>	 !log restarting blazegraph on wdqs1006 and resetting jvmquake warning flag
[12:52:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:01] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: (5) Blazegraph instance wdqs1006:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[12:56:48] <dcausse>	 ^ hm I think this mixes up resolved and firing alerts
[13:00:11] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:05:30] <dcausse>	 !log resetting jvmquake flag on all wdqs hosts
[13:05:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:01] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) resolved: (5) Blazegraph instance wdqs1006:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[13:16:27] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:17:33] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (MoritzMuehlenhoff)
[13:23:22] <wikibugs>	 (CR) Ladsgroup: auto_schema: Wrap starting replication with finally (1 comment) [software] - https://gerrit.wikimedia.org/r/775895 (owner: Ladsgroup)
[13:25:22] <wikibugs>	 (PS3) Ladsgroup: auto_schema: Wrap starting replication with finally [software] - https://gerrit.wikimedia.org/r/775895
[13:35:53] <wikibugs>	 (PS1) Jelto: gitlab: pass restore interval to gitlab::restore module [puppet] - https://gerrit.wikimedia.org/r/776198 (https://phabricator.wikimedia.org/T285867)
[13:39:48] <wikibugs>	 (PS1) Thiemo Kreuz (WMDE): Remove meaningless restriction level "none" [mediawiki-config] - https://gerrit.wikimedia.org/r/776200
[13:40:36] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34657/console" [puppet] - https://gerrit.wikimedia.org/r/776198 (https://phabricator.wikimedia.org/T285867) (owner: Jelto)
[13:45:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:10:45] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:17:39] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:19:49] <icinga-wm>	 RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:28:02] <wikibugs>	 SRE, Analytics-Radar, observability: Set up cross DC topic mirroring for Kafka logging clusters - https://phabricator.wikimedia.org/T276972 (Ottomata) In https://phabricator.wikimedia.org/T304373#7823916 @fgiunchedi wrote > to clarify my position on T276972: I'm not against it per-se, I am questionin...
[14:35:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:06:11] <wikibugs>	 (CR) Func: [C: -1] "You are sorting them? Maybe it should be in a follow-up." [mediawiki-config] - https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) (owner: Winston Sung)
[15:07:06] <wikibugs>	 (CR) Func: [C: -1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for I9b40319d374143668a2666b42f59a3799d041afc (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) (owner: Winston Sung)
[15:10:10] <wikibugs>	 (CR) Func: [C: -1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for I9b40319d374143668a2666b42f59a3799d041afc (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) (owner: Winston Sung)
[15:17:57] <icinga-wm>	 PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:22:29] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on thanos-be1003 - https://phabricator.wikimedia.org/T304873 (Cmjohnson) The part has been shipped
[15:27:40] <wikibugs>	 (PS1) Ssingh: certspotter: switch to a local CT logs list [puppet] - https://gerrit.wikimedia.org/r/776217 (https://phabricator.wikimedia.org/T204993)
[15:30:51] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34658/console" [puppet] - https://gerrit.wikimedia.org/r/776217 (https://phabricator.wikimedia.org/T204993) (owner: Ssingh)
[15:44:53] <wikibugs>	 SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (MatthewVernon) A couple of Friday-afternoon thoughts, not any kind of policy statement:  Swift is somewhat directly available both directly within the WMF network, and via our usual caching layers to the...
[15:53:25] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:53:33] <wikibugs>	 (PS1) Btullis: Add initial config for pooled status [puppet] - https://gerrit.wikimedia.org/r/776225
[15:54:06] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/776225 (owner: Btullis)
[15:54:48] <wikibugs>	 (PS2) Btullis: Add initial config for pooled status [puppet] - https://gerrit.wikimedia.org/r/776225
[16:02:48] <wikibugs>	 (PS3) Btullis: Add initial config for pooled status [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246)
[16:05:30] <wikibugs>	 (PS12) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" [mediawiki-config] - https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308)
[16:05:36] <wikibugs>	 (PS13) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" [mediawiki-config] - https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308)
[16:10:01] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1012:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[16:13:40] <wikibugs>	 (PS4) Btullis: Add initial config for pooled status [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246)
[16:14:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add initial config for pooled status [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[16:15:45] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops, Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (BBlack) Note - we've made a last-minute change of plans about the timeline of the experiment, and decided to shorten it by one hour.  We'll be r...
[16:18:55] <icinga-wm>	 RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:20:26] <wikibugs>	 (PS1) Jelto: gitlab: move backups to /mnt/gitlab-backups [puppet] - https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463)
[16:21:30] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/776231
[16:21:38] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/776232
[16:22:09] <wikibugs>	 (PS2) Jelto: gitlab: move backups to /mnt/gitlab-backup [puppet] - https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463)
[16:23:06] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/776233
[16:23:49] <wikibugs>	 (CR) Dzahn: gitlab: move backups to /mnt/gitlab-backup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[16:28:08] <wikibugs>	 (PS3) Jelto: gitlab: move backups to /mnt/gitlab-backup [puppet] - https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463)
[16:28:59] <wikibugs>	 (PS5) Btullis: Add initial config for pooled status [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246)
[16:29:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add initial config for pooled status [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[16:31:22] <wikibugs>	 (PS1) Ladsgroup: dbtools: Drop unused control-mariadb files [software] - https://gerrit.wikimedia.org/r/776235
[16:33:59] <wikibugs>	 (PS6) Btullis: Add initial config for pooled status [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246)
[16:38:19] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34662/console" [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[16:44:53] <wikibugs>	 (PS1) AOkoth: vrts: rename module files and classes [puppet] - https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942)
[16:47:25] <wikibugs>	 (CR) DannyS712: [C: +1] Remove meaningless restriction level "none" [mediawiki-config] - https://gerrit.wikimedia.org/r/776200 (owner: Thiemo Kreuz (WMDE))
[16:48:35] <wikibugs>	 (PS2) AOkoth: vrts: rename module files and classes [puppet] - https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942)
[16:52:02] <wikibugs>	 (PS3) AOkoth: vrts: rename module files and classes [puppet] - https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942)
[16:52:39] <wikibugs>	 (PS1) Dzahn: Revert "gitlab: run backup and restore twice daily" [puppet] - https://gerrit.wikimedia.org/r/776034
[16:54:15] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm until dedicated disks for backup are setup (I18123dca6b5092989787b37be450d5990c80acb3)" [puppet] - https://gerrit.wikimedia.org/r/776034 (owner: Dzahn)
[16:54:31] <wikibugs>	 (PS4) AOkoth: vrts: rename module files and classes [puppet] - https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942)
[16:55:18] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "gitlab: run backup and restore twice daily" [puppet] - https://gerrit.wikimedia.org/r/776034 (owner: Dzahn)
[16:56:10] <wikibugs>	 (CR) AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1002/34666/" [puppet] - https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth)
[17:01:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:03:11] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:05:01] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:21:13] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:40:04] <wikibugs>	 (CR) Dzahn: [C: +1] "looks good to me but let's do the module rename next week" [puppet] - https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942) (owner: AOkoth)
[17:45:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:45:47] <wikibugs>	 (PS1) Ladsgroup: dbtools: Port switchover-tmpl to python [software] - https://gerrit.wikimedia.org/r/776241
[18:00:28] <wikibugs>	 (PS2) Ssingh: certspotter: switch to a local CT logs list [puppet] - https://gerrit.wikimedia.org/r/776217 (https://phabricator.wikimedia.org/T204993)
[18:02:54] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34667/console" [puppet] - https://gerrit.wikimedia.org/r/776217 (https://phabricator.wikimedia.org/T204993) (owner: Ssingh)
[18:10:45] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:22:23] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:36:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:42:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2036.codfw.wmnet
[18:42:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:25] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw141[4-8].wmnet
[18:58:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:03] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw1414.wmnet
[19:00:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:25] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw141[4-8].wmnet
[19:00:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:45] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1414.wmnet
[19:00:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:47] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cp2036.codfw.wmnet
[19:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:33] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw141[4-8].eqiad.wmnet
[19:01:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:51] <mutante>	 !Log rebooting mw canary appserver eqiad - mw1414, mw1415, mw1416, mw1417
[19:10:27] <icinga-wm>	 PROBLEM - Host mw1414 is DOWN: PING CRITICAL - Packet loss = 100%
[19:10:33] <icinga-wm>	 PROBLEM - Host mw1417 is DOWN: PING CRITICAL - Packet loss = 100%
[19:11:25] <icinga-wm>	 ACKNOWLEDGEMENT - Host mw1414 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reboot
[19:11:25] <icinga-wm>	 ACKNOWLEDGEMENT - Host mw1417 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reboot
[19:11:29] <icinga-wm>	 RECOVERY - Host mw1414 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[19:13:37] <icinga-wm>	 RECOVERY - Host mw1417 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[19:16:12] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=mw141[4-8].eqiad.wmnet
[19:16:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:47] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw144[7-9].eqiad.wmnet
[19:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:29] <mutante>	 !Log rebooting mw canary API appserver eqiad - mw1447, mw1448, mw1449, mw1450
[19:28:59] <icinga-wm>	 PROBLEM - Host mw1448 is DOWN: PING CRITICAL - Packet loss = 100%
[19:30:09] <icinga-wm>	 RECOVERY - Host mw1448 is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms
[19:35:50] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2036.codfw.wmnet,service=ats-be
[19:35:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:54] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2036.codfw.wmnet,service=ats-tls
[19:35:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:01] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2036.codfw.wmnet,service=varnish-fe
[19:36:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:44] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=mw1450.eqiad.wmnet
[19:36:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:52] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=mw144[7-9].eqiad.wmnet
[19:36:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:47] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp102[5-6].eqiad.wmnet
[19:37:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:09] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=parse200[1-2].eqiad.wmnet
[19:38:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:17] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse200[1-2].eqiad.wmnet
[19:38:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:28] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse200[1-2].codfw.wmnet
[19:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:19] <mutante>	 !Log rebooting parsoid canary appservers - wtp1025, wtp1026, parse2001, parse2002
[19:44:05] <icinga-wm>	 PROBLEM - Host wtp1025 is DOWN: PING CRITICAL - Packet loss = 100%
[19:44:10] <mutante>	 !log rebooting parsoid canary appservers - wtp1025, wtp1026, parse2001, parse2002
[19:44:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:15] <icinga-wm>	 PROBLEM - Host wtp1026 is DOWN: PING CRITICAL - Packet loss = 100%
[19:44:47] <icinga-wm>	 RECOVERY - Host wtp1025 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[19:45:31] <icinga-wm>	 PROBLEM - Host parse2001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:45:41] <icinga-wm>	 RECOVERY - Host wtp1026 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[19:45:43] <icinga-wm>	 RECOVERY - Host parse2001 is UP: PING WARNING - Packet loss = 80%, RTA = 31.64 ms
[19:47:48] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=parse200[1-2].codfw.wmnet
[19:47:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:11] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp102[5-6].eqiad.wmnet
[19:48:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:16] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1012:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[20:18:43] <wikibugs>	 (PS8) Dzahn: geoip::maxmind: rename the update timers, don't use 'legacy' term [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464)
[20:19:15] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[20:19:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:09] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:22:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:21] <wikibugs>	 (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/34669/" [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn)
[20:27:27] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:27:41] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/34670/gitlab1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/776198 (https://phabricator.wikimedia.org/T285867) (owner: Jelto)
[20:29:36] <wikibugs>	 (CR) Dzahn: "noop on gitlab1001/2001" [puppet] - https://gerrit.wikimedia.org/r/776198 (https://phabricator.wikimedia.org/T285867) (owner: Jelto)
[20:34:17] <wikibugs>	 (PS1) Zabe: Pin CheckUser actor migration to old schema [mediawiki-config] - https://gerrit.wikimedia.org/r/776250 (https://phabricator.wikimedia.org/T233004)
[20:35:42] <wikibugs>	 (PS2) Zabe: Pin CheckUser actor migration to old schema [mediawiki-config] - https://gerrit.wikimedia.org/r/776250 (https://phabricator.wikimedia.org/T233004)
[20:59:22] <wikibugs>	 SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (Ottomata) Hello!  Yes! No policy statements here; we are in the 'feedback / alignment building' phase of talking about Shared Data Platform. :)  Data stored in Shared Data Platform is intended to be load...
[21:10:47] <icinga-wm>	 PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:16:45] <wikibugs>	 (03PS1) 10BBlack: Revert "Depool esams to test drmrs at full EMEA load" [dns] - 10https://gerrit.wikimedia.org/r/776045
[21:17:15] <wikibugs>	 (03PS2) 10BBlack: Revert "Depool esams to test drmrs at full EMEA load" [dns] - 10https://gerrit.wikimedia.org/r/776045 (https://phabricator.wikimedia.org/T304089)
[21:31:48] <wikibugs>	 (03PS1) 10Zabe: tests: rename $wmfConfigDir to $configDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776253 (https://phabricator.wikimedia.org/T45956)
[21:38:39] <wikibugs>	 (03PS1) 10Zabe: Start writing to $wmgUsingKubernetes the same value as to $wmfUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776254 (https://phabricator.wikimedia.org/T45956)
[21:38:41] <wikibugs>	 (03PS1) 10Zabe: Migrate $wmfUsingKubernetes to $wmgUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776255 (https://phabricator.wikimedia.org/T45956)
[21:38:43] <wikibugs>	 (03PS1) 10Zabe: Stop writing to $wmfUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776256 (https://phabricator.wikimedia.org/T45956)
[21:45:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:46:03] <wikibugs>	 (03PS1) 10Zabe: Start writing to $wmgUdp2logDest the same value as to $wmfUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776257 (https://phabricator.wikimedia.org/T45956)
[21:46:05] <wikibugs>	 (03PS1) 10Zabe: Migrate $wmfUdp2logDest to $wmgUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776258 (https://phabricator.wikimedia.org/T45956)
[21:46:07] <wikibugs>	 (03PS1) 10Zabe: Stop writing to $wmfUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776259 (https://phabricator.wikimedia.org/T45956)
[21:55:16] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Revert "Depool esams to test drmrs at full EMEA load" [dns] - 10https://gerrit.wikimedia.org/r/776045 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack)
[21:55:23] <wikibugs>	 (03PS3) 10BBlack: Revert "Depool esams to test drmrs at full EMEA load" [dns] - 10https://gerrit.wikimedia.org/r/776045 (https://phabricator.wikimedia.org/T304089)
[22:04:45] <bblack>	 !log esams re-pooled - T304089
[22:04:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:49] <stashbot>	 T304089: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089
[22:09:00] <wikibugs>	 (03PS4) 10Zabe: Add file mover user group for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774841 (https://phabricator.wikimedia.org/T304968) (owner: 10NguoiDungKhongDinhDanh)
[22:10:19] <wikibugs>	 (03CR) 10Zabe: [C: 03+1] Add file mover user group for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774841 (https://phabricator.wikimedia.org/T304968) (owner: 10NguoiDungKhongDinhDanh)
[22:10:45] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:21:39] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BBlack)
[22:26:13] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:36:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:39:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BBlack) Test concluded, and esams is re-pooled.  More analysis and planning to follow next week I'm sure, but the basic highlights are:  * We we...
[22:52:21] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[22:58:19] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[23:04:22] <icinga-wm>	 PROBLEM - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[23:04:31] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[23:06:10] <rzl>	 looking
[23:06:58] <bblack>	 <- here if you need anything
[23:07:35] <rzl>	 nah, it looks like just T291707 again -- doing the rolling restart in eqiad
[23:07:36] <stashbot>	 T291707: zotero paging / serving 5xxes after CPU spikes - https://phabricator.wikimedia.org/T291707
[23:07:51] <bblack>	 ack
[23:08:04] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: sync
[23:08:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:21] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: sync
[23:08:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:33] <icinga-wm>	 RECOVERY - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 197 bytes in 1.021 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[23:08:49] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[23:09:33] <rzl>	 graphs improving like magic :) I will be happier when we get that fixed
[23:11:14] <mutante>	 well.. I just got back in the door when it sent the resolved. ACK, it seemed like that thing you fixed yesterday
[23:11:39] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:12:20] <rzl>	 yeah, the swagger patch linked in that bug is the real fix, we'll be able to health-check those pods so they restart automatically when they get stuck, instead of paging us
[23:12:36] <rzl>	 well, the *real* fix would be rewriting zotero so it doesn't get stuck at all :) but the achievable one is that other thing
[23:13:20] <rzl>	 should land soon hopefully -- in the meantime, at least the rolling restart is quick 🙃
[23:13:57] <mutante>	 ACK!, thank you
[23:14:14] <rzl>	 still looks stable, I'm checking out again -- thanks bblack and mutante for being around
[23:16:46] <mutante>	 sees https://wikitech.wikimedia.org/wiki/Zotero#Rolling_restart in case we need it again
[23:19:15] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10Dzahn) Is it ok if we close this ticket and you just reopen it again once he is back?
[23:20:46] <wikibugs>	 (03PS1) 10Dzahn: add new language kcg - Tyap_language [dns] - 10https://gerrit.wikimedia.org/r/776266 (https://phabricator.wikimedia.org/T305279)
[23:20:56] <wikibugs>	 (03PS2) 10Dzahn: add new language kcg - Tyap_language [dns] - 10https://gerrit.wikimedia.org/r/776266 (https://phabricator.wikimedia.org/T305279)
[23:23:13] <Amir1>	 mutante: you don't need to create the patches, the bot does. And I think now it thinks all patches are made and won't create
[23:23:19] <Amir1>	 *any other ones either
[23:23:56] <Amir1>	 (the script runs every six hours)
[23:24:11] <mutante>	 hmm.. duly noted for next time
[23:24:19] <mutante>	 but too late now I guess..and might as well merge it?
[23:24:33] <Amir1>	 sure
[23:24:45] <mutante>	 ok
[23:25:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add new language kcg - Tyap_language [dns] - 10https://gerrit.wikimedia.org/r/776266 (https://phabricator.wikimedia.org/T305279) (owner: 10Dzahn)
[23:25:11] <mutante>	 !log DNS - new project language 'kcg'. 'Tyap is a regionally important dialect cluster of Plateau languages in Nigeria's Middle Belt, named after its prestige dialect. It is also known by its Hausa exonym as Katab or Kataf.' T305279
[23:25:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:14] <stashbot>	 T305279: Create Wikipedia Tyap - https://phabricator.wikimedia.org/T305279
[23:28:04] <mutante>	 Amir1: I think we need something like wikimediausergroups.org but shorter :)
[23:28:13] <mutante>	 and then put the requested user group wikis in that
[23:28:29] <mutante>	 Indic and Uzbek 
[23:29:05] <Amir1>	 mutante: yup but that needs to be approved by affcom first, that has been waiting on them for two years I think
[23:29:20] <mutante>	 hmm.. nod. ACK, ok
[23:31:30] <Amir1>	 running the script now
[23:31:44] <Amir1>	 let's hope it doesn't skip stuff
[23:32:05] <mutante>	 I was about to ping you to tell you the bot did it, heh
[23:32:29] <mutante>	 looks good afaict
[23:32:52] <Amir1>	 T305284
[23:32:53] <stashbot>	 T305284: Add kcgwiki to wikistats - https://phabricator.wikimedia.org/T305284
[23:32:55] <mutante>	 I also see a task for wikistats :)
[23:32:58] <mutante>	 thanks Amir1 
[23:33:26] <zabe>	 I think the patch for restbase could also be automated
[23:33:34] <mutante>	 will add it though once I get the mail from "newprojects" that happens when createwiki.sh runs
[23:33:50] <Amir1>	 thankfully it's smarter about gerrit patches, if you create a subticket or parent, it won't create any
[23:33:56] <Amir1>	 zabe: PR welcome ;)
[23:34:16] <Amir1>	 I think there was some complexity around placing the url but it's not too complicated
[23:34:38] <Amir1>	 what I hope is to have the configs in yaml files so we can make the init mw config patch automated
[23:34:55] <Amir1>	 it's ongoing, I'm happy
[23:38:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:40:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:41:24] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Pin CheckUser actor migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776250 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[23:41:59] <wikibugs>	 (03PS2) 10Ladsgroup: dbtools: Port switchover-tmpl to python [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670)