[00:04:08] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[00:05:46] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[00:14:34] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:15:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:19:04] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:19:22] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:44:58] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:54:22] SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) p:Triage→Medium
[01:08:15] SRE, Data-Engineering, Product-Analytics, Wmfdata-Python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (nshahquinn-wmf) a:nshahquinn-wmf→None This shouldn't be assigned to me; I've never had a concrete plan to work on it.
[01:37:45] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (9) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:57:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:31:00] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:31:22] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:32:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:32:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49122 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:37:40] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:05:04] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:58:14] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:33:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:35] (PS2) Samtar: Remove Beta Feature for Realtime Preview and enable on plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/868816 (https://phabricator.wikimedia.org/T323033) (owner: Samwilson)
[05:13:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:18:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:22:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:33:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:37:40] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:38:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:11:04] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:06] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:40] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:50] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:57:22] SRE, ops-ulsfo, decommission-hardware: decommission atlas-ulsfo - https://phabricator.wikimedia.org/T325824 (ayounsi) I emailed the RIPE to let them know this anchor is definitively offline.
[07:58:14] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221223T0800)
[08:00:50] SRE, Infrastructure-Foundations, netops: Juniper QFX5120 error logs on lsw1-e1 and lsw1-f1: Failed to get ifl for ifl index - https://phabricator.wikimedia.org/T325801 (ayounsi) Thanks! Do you think it's worth asking in which release it's going to be fixed? To me, flooding logs with `ERROR` severity...
[08:13:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:14:22] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:27:34] SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (MatthewVernon) [mostly a note for future reference] Fri 23rd is a working day in Europe, so in an ideal world this change would have been deployed around end of European working day....
[08:29:20] SRE, Observability-Alerting, Incident Tooling, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (taavi)
[09:26:31] (CR) David Caro: tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config (2 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: Raymond Ndibe)
[09:37:19] (PS3) David Caro: alertmanager: format a bit nicer the default args [puppet] - https://gerrit.wikimedia.org/r/868634 (https://phabricator.wikimedia.org/T323714)
[09:37:21] (PS3) David Caro: karma: add metrcsinfra alertmanager [puppet] - https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714)
[09:40:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:45:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:47:12] (CR) Marostegui: mariadb: Decommission db1176 & db2151 to spare; remove mediabackupstemp (2 comments) [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[09:47:57] (CR) Jcrespo: [C: -2] mariadb: Decommission db1176 & db2151 to spare; remove mediabackupstemp (2 comments) [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[09:48:51] (CR) Jcrespo: [C: -2] "Actually they will be disabled because of the spare role, but I can keep the lines for when it has a definitive role." [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[09:49:48] (CR) Marostegui: mariadb: Decommission db1176 & db2151 to spare; remove mediabackupstemp (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[09:50:19] (PS3) Jcrespo: mariadb: Decommission db1176 & db2151 to spare; remove mediabackupstemp [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582)
[09:50:43] (CR) Majavah: [C: -1] "I don't think the backend class should know anything about actually reading the config file. Instead, the config file should be read in sc" [software/tools-webservice] - https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: Raymond Ndibe)
[09:51:11] (CR) Jcrespo: mariadb: Decommission db1176 & db2151 to spare; remove mediabackupstemp (2 comments) [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[10:05:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:22:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:30:20] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:30:38] (CR) David Caro: tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: Raymond Ndibe)
[10:37:40] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:55:25] SRE, Infrastructure-Foundations, netops: Juniper QFX5120 error logs on lsw1-e1 and lsw1-f1: Failed to get ifl for ifl index - https://phabricator.wikimedia.org/T325801 (cmooney) Yeah good idea I'll chase up on that with them also.
[10:59:25] (CR) JMeybohm: sre.k8s.pool-depool-cluster: handle active/passive services (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[11:26:28] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:34:54] (CR) Cathal Mooney: [C: +1] "LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/870904 (https://phabricator.wikimedia.org/T314929) (owner: Ayounsi)
[11:58:14] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[11:58:46] (CR) Jcrespo: "Data check done + backup (while also testing https://gerrit.wikimedia.org/r/c/operations/software/wmfbackups/+/863264 ), ok to deploy?" [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[11:59:06] (CR) Jcrespo: [C: +1] mariadb: Decommission db1176 & db2151 to spare; remove mediabackupstemp [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[12:00:08] (Abandoned) MarcoAurelio: Translate TimedText namespaces for iswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/824444 (https://phabricator.wikimedia.org/T315715) (owner: MarcoAurelio)
[12:03:36] (CR) Marostegui: [C: +1] mariadb: Decommission db1176 & db2151 to spare; remove mediabackupstemp [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[12:07:46] (CR) Jcrespo: [C: +2] mariadb: Decommission db1176 & db2151 to spare; remove mediabackupstemp [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[12:24:30] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:24:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:26:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.460 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:27:34] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:17:40] SRE-OnFire, Gerrit, serviceops-collab, Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (Jelto) thanks @hashar for the h2 cleanup. The filesystem usage on `/var/lib/gerrit2` went down from 75% to...
[13:18:13] SRE-OnFire, Gerrit, serviceops-collab, Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (Jelto)
[13:32:00] SRE-OnFire, Gerrit, serviceops-collab, Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (Jelto)
[13:49:05] SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ayounsi)
[13:50:27] SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ayounsi) Marking this task dependent on DHCP option 97 to reduce the risk of DHCP oddities related to Option 82.
[13:50:58] SRE, Infrastructure-Foundations, netops: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (cmooney) Open→Resolved Closing for now, consensus seems to be to keep status quo.
[13:54:11] SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Prometheus & librenms differences in traffic graphs - https://phabricator.wikimedia.org/T320395 (cmooney) Open→Resolved a:cmooney I'm going to close this one, I believe the differences are to be expected and there...
[14:01:21] (PS2) Gehel: ApiFeatureUsage logstash servers are owned by Observability. [puppet] - https://gerrit.wikimedia.org/r/869582 (https://phabricator.wikimedia.org/T325880)
[14:22:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:28:48] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:31:32] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:33:14] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[14:37:40] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:37:48] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:43:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:48:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:58:33] (PS4) MVernon: swift: move accounts_keys to common hiera [puppet] - https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123)
[15:01:55] (PS3) MVernon: hiera: move swift accounts_keys into common [labs/private] - https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123)
[16:05:56] (PS1) Jcrespo: WIP:bacula:Create a parallel backup director on codfw [puppet] - https://gerrit.wikimedia.org/r/871180 (https://phabricator.wikimedia.org/T313582)
[16:48:30] (PS5) MVernon: swift: move accounts_keys to common hiera [puppet] - https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123)
[16:48:40] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:48:55] (PS9) MVernon: swift: add swift::rclone [puppet] - https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123)
[17:44:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:49:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:22:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:33:14] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[18:35:18] (CR) BCornwall: [C: +2] admin: Add kelhurd to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/870708 (https://phabricator.wikimedia.org/T323943) (owner: BCornwall)
[18:37:09] SRE, SRE-Access-Requests, Security-Team, Patch-For-Review, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (BCornwall)
[18:37:21] SRE, SRE-Access-Requests, Security-Team, Patch-For-Review, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (BCornwall) Is anything left to do for this ticket?
[18:37:40] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:08:28] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:12:00] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[19:13:30] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[19:18:58] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:19:28] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:54:56] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[19:56:24] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[21:27:40] (NodeTextfileStale) resolved: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:22:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:33:14] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[23:00:40] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Xaosflux) Current new example: https://e...
[23:13:30] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[23:21:30] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[23:32:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:37:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable