[00:00:33] (DatasourceError) firing: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:02:38] (PS6) Acamicamacaraca: Add "editautopatrolprotected", "rollback" and "patrol" protection levels on shwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306)
[00:02:51] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[00:03:10] (CR) Acamicamacaraca: Add "editautopatrolprotected", "rollback" and "patrol" protection levels on shwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: Acamicamacaraca)
[00:05:19] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for codfw spine switch overlay IRBs. - cmooney@cumin1001"
[00:05:33] (DatasourceError) firing: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:06:16] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for codfw spine switch overlay IRBs. - cmooney@cumin1001"
[00:06:16] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:10:33] (DatasourceError) resolved: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:12:26] (PS7) Acamicamacaraca: Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306)
[00:13:04] (CR) CI reject: [V: -1] Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: Acamicamacaraca)
[00:15:57] (PS8) Acamicamacaraca: Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306)
[00:16:32] (CR) CI reject: [V: -1] Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: Acamicamacaraca)
[00:23:34] (PS9) Acamicamacaraca: Add "editautopatrolprotected" and "editpatrolprotected" protection levels on shwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306)
[00:25:46] (CR) Acamicamacaraca: "Shoud be updated now!" [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: Acamicamacaraca)
[00:39:08] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/953499
[00:39:14] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/953499 (owner: TrainBranchBot)
[00:43:15] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:57] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/953499 (owner: TrainBranchBot)
[01:09:31] PROBLEM - snapshot of s6 in eqiad on backupmon1001 is CRITICAL: snapshot for s6 at eqiad (db1225) taken more than 3 days ago: Most recent backup 2023-08-30 01:04:15 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:03:57] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:03:57] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:35:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:40:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:13:15] (PS1) Ryan Kemper: wdqs: silence alerts on new hosts [puppet] - https://gerrit.wikimedia.org/r/954350 (https://phabricator.wikimedia.org/T345475)
[05:32:18] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1008.eqiad.wmnet
[05:32:55] yep, 'tis me, the maintenance window for the misc dumps worker and nfs share are on the weekend
[05:38:55] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1008.eqiad.wmnet
[05:39:32] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1003.eqiad.wmnet
[05:45:43] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1003.eqiad.wmnet
[06:01:19] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:57] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:03:58] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:16:53] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:22:35] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:55:17] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:56:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:00:05] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:35] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:03:58] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:08:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:15:01] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:18:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:27:15] (PS1) Majavah: alertmanager: re-add space to irc messages [puppet] - https://gerrit.wikimedia.org/r/954355
[15:28:16] (PS2) Majavah: alertmanager: re-add space to irc messages [puppet] - https://gerrit.wikimedia.org/r/954355
[15:46:34] PROBLEM - MariaDB Replica Lag: s1 #page on db1128 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 858.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:47:37] I'm on phone
[15:47:44] Can someone depool it
[15:48:27] on it
[15:49:04] !log sukhe@cumin2002 dbctl commit (dc=all): 'Depool db1128', diff saved to https://phabricator.wikimedia.org/P52244 and previous config saved to /var/cache/conftool/dbconfig/20230902-154903-sukhe.json
[15:49:33] Amir1: anything else that needs to be done here? should I downtime it as well?
[15:50:06] Yeah downtime
[15:50:11] Please
[15:50:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1128.eqiad.wmnet with reason: depooled after replica lag page, two days
[15:51:07] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1128.eqiad.wmnet with reason: depooled after replica lag page, two days
[15:52:28] Thanks
[15:52:31] Amir1: all done
[17:06:35] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[17:23:15] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 3.67 ms
[18:18:58] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:18:58] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:35:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:40:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT replicasets) on k8s@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency