[00:18:17] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:27:11] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:32:11] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:38:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986840
[00:38:38] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986840 (owner: TrainBranchBot)
[01:00:24] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986840 (owner: TrainBranchBot)
[01:39:16] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[02:37:11] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:00] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:25:16] (PS1) RLazarus: k8s-controller-sidecars: Bump to 1.0.2-2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/988130
[03:26:32] (CR) RLazarus: [V: +2 C: +2] k8s-controller-sidecars: Bump to 1.0.2-2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/988130 (owner: RLazarus)
[04:18:18] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:30:42] SRE, LDAP-Access-Requests: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (andrea.denisse) Hi @KFrancis I hope you're doing well. I wanted to check if Dima has completed the NDA process with the Legal department of the WMF as it's a prerequisite to be added to the...
[05:39:16] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[05:39:34] SRE, Maps, Traffic, serviceops: Allow Wikimedia Maps usage on Wikimedia Commons Android app - https://phabricator.wikimedia.org/T349280 (Nicolas_Raoul) Open→Resolved a:Nicolas_Raoul Actually just changing our Referer HTTP header to https://maps.wikimedia.org did the trick. 🙂
[07:13:34] (PS1) DDesouza: research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/988131 (https://phabricator.wikimedia.org/T352583)
[07:25:21] (PS1) Andrea Denisse: admin: Add dimakoushha to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/988132 (https://phabricator.wikimedia.org/T354276)
[07:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[08:13:16] (PS1) Andrea Denisse: admin: Add arthurtaylor to restricted [puppet] - https://gerrit.wikimedia.org/r/988133 (https://phabricator.wikimedia.org/T354049)
[08:14:47] (CR) CI reject: [V: -1] admin: Add arthurtaylor to restricted [puppet] - https://gerrit.wikimedia.org/r/988133 (https://phabricator.wikimedia.org/T354049) (owner: Andrea Denisse)
[08:16:14] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (andrea.denisse) I'm tagging @thcipriani for his approval on this request, as he is approver for the 'restricted' group.
[08:18:18] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[08:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[09:02:09] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (Aklapper) > Wikimedia developer account username: Arthur Taylor How / where was this account created? `ldapsearch -xxx cn="Arthur Taylor"` says `cn` and `sn` a...
[09:39:17] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:17:10] (PS2) Slyngshede: C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254)
[10:18:19] (CR) CI reject: [V: -1] C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254) (owner: Slyngshede)
[10:24:01] (PS3) Slyngshede: C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254)
[10:25:11] (CR) CI reject: [V: -1] C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254) (owner: Slyngshede)
[10:35:29] (PS4) Slyngshede: C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254)
[10:36:39] (CR) CI reject: [V: -1] C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254) (owner: Slyngshede)
[10:50:41] (PS5) Slyngshede: C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254)
[10:54:03] (CR) Slyngshede: C:raid::perccli Support compression of output. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254) (owner: Slyngshede)
[11:45:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:45:24] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:48:28] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:52:44] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:53:00] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:53:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:18:18] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:44:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:45:28] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:39:16] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[13:41:26] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:48] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:27:11] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:32:11] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:11] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:57:11] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:03:05] (ProbeDown) firing: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:03:54] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV
[15:04:18] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:50] upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111 on ticket.wikimedia.org
[15:14:49] #page ^
[15:14:55] Can someone trigger klaxon?
[15:17:35] Paging
[15:18:11] Let's open a task too
[15:19:16] I'm here, but only for a few minutes. Looking at it now
[15:19:34] https://phabricator.wikimedia.org/T354478
[15:19:40] Looks like it got OOM-killed.
[15:19:55] I'll take a look. vrts burps happened in the past too
[15:20:03] Restarting apache now
[15:20:58] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV
[15:21:22] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:21:31] It's back now.
[15:21:46] up now, thanks
[15:22:12] great thanks. I think puppet would have repaired that also in 30m. But good apache is running again
[15:22:34] SRE-OnFire, Znuny, collaboration-services: ticket.Wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 (RhinosF1)
[15:22:54] thanks folx, hopefully not too disruptive for your Saturday!
[15:23:05] (ProbeDown) resolved: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:23:43] That was quick :-)
[15:24:53] I was walking past my laptop when my phone went off. Good timing all round!
[15:25:07] the pa.ge should resolve soon as well
[15:25:40] It's a manual page so I'll resolve it
[15:25:42] jelto: the pa.ge was manual
[15:26:01] I've created a task for upgrading the alerting to paging
[15:26:27] ah yes you are right, thanks :) sobanski resolved the pa.ge
[15:27:34] RhinosF1: Thanks, I didn't realise it wasn't paging yet, it should be.
[15:31:08] I'm stepping away again, if this happens again and we aren't around, it's likely that restarting the apache2/clamav processes will fix it if stopped.
[15:34:54] I'm also out again, yes worst case puppet will restart apache after 30 minutes
[16:02:33] (PS1) Krinkle: Fix parsing logic when comments or hidden characters are present [extensions/Gadgets] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/987999 (https://phabricator.wikimedia.org/T354385)
[16:18:33] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:39:17] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[17:57:42] (PS5) D3r1ck01: wmf-config: Remove StatsCacheType (unused) [mediawiki-config] - https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004)
[17:58:13] (CR) D3r1ck01: "We plan to deploy this next week. @Timo, does this look good to you?" [mediawiki-config] - https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004) (owner: D3r1ck01)
[17:59:50] (PS6) D3r1ck01: wmf-config: Remove unused wgStatsCacheType setting [mediawiki-config] - https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004)
[19:52:08] SRE-OnFire, Znuny, collaboration-services: ticket.wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 (Reedy)
[20:18:33] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:27:11] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:29:01] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:39:17] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:18:09] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[22:26:15] (CR) Krinkle: [C: +1] "LGTM. Codesearch is clean for "StatsCacheType" and recent changes that removed remaining usage have been deployed since." [mediawiki-config] - https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004) (owner: D3r1ck01)
[22:27:08] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[22:27:11] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:27:30] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[22:29:02] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:32:11] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable