[00:18:17] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:27:11] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:32:11] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:38:35] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986840
[00:38:38] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986840 (owner: TrainBranchBot)
[01:00:24] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986840 (owner: TrainBranchBot)
[01:39:16] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[02:37:11] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:00] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:25:16] <wikibugs>	 (PS1) RLazarus: k8s-controller-sidecars: Bump to 1.0.2-2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/988130
[03:26:32] <wikibugs>	 (CR) RLazarus: [V: +2 C: +2] k8s-controller-sidecars: Bump to 1.0.2-2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/988130 (owner: RLazarus)
[04:18:18] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:30:42] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (andrea.denisse) Hi @KFrancis I hope you're doing well. I wanted to check if Dima has completed the NDA process with the Legal department of the WMF as it's a prerequisite to be added to the...
[05:39:16] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[05:39:34] <wikibugs>	 SRE, Maps, Traffic, serviceops: Allow Wikimedia Maps usage on Wikimedia Commons Android app - https://phabricator.wikimedia.org/T349280 (Nicolas_Raoul) Open→Resolved a:Nicolas_Raoul Actually just changing our Referer HTTP header to https://maps.wikimedia.org did the trick. 🙂
[07:13:34] <wikibugs>	 (PS1) DDesouza: research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/988131 (https://phabricator.wikimedia.org/T352583)
[07:25:21] <wikibugs>	 (PS1) Andrea Denisse: admin: Add dimakoushha to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/988132 (https://phabricator.wikimedia.org/T354276)
[07:51:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[08:13:16] <wikibugs>	 (PS1) Andrea Denisse: admin: Add arthurtaylor to restricted [puppet] - https://gerrit.wikimedia.org/r/988133 (https://phabricator.wikimedia.org/T354049)
[08:14:47] <wikibugs>	 (CR) CI reject: [V: -1] admin: Add arthurtaylor to restricted [puppet] - https://gerrit.wikimedia.org/r/988133 (https://phabricator.wikimedia.org/T354049) (owner: Andrea Denisse)
[08:16:14] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to <restricted> for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (andrea.denisse) I'm tagging @thcipriani  for his approval on this request, as he is approver for the 'restricted' group.
[08:18:18] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:21:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[08:31:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[09:02:09] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to <restricted> for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (Aklapper) > Wikimedia developer account username: Arthur Taylor  How / where was this account created? `ldapsearch -xxx cn="Arthur Taylor"` says `cn` and `sn` a...
[09:39:17] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:17:10] <wikibugs>	 (PS2) Slyngshede: C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254)
[10:18:19] <wikibugs>	 (CR) CI reject: [V: -1] C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254) (owner: Slyngshede)
[10:24:01] <wikibugs>	 (PS3) Slyngshede: C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254)
[10:25:11] <wikibugs>	 (CR) CI reject: [V: -1] C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254) (owner: Slyngshede)
[10:35:29] <wikibugs>	 (PS4) Slyngshede: C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254)
[10:36:39] <wikibugs>	 (CR) CI reject: [V: -1] C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254) (owner: Slyngshede)
[10:50:41] <wikibugs>	 (PS5) Slyngshede: C:raid::perccli Support compression of output. [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254)
[10:54:03] <wikibugs>	 (CR) Slyngshede: C:raid::perccli Support compression of output. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254) (owner: Slyngshede)
[11:45:04] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:45:24] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:48:28] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:52:44] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:53:00] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:53:04] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:18:18] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:44:32] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:45:28] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:39:16] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[13:41:26] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:48] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:27:11] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:32:11] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:11] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:57:11] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:03:05] <jinxer-wm>	 (ProbeDown) firing: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:03:54] <icinga-wm>	 PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV
[15:04:18] <icinga-wm>	 PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:50] <AntiComposite>	 upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111 on ticket.wikimedia.org
[15:14:49] <RhinosF1>	 #page ^
[15:14:55] <RhinosF1>	 Can someone trigger klaxon?
[15:17:35] <TheresNoTime>	 Paging 
[15:18:11] <RhinosF1>	 Let's open a task too
[15:19:16] <eoghan>	 I'm here, but only for a few minutes. Looking at it now
[15:19:34] <RhinosF1>	 https://phabricator.wikimedia.org/T354478
[15:19:40] <eoghan>	 Looks like it got OOM-killed.
[15:19:55] <jelto>	 I'll take a look. vrts burps happened in the past too
[15:20:03] <eoghan>	 Restarting apache now
[15:20:58] <icinga-wm>	 RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV
[15:21:22] <icinga-wm>	 RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:21:31] <eoghan>	 It's back now. 
[15:21:46] <AntiComposite>	 up now, thanks
[15:22:12] <jelto>	 great, thanks. I think puppet would also have repaired that in 30m. But good that apache is running again
[15:22:34] <wikibugs>	 SRE-OnFire, Znuny, collaboration-services: ticket.Wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 (RhinosF1)
[15:22:54] <TheresNoTime>	 thanks folx, hopefully not too disruptive for your Saturday!
[15:23:05] <jinxer-wm>	 (ProbeDown) resolved: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:23:43] <slyngs>	 That was quick :-)
[15:24:53] <eoghan>	 I was walking past my laptop when my phone went off. Good timing all round!
[15:25:07] <jelto>	 the pa.ge should resolve soon as well
[15:25:40] <sobanski>	 It's a manual page so I'll resolve it
[15:25:42] <RhinosF1>	 jelto: the pa.ge was manual
[15:26:01] <RhinosF1>	 I've created a task for upgrading the alerting to paging
[15:26:27] <jelto>	 ah yes you are right, thanks :) sobanski resolved the pa.ge
[15:27:34] <eoghan>	 RhinosF1: Thanks, I didn't realise it wasn't paging yet, it should be. 
[15:31:08] <eoghan>	 I'm stepping away again, if this happens again and we aren't around, it's likely that restarting the apache2/clamav processes will fix it if stopped. 
[15:34:54] <jelto>	 I'm also out again, yes worst case puppet will restart apache after 30 minutes
[16:02:33] <wikibugs>	 (PS1) Krinkle: Fix parsing logic when comments or hidden characters are present [extensions/Gadgets] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/987999 (https://phabricator.wikimedia.org/T354385)
[16:18:33] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:39:17] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[17:57:42] <wikibugs>	 (PS5) D3r1ck01: wmf-config: Remove StatsCacheType (unused) [mediawiki-config] - https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004)
[17:58:13] <wikibugs>	 (CR) D3r1ck01: "We plan to deploy this next week. @Timo, does this look good to you?" [mediawiki-config] - https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004) (owner: D3r1ck01)
[17:59:50] <wikibugs>	 (PS6) D3r1ck01: wmf-config: Remove unused wgStatsCacheType setting [mediawiki-config] - https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004)
[19:52:08] <wikibugs>	 SRE-OnFire, Znuny, collaboration-services: ticket.wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 (Reedy)
[20:18:33] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:27:11] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:29:01] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:39:17] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:18:09] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[22:26:15] <wikibugs>	 (CR) Krinkle: [C: +1] "LGTM. Codesearch is clean for "StatsCacheType" and recent changes that removed remaining usage have been deployed since." [mediawiki-config] - https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004) (owner: D3r1ck01)
[22:27:08] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[22:27:11] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:27:30] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[22:29:02] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:32:11] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable