[00:05:26] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[00:24:42] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:24:48] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:25:26] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[00:35:26] <jinxer-wm>	 (RoutinatorRsyncErrors) resolved: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[00:38:46] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009403
[00:38:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009403 (owner: 10TrainBranchBot)
[01:02:36] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009403 (owner: 10TrainBranchBot)
[01:03:48] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:03:55] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:07:11] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:09:31] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:09:37] <wikibugs>	 (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009859 (https://phabricator.wikimedia.org/T349774)
[01:09:38] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:13:16] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:13:23] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:27:54] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:28:00] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:36:13] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:36:19] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:39:07] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:39:14] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:22:11] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:32:46] <wikibugs>	 06SRE, 10Observability-Alerting: Add monitoring for performance.wikimedia.org - https://phabricator.wikimedia.org/T277927#9618148 (10Krinkle)
[02:37:13] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:39] <icinga-wm>	 RECOVERY - Disk space on mw2281 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2281&var-datasource=codfw+prometheus/ops
[02:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:54:28] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:54:35] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:11:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:12:13] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:14:25] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:34:54] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:35:01] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:35:23] <icinga-wm>	 PROBLEM - Host an-worker1168 is DOWN: PING CRITICAL - Packet loss = 100%
[04:40:53] <icinga-wm>	 RECOVERY - Host an-worker1168 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[04:45:01] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9618274 (10Himejijo) @Marostegui I'm sorry I'm still confused about which is which.   Is it associated with the rkhan-ctr@wikipedia.org address?
[04:49:17] <icinga-wm>	 RECOVERY - Host ripe-atlas-ulsfo is UP: PING WARNING - Packet loss = 60%, RTA = 30.25 ms
[04:55:41] <icinga-wm>	 PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[05:06:15] <icinga-wm>	 RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 77%, RTA = 1.23 ms
[05:07:26] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:12:39] <icinga-wm>	 PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[05:14:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:19:55] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:23:16] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9618279 (10Marostegui) >>! In T359490#9618274, @Himejijo wrote: > @Marostegui I'm sorry I'm still confused about which is which.  >  > Is it associated with the rkhan-ctr@wikipedia.org ad...
[05:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:21:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:22:11] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:26:15] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:26:37] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:19] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9618300 (10Himejijo) Thank you I appreciate you taking a look.   I'll point her to this on Monday; though isn't that the email listed in the original ticket?  Is it that either account ne...
[07:04:21] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:12:14] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:13:44] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9618302 (10Himejijo) Thank you I appreciate you taking a look.   I'll point her to this on Monday; though isn't that the email listed in the original ticket?  Is it that either account ne...
[07:48:07] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:48:17] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240310T0800)
[08:49:40] <jinxer-wm>	 (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[08:52:01] <icinga-wm>	 PROBLEM - Host an-worker1168 is DOWN: PING CRITICAL - Packet loss = 100%
[08:57:05] <icinga-wm>	 RECOVERY - Host an-worker1168 is UP: PING WARNING - Packet loss = 50%, RTA = 0.29 ms
[09:07:11] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:20:10] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:41:44] <jinxer-wm>	 (HaproxyUnavailable) firing: (2) HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[09:46:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: (2) HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[09:47:46] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[09:49:59] <XioNoX>	 hotlinking to /wikipedia/commons/3/3d/Amir_Hussain_Armless_Cricketer.jpg
[09:50:03] <XioNoX>	 went up and now going down
[09:52:46] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[10:22:11] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:33:01] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9618338 (10Marostegui) Yes but on the original ticket your username listed is Himejijo, which doesn't match with that email. That's why I believe you just need to confirm that that userna...
[11:12:28] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:19:55] <jinxer-wm>	 (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:23:25] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:29:27] <icinga-wm>	 PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[12:49:55] <jinxer-wm>	 (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[13:07:11] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:49:51] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=codfw,name=restbase1034.eqiad.wmnet
[13:49:51] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=codfw,name=restbase1035.eqiad.wmnet
[13:49:52] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=codfw,name=restbase1036.eqiad.wmnet
[13:49:52] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=codfw,name=restbase1037.eqiad.wmnet
[13:49:53] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=codfw,name=restbase1038.eqiad.wmnet
[13:49:53] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=codfw,name=restbase1039.eqiad.wmnet
[13:49:54] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=codfw,name=restbase1040.eqiad.wmnet
[13:49:54] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=codfw,name=restbase1041.eqiad.wmnet
[13:49:55] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=codfw,name=restbase1042.eqiad.wmnet
[13:51:30] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1034.eqiad.wmnet
[13:51:32] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1035.eqiad.wmnet
[13:51:34] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1036.eqiad.wmnet
[13:51:37] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1037.eqiad.wmnet
[13:51:39] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1038.eqiad.wmnet
[13:51:41] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1039.eqiad.wmnet
[13:51:48] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1040.eqiad.wmnet
[13:51:51] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1041.eqiad.wmnet
[13:51:53] <logmsgbot>	 !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1042.eqiad.wmnet
[14:03:23] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for 9 hosts
[14:03:27] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 9 hosts
[14:18:58] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1019.eqiad.wmnet with reason: Decommissioning — T354561
[14:19:03] <stashbot>	 T354561: Decommission restbase10[19-27] - https://phabricator.wikimedia.org/T354561
[14:19:12] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1019.eqiad.wmnet with reason: Decommissioning — T354561
[14:22:11] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:37:14] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:57:14] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:27] <icinga-wm>	 RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[15:23:25] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:24:11] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 86 probes of 805 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:29:11] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 10 probes of 805 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:21:13] <wikibugs>	 06SRE: Kazakhstan changed its time zone to a single UTC+5 - https://phabricator.wikimedia.org/T359767#9618492 (10bd808) The WMF production Debian Buster servers appear to have tzdata 2021a-0+deb10u12 installed. The changelog for this package does not appear to include Kazakhstan DST changes later than: ` Release...
[16:24:19] <icinga-wm>	 PROBLEM - Host an-worker1168 is DOWN: PING CRITICAL - Packet loss = 100%
[16:36:41] <icinga-wm>	 RECOVERY - Host an-worker1168 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[16:43:44] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:43:50] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:49:55] <jinxer-wm>	 (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[17:07:11] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:07:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.11% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:12:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:23:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:28:55] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:22:11] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:26:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:36:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:49:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 43.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:57:28] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:59:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.06% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:23:57] <icinga-wm>	 RECOVERY - Host ripe-atlas-ulsfo is UP: PING WARNING - Packet loss = 75%, RTA = 30.28 ms
[19:30:21] <icinga-wm>	 PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[19:46:41] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:50:49] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[20:15:27] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on an-worker1168 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:49:55] <jinxer-wm>	 (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[21:07:11] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:12:09] <wikibugs>	 06SRE, 10Beta-Cluster-Infrastructure, 07Beta-Cluster-reproducible: upload.wikimedia.beta.wmflabs.org: cannot find server - https://phabricator.wikimedia.org/T359768#9618681 (10AlexisJazz)
[21:29:10] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:22:11] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:28:32] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:28:39] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:57:28] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:26:33] <icinga-wm>	 PROBLEM - Disk space on centrallog1002 is CRITICAL: DISK CRITICAL - free space: /srv 54107 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops
[23:28:55] <jinxer-wm>	 (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:32:25] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed