[00:03:44] PROBLEM - Check unit status of prune_old_srv_syslog_directories on centrallog2002 is CRITICAL: CRITICAL: Status of the systemd unit prune_old_srv_syslog_directories https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:06:50] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:06:52] PROBLEM - Check systemd state on centrallog2002 is CRITICAL: CRITICAL - degraded: The following units failed: prune_old_srv_syslog_directories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:28] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:36] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[00:41:06] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:47:32] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 65 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:08:08] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:20:24] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:50] not sure if this is the right place to ask, but is anyone around who understands trusted XFF?
I've got a checkuser queue complaint from a zscaler rep that they're blocked, but ta.avi and ur.banecm merged a patch last month that added the ranges they gave me to trusted xff, so...not sure where to send them for help
[01:37:32] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:39:32] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:40:36] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:52:18] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 65 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:59:16] (been helped in DM)
[02:09:56] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:16:18] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 64 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:37:03] (PS1) RLazarus: miscweb: Update envoy to 1.15.5-1 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/766208 (https://phabricator.wikimedia.org/T300324)
[03:37:05] (PS1) RLazarus: miscweb: Update envoy to 1.15.5-1 [deployment-charts] - https://gerrit.wikimedia.org/r/766209 (https://phabricator.wikimedia.org/T300324)
[03:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[04:44:29] PROBLEM - Host text-lb.drmrs.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:45:29] PROBLEM - Host ncredir-lb.drmrs.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[04:45:29] PROBLEM - Host netflow6001 is DOWN: PING CRITICAL - Packet loss = 100%
[04:45:29] PROBLEM - Host ncredir6002 is DOWN: PING CRITICAL - Packet loss = 100%
[04:45:29] PROBLEM - Host ncredir-lb.drmrs.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:45:29] PROBLEM - Host ncredir6001 is DOWN: PING CRITICAL - Packet loss = 100%
[04:45:30] 👋
[04:45:30] PROBLEM - Host cr2-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[04:45:41] paged but it's just drmrs, I assume safe to ignore
[04:45:44] PROBLEM - Host prometheus6001 is DOWN: PING CRITICAL - Packet loss = 100%
[04:46:12] PROBLEM - Host install6001 is DOWN: PING CRITICAL - Packet loss = 100%
[04:46:48] PROBLEM - Host upload-lb.drmrs.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:46:50] PROBLEM - Host bast6001 is DOWN: PING CRITICAL - Packet loss = 100%
[04:46:52] PROBLEM - Host cr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:46:53] PROBLEM - Host cr1-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[04:47:01] PROBLEM - Host upload-lb.drmrs.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[04:47:01] PROBLEM - Host text-lb.drmrs.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[04:48:04] PROBLEM - Host cr2-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:49:32] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:50:38] PROBLEM - Host asw1-b12-drmrs.wikimedia.org IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:50:38] PROBLEM - Host asw1-b13-drmrs.wikimedia.org IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:50:44] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:51:20] PROBLEM - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:52:53] yep looks like a bunch of icinga downtimes expired within the last couple of days, re-downtiming I guess
[04:53:26] (ProbeHttpFailed) firing: (9) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[04:53:29] RECOVERY - Host upload-lb.drmrs.wikimedia.org is UP: PING WARNING - Packet loss = 71%, RTA = 85.22 ms
[04:53:30] RECOVERY - Host text-lb.drmrs.wikimedia.org is UP: PING WARNING - Packet loss = 71%, RTA = 86.37 ms
[04:53:30] RECOVERY - Host ncredir6001 is UP: PING OK - Packet loss = 0%, RTA = 85.70 ms
[04:53:30] RECOVERY - Host prometheus6001 is UP: PING OK - Packet loss = 0%, RTA = 85.50 ms
[04:53:32] RECOVERY - Host bast6001 is UP: PING OK - Packet loss = 0%, RTA = 85.61 ms
[04:53:33] RECOVERY - Host upload-lb.drmrs.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.01 ms
[04:53:33] RECOVERY - Host text-lb.drmrs.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.21 ms
[04:53:34] RECOVERY - Host cr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 85.87 ms
[04:53:34] RECOVERY - Host cr2-drmrs is UP: PING OK - Packet loss = 0%, RTA = 86.61 ms
[04:53:35] RECOVERY - Host ncredir6002 is UP: PING OK - Packet loss = 0%, RTA = 85.63 ms
[04:56:13] PROBLEM - LVS upload drmrs port 80/tcp - Images and other media- upload.eqiad.wikimedia.org IPv4 #page on upload-lb.drmrs.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:57:00] PROBLEM - Host cr1-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[04:57:01] PROBLEM - Host cr2-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[04:57:14] PROBLEM - Host ncredir6001 is DOWN: PING CRITICAL - Packet loss = 100%
[04:57:28] PROBLEM - Host bast6001 is DOWN: PING CRITICAL - Packet loss = 100%
[04:57:38] PROBLEM - Host text-lb.drmrs.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:57:56] Is drmrs serving real traffic?
[04:58:09] not yet
[05:01:44] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs01_sync.service,netbox_ganeti_drmrs02_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:03:30] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:07:24] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[05:08:25] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[05:08:43] (ProbeHttpFailed) resolved: (9) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[05:09:04] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:26:10] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:36:53] I cannot access the domain pt.wikipedia.org, I was accessing it just now and when I tried to enter another page on the domain it is not loading
[05:37:27] Is there an error on the servers?
[05:39:14] Hello?
[05:39:54] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:43:27] It seems that only the wikipedia.org domain is not loading, the others are working. Certificates and cookies load, but the page does not...
[05:48:02] Forget it, mine was just a glitch on my internet
[05:52:36] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:59:36] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:07:53] SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (AndyRussG) Thanks so much once again @SCherukuwada and @jcrespo for your careful attention to all these important details!!!! :) :)
[06:53:30] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:19:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:24:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1005:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:25:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:29:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (3) Blazegraph instance wdqs1005:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:30:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:34:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (3) Blazegraph instance wdqs1005:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:36:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (3) Blazegraph instance wdqs1003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:39:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (4) Blazegraph instance wdqs1005:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:41:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1009:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:41:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (3) Blazegraph instance wdqs1003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:44:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (3) Blazegraph instance wdqs1005:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:44:46] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (3) Blazegraph instance wdqs1004:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:46:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1009:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:49:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (5) Blazegraph instance wdqs1004:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:49:46] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (5) Blazegraph instance wdqs1004:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:51:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1009:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:52:46] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (3) Blazegraph instance wdqs1011:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[07:54:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (4) Blazegraph instance wdqs1004:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:57:46] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1011:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:59:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (6) Blazegraph instance wdqs1004:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:00:46] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1008:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:04:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (7) Blazegraph instance wdqs1004:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:05:46] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1008:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:09:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (3) Blazegraph instance wdqs1008:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:09:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (6) Blazegraph instance wdqs1007:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:10:10] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:14:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (3) Blazegraph instance wdqs1008:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:23:48] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:41:20] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:55:02] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:00:36] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[10:40:12] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:18] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:11:04] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:24:44] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:25:14] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:38:54] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:45:16] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[12:24:22] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:25:02] I'm having issues trying to connect to the sites; anyone else with the same issue?
[12:25:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[12:27:06] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204,205} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[12:27:14] Yes
[12:28:31] At least -- I had some issues a few minutes ago. Seems better now
[12:29:30] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[12:30:29] SRE, User-Ladsgroup, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (alaa) Happened again between 12:25 and 12:28 UTC but things are back to normal now.
[12:30:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[12:31:52] Daimona: thanks :) I got few 503s
[12:32:27] I got https://phabricator.wikimedia.org/T301505 but just for a couple of minutes as the last comment says
[12:34:36] oh yup, got that one too once
[13:00:36] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[13:08:58] SRE, User-Ladsgroup, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (Func)
[13:28:48] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:29:34] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:38:02] (PS1) Volans: setup.py: temporary limit prospector version [cookbooks] - https://gerrit.wikimedia.org/r/766227
[13:42:34] (CR) Volans: [C: +2] "Merging to allow CI to run on other patches. Will revert once upstream has a fix." [cookbooks] - https://gerrit.wikimedia.org/r/766227 (owner: Volans)
[13:43:00] (Abandoned) Volans: Test change [cookbooks] - https://gerrit.wikimedia.org/r/766197 (owner: Razzi)
[13:45:08] (Merged) jenkins-bot: setup.py: temporary limit prospector version [cookbooks] - https://gerrit.wikimedia.org/r/766227 (owner: Volans)
[13:45:36] (PS12) Volans: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[13:46:28] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:52] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:49:53] (CR) Volans: Add cookbooks for running maintain-views (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[13:53:35] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:55:12] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:08:54] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:14:32] (PS1) Zabe: Write the same value to $wmgDatacenter(s) as to $wmfDatacenter(s) [mediawiki-config] - https://gerrit.wikimedia.org/r/766229 (https://phabricator.wikimedia.org/T45956)
[14:26:10] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:32:11] (PS1) Filippo Giunchedi: smokeping: temp mute drmrs [puppet] - https://gerrit.wikimedia.org/r/766230
[14:32:39] sigh, smokeping is spamming and I'm muting drmrs
[14:33:15] (CR) Filippo Giunchedi: [C: +2] smokeping: temp mute drmrs [puppet] - https://gerrit.wikimedia.org/r/766230 (owner: Filippo Giunchedi)
[14:39:52] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:02:08] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:38] SRE, User-Ladsgroup, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (Wargo) Where is documentation of this issue? The returns make think it is not resolved.
[15:09:16] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:42:34] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:38] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[16:25:16] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:27:50] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:35:00] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:39:00] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:41:20] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:42:00] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:56:22] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:00:36] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[17:10:08] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:10:06] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:23:50] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:40:48] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:41:10] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:43:05] SRE, Wiki Loves Monuments 2022, Wikimedia-Mailing-lists, User-Ladsgroup: Request for creation: WLM-Network Mailing List - https://phabricator.wikimedia.org/T302510 (Ciell) Thanks!
[18:54:58] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[20:26:06] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:39:54] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:40:24] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:43:34] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:54:12] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:00:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[22:11:12] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:25:02] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:25:24] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:39:12] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[23:56:20] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers