[00:46:04] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:48:04] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:52:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:57:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:03:58] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:58] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:30:45] 503s/slow load times intermittently for me (Eastern US) and a Western US friend
[03:31:02] (ProbeDown) firing: (12) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:31:11] got a couple reports of upstream connect error... via discord, now apparently resolved
[03:31:14] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:31:27] ack
[03:31:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:31:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:32:03] (ProbeDown) firing: (14) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:32:58] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[03:33:14] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:34:58] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[03:36:02] (ProbeDown) resolved: (12) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:36:35] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:36:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:37:03] (ProbeDown) resolved: (14) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:56:57] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:19:46] !log T302486 : `[samtar@mwmaint1002 ~]$ mwscript maintenance/fixMergeHistoryCorruption.php --wiki enwiki --dry-run --ns 828`
[04:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:19:50] T302486: Run fixMergeHistoryCorruption.php on affected wikis - https://phabricator.wikimedia.org/T302486
[04:39:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:44:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:49:40] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:51:38] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:53:34] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (contint1001, ...), Fresh: 119 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:16:56] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:22:58] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:56:57] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221204T0800)
[08:01:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:01:38] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[08:03:32] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[08:05:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:06:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:16:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:24] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[08:21:24] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[09:01:42] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[09:21:42] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[09:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:14:58] (PS1) Majavah: P:openstack::designate: update firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863843
[10:15:00] (PS1) Majavah: P:openstack::keystone: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863844
[10:15:02] (PS1) Majavah: P:openstack::glance: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863845
[10:15:04] (PS1) Majavah: P:openstack::cinder: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863846
[10:15:06] (PS1) Majavah: P:openstack::trove: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863847
[10:15:08] (PS1) Majavah: P:openstack::radosgw: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863848
[10:15:10] (PS1) Majavah: P:openstack::barbican: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863849
[10:15:12] (PS1) Majavah: P:openstack::heat: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863850
[10:15:14] (PS1) Majavah: P:openstack::magnum: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863851
[10:15:16] (PS1) Majavah: P:openstack::neutron: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863852
[10:15:18] (PS1) Majavah: P:openstack::nova: metadata: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863853
[10:15:20] (PS1) Majavah: P:openstack::placement: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863854
[10:15:22] (PS1) Majavah: P:openstack::galera: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863855
[10:16:51] (CR) CI reject: [V: -1] P:openstack::designate: update firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863843 (owner: Majavah)
[10:27:05] (CR) CI reject: [V: -1] P:openstack::heat: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863850 (owner: Majavah)
[10:28:36] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[10:30:48] (CR) CI reject: [V: -1] P:openstack::magnum: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863851 (owner: Majavah)
[10:38:36] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[10:46:16] (CR) CI reject: [V: -1] P:openstack::placement: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863854 (owner: Majavah)
[10:52:32] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:53:13] (CR) CI reject: [V: -1] P:openstack::galera: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863855 (owner: Majavah)
[10:54:25] (PS2) Majavah: P:openstack::designate: update firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863843
[10:54:27] (PS2) Majavah: P:openstack::keystone: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863844
[10:54:29] (PS2) Majavah: P:openstack::glance: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863845
[10:54:31] (PS2) Majavah: P:openstack::cinder: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863846
[10:54:33] (PS2) Majavah: P:openstack::trove: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863847
[10:54:35] (PS2) Majavah: P:openstack::radosgw: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863848
[10:54:37] (PS2) Majavah: P:openstack::barbican: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863849
[10:54:39] (PS2) Majavah: P:openstack::heat: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863850
[10:54:41] (PS2) Majavah: P:openstack::magnum: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863851
[10:54:43] (PS2) Majavah: P:openstack::neutron: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863852
[10:54:45] (PS2) Majavah: P:openstack::nova: metadata: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863853
[10:54:47] (PS2) Majavah: P:openstack::placement: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863854
[10:54:49] (PS2) Majavah: P:openstack::galera: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863855
[10:56:21] (CR) Majavah: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38573/console" [puppet] - https://gerrit.wikimedia.org/r/863855 (owner: Majavah)
[10:58:27] (PS3) Majavah: P:openstack::keystone: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863844
[10:58:29] (PS3) Majavah: P:openstack::glance: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863845
[10:58:31] (PS3) Majavah: P:openstack::cinder: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863846
[10:58:33] (PS3) Majavah: P:openstack::trove: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863847
[10:58:35] (PS3) Majavah: P:openstack::radosgw: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863848
[10:58:37] (PS3) Majavah: P:openstack::barbican: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863849
[10:58:39] (PS3) Majavah: P:openstack::heat: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863850
[10:58:41] (PS3) Majavah: P:openstack::magnum: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863851
[10:58:43] (PS3) Majavah: P:openstack::neutron: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863852
[10:58:45] (PS3) Majavah: P:openstack::nova: metadata: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863853
[10:58:47] (PS3) Majavah: P:openstack::placement: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863854
[10:58:49] (PS3) Majavah: P:openstack::galera: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863855
[10:59:54] (CR) Majavah: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38574/console" [puppet] - https://gerrit.wikimedia.org/r/863855 (owner: Majavah)
[11:08:42] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[12:20:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:25:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:46:08] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:54:02] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:59:34] PROBLEM - SSH on db1122.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:24:28] PROBLEM - Check systemd state on ncredir5001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:35:40] (PS1) Majavah: puppetdb: support using client certificates [software/cumin] - https://gerrit.wikimedia.org/r/863874
[13:42:34] RECOVERY - Check systemd state on ncredir5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[14:00:24] RECOVERY - SSH on db1122.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:16:57] (CR) CI reject: [V: -1] puppetdb: support using client certificates [software/cumin] - https://gerrit.wikimedia.org/r/863874 (owner: Majavah)
[14:32:17] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[15:04:34] SRE, Product-Analytics, Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (Fuzzy) I have problems with my Search Console permissions. I cannot use "Request Indexing" tool. @jbond, can you check? Also, can you specifically add the "h...
[16:23:50] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 112 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:25:50] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:32:32] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[19:35:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:38:54] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[19:40:54] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[19:45:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:16:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:29:45] Ref T302486, I did a dry run against the Module namespace and it returned the two expected results: is there any reason I shouldn't plan to run that proper?
[20:29:46] T302486: Run fixMergeHistoryCorruption.php on affected wikis - https://phabricator.wikimedia.org/T302486
[20:51:46] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[21:16:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:23:43] TheresNoTime: I'd say ask a DBA if it was a lot of writes, but that looks fine
[21:24:26] I'd say plan
[21:24:37] Maybe not on a Sunday evening though, in case it does go weird
[21:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:49:10] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:51:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:32:32] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert