[00:46:04] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:48:04] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:52:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:57:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:03:58] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:58] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:30:45] 503s/slow load times intermittently for me (Eastern US) and a Western US friend
[03:31:02] (ProbeDown) firing: (12) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:31:11] got a couple reports of upstream connect error... via discord, now apparently resolved
[03:31:14] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:31:27] ack
[03:31:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:31:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:32:03] (ProbeDown) firing: (14) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:32:58] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[03:33:14] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:34:58] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[03:36:02] (ProbeDown) resolved: (12) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:36:35] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:36:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:37:03] (ProbeDown) resolved: (14) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:56:57] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:19:46] !log T302486 : `[samtar@mwmaint1002 ~]$ mwscript maintenance/fixMergeHistoryCorruption.php --wiki enwiki --dry-run --ns 828`
[04:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:19:50] T302486: Run fixMergeHistoryCorruption.php on affected wikis - https://phabricator.wikimedia.org/T302486
[04:39:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:44:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:49:40] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:51:38] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:53:34] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (contint1001, ...), Fresh: 119 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:16:56] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:22:58] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:56:57] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221204T0800)
[08:01:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:01:38] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[08:03:32] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[08:05:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:06:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:16:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:24] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[08:21:24] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[09:01:42] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[09:21:42] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[09:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:14:58] (PS1) Majavah: P:openstack::designate: update firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863843
[10:15:00] (PS1) Majavah: P:openstack::keystone: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863844
[10:15:02] (PS1) Majavah: P:openstack::glance: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863845
[10:15:04] (PS1) Majavah: P:openstack::cinder: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863846
[10:15:06] (PS1) Majavah: P:openstack::trove: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863847
[10:15:08] (PS1) Majavah: P:openstack::radosgw: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863848
[10:15:10] (PS1) Majavah: P:openstack::barbican: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863849
[10:15:12] (PS1) Majavah: P:openstack::heat: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863850
[10:15:14] (PS1) Majavah: P:openstack::magnum: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863851
[10:15:16] (PS1) Majavah: P:openstack::neutron: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863852
[10:15:18] (PS1) Majavah: P:openstack::nova: metadata: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863853
[10:15:20] (PS1) Majavah: P:openstack::placement: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863854
[10:15:22] (PS1) Majavah: P:openstack::galera: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863855
[10:16:51] (CR) CI reject: [V: -1] P:openstack::designate: update firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863843 (owner: Majavah)
[10:27:05] (CR) CI reject: [V: -1] P:openstack::heat: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863850 (owner: Majavah)
[10:28:36] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[10:30:48] (CR) CI reject: [V: -1] P:openstack::magnum: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863851 (owner: Majavah)
[10:38:36] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[10:46:16] (CR) CI reject: [V: -1] P:openstack::placement: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863854 (owner: Majavah)
[10:52:32] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:53:13] (CR) CI reject: [V: -1] P:openstack::galera: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863855 (owner: Majavah)
[10:54:25] (PS2) Majavah: P:openstack::designate: update firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863843
[10:54:27] (PS2) Majavah: P:openstack::keystone: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863844
[10:54:29] (PS2) Majavah: P:openstack::glance: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863845
[10:54:31] (PS2) Majavah: P:openstack::cinder: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863846
[10:54:33] (PS2) Majavah: P:openstack::trove: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863847
[10:54:35] (PS2) Majavah: P:openstack::radosgw: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863848
[10:54:37] (PS2) Majavah: P:openstack::barbican: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863849
[10:54:39] (PS2) Majavah: P:openstack::heat: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863850
[10:54:41] (PS2) Majavah: P:openstack::magnum: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863851
[10:54:43] (PS2) Majavah: P:openstack::neutron: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863852
[10:54:45] (PS2) Majavah: P:openstack::nova: metadata: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863853
[10:54:47] (PS2) Majavah: P:openstack::placement: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863854
[10:54:49] (PS2) Majavah: P:openstack::galera: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863855
[10:56:21] (CR) Majavah: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38573/console" [puppet] - https://gerrit.wikimedia.org/r/863855 (owner: Majavah)
[10:58:27] (PS3) Majavah: P:openstack::keystone: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863844
[10:58:29] (PS3) Majavah: P:openstack::glance: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863845
[10:58:31] (PS3) Majavah: P:openstack::cinder: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863846
[10:58:33] (PS3) Majavah: P:openstack::trove: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863847
[10:58:35] (PS3) Majavah: P:openstack::radosgw: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863848
[10:58:37] (PS3) Majavah: P:openstack::barbican: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863849
[10:58:39] (PS3) Majavah: P:openstack::heat: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863850
[10:58:41] (PS3) Majavah: P:openstack::magnum: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863851
[10:58:43] (PS3) Majavah: P:openstack::neutron: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863852
[10:58:45] (PS3) Majavah: P:openstack::nova: metadata: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863853
[10:58:47] (PS3) Majavah: P:openstack::placement: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863854
[10:58:49] (PS3) Majavah: P:openstack::galera: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863855
[10:59:54] (CR) Majavah: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38574/console" [puppet] - https://gerrit.wikimedia.org/r/863855 (owner: Majavah)
[11:08:42] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[12:20:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:25:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:46:08] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:54:02] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:59:34] PROBLEM - SSH on db1122.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:24:28] PROBLEM - Check systemd state on ncredir5001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:35:40] (PS1) Majavah: puppetdb: support using client certificates [software/cumin] - https://gerrit.wikimedia.org/r/863874
[13:42:34] RECOVERY - Check systemd state on ncredir5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[14:00:24] RECOVERY - SSH on db1122.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:16:57] (CR) CI reject: [V: -1] puppetdb: support using client certificates [software/cumin] - https://gerrit.wikimedia.org/r/863874 (owner: Majavah)
[14:32:17] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[15:04:34] SRE, Product-Analytics, Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (Fuzzy) I have problems with my Search Console permissions. I cannot use "Request Indexing" tool. @jbond, can you check? Also, can you specifically add the "h...
[16:23:50] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 112 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:25:50] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:32:32] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[19:35:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:38:54] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[19:40:54] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[19:45:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:16:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:29:45] Ref T302486, I did a dry run against the Module namespace and it returned the two expected results: is there any reason I shouldn't plan to run that proper?
[20:29:46] T302486: Run fixMergeHistoryCorruption.php on affected wikis - https://phabricator.wikimedia.org/T302486
[20:51:46] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[21:16:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:23:43] TheresNoTime: I'd say ask a DBA if it was a lot of writes, but that looks fine
[21:24:26] I'd say plan
[21:24:37] Maybe not on a Sunday evening though, in case it does go weird
[21:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:49:10] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:51:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:32:32] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert