[00:01:39] <icinga-wm>	 RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:00] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] o11y: deploy prometheus alerts to all instances [alerts] - 10https://gerrit.wikimedia.org/r/900628 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[00:12:15] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:13:28] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] kafka-logging: bring up kafka-logging1004 with node id 1004 [puppet] - 10https://gerrit.wikimedia.org/r/900337 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron)
[00:13:46] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] kafka-logging: stop kafka services on kafka-logging1001 [puppet] - 10https://gerrit.wikimedia.org/r/900336 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron)
[00:14:08] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] alerting_host: failover icinga and alertmanger from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron)
[00:17:59] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:41] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:25] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 
[00:48:25] <icinga-wm>	 th aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:48:29] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:37] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.32.73:9042 on restbase2025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[00:48:59] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:05] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.32.75:9042 on restbase2025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[00:49:09] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.32.74:9042 on restbase2025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[00:52:11] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:54:45] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1003 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1003.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:06:11] <wikibugs>	 (03PS1) 10Nray: Enable page tools for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900748 (https://phabricator.wikimedia.org/T331052)
[01:06:26] <wikibugs>	 (03CR) 10Nray: [C: 04-1] "waiting for green light from Olga" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900748 (https://phabricator.wikimedia.org/T331052) (owner: 10Nray)
[01:12:59] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:45] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:43] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:20:59] <urandom>	 !log powercycling restbase2025 — T332462
[01:21:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:04] <stashbot>	 T332462: restbase2025 is down - https://phabricator.wikimedia.org/T332462
[01:23:17] <icinga-wm>	 PROBLEM - Host restbase2025 is DOWN: PING CRITICAL - Packet loss = 100%
[01:23:43] <icinga-wm>	 RECOVERY - Host restbase2025 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms
[01:23:49] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.32.75:7001 on restbase2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[01:24:07] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2025 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:24:21] <icinga-wm>	 PROBLEM - cassandra-b service on restbase2025 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:24:21] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2025 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:24:37] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.32.73:7001 on restbase2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[01:25:29] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.32.74:7001 on restbase2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[01:26:03] <icinga-wm>	 RECOVERY - cassandra-c service on restbase2025 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:26:17] <icinga-wm>	 RECOVERY - cassandra-a service on restbase2025 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:26:17] <icinga-wm>	 RECOVERY - cassandra-b service on restbase2025 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:28:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:28:27] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.32.73:7001 on restbase2025 is OK: SSL OK - Certificate restbase2025-a valid until 2023-12-09 16:37:31 +0000 (expires in 266 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[01:28:37] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.32.75:7001 on restbase2025 is OK: SSL OK - Certificate restbase2025-c valid until 2023-12-09 16:37:36 +0000 (expires in 266 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[01:28:55] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.32.73:9042 on restbase2025 is OK: TCP OK - 0.033 second response time on 10.192.32.73 port 9042 https://phabricator.wikimedia.org/T93886
[01:29:19] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.192.32.74:7001 on restbase2025 is OK: SSL OK - Certificate restbase2025-b valid until 2023-12-09 16:37:33 +0000 (expires in 266 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[01:29:25] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.32.75:9042 on restbase2025 is OK: TCP OK - 0.033 second response time on 10.192.32.75 port 9042 https://phabricator.wikimedia.org/T93886
[01:29:27] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.32.74:9042 on restbase2025 is OK: TCP OK - 0.033 second response time on 10.192.32.74 port 9042 https://phabricator.wikimedia.org/T93886
[01:37:51] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:43:35] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1004 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1004.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:43:37] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:21] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:33] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:45:19] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:57:26] <logmsgbot>	 !log fab@deploy2002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided)
[02:57:31] <logmsgbot>	 !log fab@deploy2002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 05s)
[03:12:09] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:53] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:49] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:48:33] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:04] <wikibugs>	 (03CR) 10Tim Starling: Unprovision the "swift" dashboard (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/899885 (https://phabricator.wikimedia.org/T328872) (owner: 10Tim Starling)
[04:07:13] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:09:41] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:57] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1004 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1004.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:25] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:17] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:01] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:05:50] <wikibugs>	 (03CR) 10Gergő Tisza: "Do we need this? On production, the vendor backport should do the thing, and I don't think anything else cares about the wmf/ branches. (O" [extensions/OAuth] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900144 (https://phabricator.wikimedia.org/T321160) (owner: 10Jforrester)
[05:09:03] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:51] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:14:49] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:18:39] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:20:38] <wikibugs>	 (03PS1) 10David Martin: Add a comment about the need to specify logstash=>debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900752
[05:37:21] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:41:45] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:43:07] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1004 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1004.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:43:09] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:47:33] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:48:57] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:13:51] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:31] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:42:45] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:48:21] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:48:49] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:54:23] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1003 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1003.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230318T0700)
[07:09:11] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:53] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:39:47] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:03] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:29] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:48:47] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:55:18] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on cephosd[1001-1005].eqiad.wmnet with reason: Systemd units failing, puppet tries to bring them up periodically, spam on IRC
[07:55:34] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on cephosd[1001-1005].eqiad.wmnet with reason: Systemd units failing, puppet tries to bring them up periodically, spam on IRC
[07:56:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:01:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:12:13] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:18:25] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:51] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:09:15] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:19] <icinga-wm>	 RECOVERY - Check systemd state on cephosd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:08] <wikibugs>	 (03PS4) 10Acamicamacaraca: SITENAME change of Serbo-Croatian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900675 (https://phabricator.wikimedia.org/T332468)
[10:05:58] <wikibugs>	 (03PS5) 10Acamicamacaraca: SITENAME change of Serbo-Croatian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900675 (https://phabricator.wikimedia.org/T332468)
[10:12:42] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: trafficserver: make routing to mw on k8s more manageable [puppet] - 10https://gerrit.wikimedia.org/r/900704 (https://phabricator.wikimedia.org/T331318)
[10:28:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:33:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:31:12] <wikibugs>	 (03PS2) 10Zoranzoki21: [WIP] Remove FlaggedRevs from ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762)
[12:31:45] <wikibugs>	 (03CR) 10Zoranzoki21: [WIP] Remove FlaggedRevs from ptwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) (owner: 10Zoranzoki21)
[12:34:22] <wikibugs>	 (03PS3) 10Zoranzoki21: Remove FlaggedRevs from ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762)
[13:46:44] <apergos>	 !log rsync of xmldata private dir from screen as ariel on dumpsdata1004 to dumpsdata1005, no bandwidth cap
[13:46:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:01] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:02:35] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:07:31] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:08:05] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:20:23] <wikibugs>	 (03PS1) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900763 (https://phabricator.wikimedia.org/T332028)
[14:21:54] <wikibugs>	 (03Abandoned) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900763 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[14:26:46] <apergos>	 !log rsync of xmldata public dir from screen as ariel on dumpsdata1004 to dumpsdata1005, no bandwidth cap
[14:26:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:55] <wikibugs>	 (03CR) 10Albertoleoncio: Remove FlaggedRevs from ptwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) (owner: 10Zoranzoki21)
[17:42:45] <wikibugs>	 (03PS4) 10Zoranzoki21: Remove FlaggedRevs from ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762)
[17:43:07] <wikibugs>	 (03CR) 10Zoranzoki21: Remove FlaggedRevs from ptwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) (owner: 10Zoranzoki21)
[17:55:49] <icinga-wm>	 RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin1001 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[17:55:49] <icinga-wm>	 RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[19:12:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:14:40] <vgutierrez>	 apergos: is that your rsync?
[19:16:03] <apergos>	 doubtful that's the cause, I've run it other days and never had a problem
[19:16:06] <apergos>	 vgutierrez: 
[19:17:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:36:48] <elukey>	 apergos: Hi! I pinged you in another chan, dumps1004 is on asw1-a, and https://librenms.wikimedia.org/device/device=160/tab=port/port=21667/ shows a big bw usage
[19:37:23] <elukey>	 it matches with what we are seeing, nothing is exploding right now but if you could bw-cap the rsync it would be better
[19:39:03] <apergos>	 ok, I can do that
[19:39:19] <apergos>	 what's a good cap, do you think?
[19:40:26] <apergos>	 vgutierrez: 
[19:40:29] <apergos>	 er
[19:40:46] <elukey>	 it is running now at around 3Gbps afaics from librenms (Receiving data from 1005)
[19:40:50] <apergos>	 elukey:   (sorry for the gratuitous ping v gutierrez) 
[19:41:05] <elukey>	 we all love Valentin don't worry, more wikilove to him :D
[19:41:10] <apergos>	 lol
[19:41:34] <elukey>	 not super familiar with bw-cap for rsync, even something half of what it is now would be perfect
[19:41:43] <apergos>	 okey dokey
[19:41:52] <elukey>	 thanks!
[19:46:00] <apergos>	 retrying with bandwidth limit 100000, which should be a lot less than you were seeing, please ping me if I got that wrong, elukey
[19:47:42] <elukey>	 apergos: yep way better! I think that we are good
[19:48:29] <apergos>	 ok great, odd that this caused an alert, since I've run rsyncs over the past few weeks as preparation for setting up various hosts, and we haven't had this issue
[19:48:39] <apergos>	 but anyways, if it's better now, that will do
[19:49:03] <elukey>	 apergos: it probably tipped over because of other traffic, thanks a lot for the follow up! Have a nice evening :)
[19:49:25] <apergos>	 you too!
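A rough sanity check of the cap chosen in the exchange above, as a minimal sketch. It assumes the "bandwidth limit 100000" was passed to rsync's `--bwlimit` option with no unit suffix, in which case rsync reads the value in units of 1024 bytes per second; the log does not show the actual command line. The function names are hypothetical helpers, not part of any tool mentioned in the channel.

```python
def bwlimit_to_gbps(kib_per_sec: int) -> float:
    """Convert an rsync --bwlimit value (units of 1024 bytes/s) to Gbit/s."""
    bytes_per_sec = kib_per_sec * 1024
    return bytes_per_sec * 8 / 1e9


def gbps_to_bwlimit(gbps: float) -> int:
    """Convert a target rate in Gbit/s to an rsync --bwlimit value, rounded down."""
    return int(gbps * 1e9 / 8 / 1024)


if __name__ == "__main__":
    # Assumed reading of the cap from the log: 100000 units of 1024 bytes/s.
    print(f"cap of 100000 is about {bwlimit_to_gbps(100000):.2f} Gbit/s")
    # "half of what it is now", taking the roughly 3 Gbps seen in LibreNMS.
    print(f"1.5 Gbit/s would be --bwlimit={gbps_to_bwlimit(1.5)}")
```

Under that assumption the cap works out to roughly 0.8 Gbit/s, well under half of the 3 Gbps reported from LibreNMS, which is consistent with the "way better" reply.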
[22:34:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:39:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:47:34] <logmsgbot>	 !log fab@deploy2002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided)
[22:47:54] <logmsgbot>	 !log fab@deploy2002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 19s)
[23:02:03] <wikibugs>	 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10doctaxon) @TheDJ thanks for your comment. These 429 errors "needs to be judged on case b...