[00:01:39] RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:11:00] (03CR) 10Cwhite: [C: 03+1] o11y: deploy prometheus alerts to all instances [alerts] - 10https://gerrit.wikimedia.org/r/900628 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [00:12:15] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:13:28] (03CR) 10Cwhite: [C: 03+1] kafka-logging: bring up kafka-logging1004 with node id 1004 [puppet] - 10https://gerrit.wikimedia.org/r/900337 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [00:13:46] (03CR) 10Cwhite: [C: 03+1] kafka-logging: stop kafka services on kafka-logging1001 [puppet] - 10https://gerrit.wikimedia.org/r/900336 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [00:14:08] (03CR) 10Cwhite: [C: 03+1] alerting_host: failover icinga and alertmanger from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron) [00:17:59] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:41] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:25] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} 
(retrieve the most-read articles for January 1, [00:48:25] th aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [00:48:29] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:37] PROBLEM - cassandra-a CQL 10.192.32.73:9042 on restbase2025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [00:48:59] RECOVERY - Check systemd state on cephosd1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:05] PROBLEM - cassandra-c CQL 10.192.32.75:9042 on restbase2025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [00:49:09] PROBLEM - cassandra-b CQL 10.192.32.74:9042 on restbase2025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [00:52:11] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [00:54:45] PROBLEM - Check systemd state on cephosd1003 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1003.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:06:11] (03PS1) 10Nray: Enable page tools for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900748 (https://phabricator.wikimedia.org/T331052) [01:06:26] (03CR) 10Nray: [C: 04-1] "waiting for green light from Olga" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900748 (https://phabricator.wikimedia.org/T331052) (owner: 10Nray) [01:12:59] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:16:45] RECOVERY - Check systemd 
state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:18:43] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:20:59] !log powercycling restbase2025 — T332462 [01:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:04] T332462: restbase2025 is down - https://phabricator.wikimedia.org/T332462 [01:23:17] PROBLEM - Host restbase2025 is DOWN: PING CRITICAL - Packet loss = 100% [01:23:43] RECOVERY - Host restbase2025 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [01:23:49] PROBLEM - cassandra-c SSL 10.192.32.75:7001 on restbase2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [01:24:07] PROBLEM - cassandra-c service on restbase2025 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:24:21] PROBLEM - cassandra-b service on restbase2025 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:24:21] PROBLEM - cassandra-a service on restbase2025 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:24:37] PROBLEM - cassandra-a SSL 10.192.32.73:7001 on restbase2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [01:25:29] PROBLEM - cassandra-b SSL 10.192.32.74:7001 on restbase2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused 
https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [01:26:03] RECOVERY - cassandra-c service on restbase2025 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:26:17] RECOVERY - cassandra-a service on restbase2025 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:26:17] RECOVERY - cassandra-b service on restbase2025 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:28:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:28:27] RECOVERY - cassandra-a SSL 10.192.32.73:7001 on restbase2025 is OK: SSL OK - Certificate restbase2025-a valid until 2023-12-09 16:37:31 +0000 (expires in 266 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [01:28:37] RECOVERY - cassandra-c SSL 10.192.32.75:7001 on restbase2025 is OK: SSL OK - Certificate restbase2025-c valid until 2023-12-09 16:37:36 +0000 (expires in 266 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [01:28:55] RECOVERY - cassandra-a CQL 10.192.32.73:9042 on restbase2025 is OK: TCP OK - 0.033 second response time on 10.192.32.73 port 9042 https://phabricator.wikimedia.org/T93886 [01:29:19] RECOVERY - cassandra-b SSL 10.192.32.74:7001 on restbase2025 is OK: SSL OK - Certificate restbase2025-b valid until 2023-12-09 16:37:33 +0000 (expires in 266 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [01:29:25] RECOVERY - cassandra-c CQL 10.192.32.75:9042 on restbase2025 is OK: TCP OK - 0.033 second response time on 10.192.32.75 port 9042 https://phabricator.wikimedia.org/T93886 [01:29:27] RECOVERY - cassandra-b CQL 10.192.32.74:9042 on 
restbase2025 is OK: TCP OK - 0.033 second response time on 10.192.32.74 port 9042 https://phabricator.wikimedia.org/T93886 [01:37:51] RECOVERY - Check systemd state on cephosd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:43:35] PROBLEM - Check systemd state on cephosd1004 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1004.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:43:37] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:49:21] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:33] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:45:19] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:57:26] !log fab@deploy2002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided) [02:57:31] !log 
fab@deploy2002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 05s) [03:12:09] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:17:53] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:42:49] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:48:33] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:01:04] (03CR) 10Tim Starling: Unprovision the "swift" dashboard (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/899885 (https://phabricator.wikimedia.org/T328872) (owner: 10Tim Starling) [04:07:13] RECOVERY - Check systemd state on cephosd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:09:41] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:12:57] PROBLEM - Check systemd state on cephosd1004 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1004.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:15:25] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:42:17] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:48:01] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:05:50] (03CR) 10Gergő Tisza: "Do we need this? On production, the vendor backport should do the thing, and I don't think anything else cares about the wmf/ branches. (O" [extensions/OAuth] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900144 (https://phabricator.wikimedia.org/T321160) (owner: 10Jforrester) [05:09:03] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:51] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:14:49] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:39] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:20:38] (03PS1) 10David Martin: Add a comment about the need to specify logstash=>debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900752 [05:37:21] RECOVERY - Check systemd state on cephosd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:41:45] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:43:07] PROBLEM - Check systemd state on cephosd1004 is CRITICAL: CRITICAL - degraded: The following units 
failed: ceph-mgr@cephosd1004.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:43:09] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:47:33] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:48:57] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:13:51] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:19:31] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:42:45] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:21] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:49] RECOVERY - Check systemd state on cephosd1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:54:23] PROBLEM - Check systemd state on cephosd1003 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1003.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230318T0700) [07:09:11] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:14:53] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:47] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:43:03] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:45:29] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:48:47] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:55:18] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on cephosd[1001-1005].eqiad.wmnet with reason: Systemd units failing, puppet tries to bring them up periodically, spam on IRC [07:55:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on cephosd[1001-1005].eqiad.wmnet with reason: Systemd units failing, puppet tries to bring them up periodically, spam on IRC [07:56:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[08:01:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:12:13] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:25] RECOVERY - Check systemd state on cephosd1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:43:51] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:15] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:19] RECOVERY - Check systemd state on cephosd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:08] (03PS4) 10Acamicamacaraca: SITENAME change of Serbo-Croatian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900675 (https://phabricator.wikimedia.org/T332468) [10:05:58] (03PS5) 10Acamicamacaraca: SITENAME change of Serbo-Croatian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900675 (https://phabricator.wikimedia.org/T332468) [10:12:42] (03PS2) 10Giuseppe Lavagetto: trafficserver: make routing to mw on k8s more manageable [puppet] - 10https://gerrit.wikimedia.org/r/900704 (https://phabricator.wikimedia.org/T331318) [10:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:31:12] (03PS2) 10Zoranzoki21: [WIP] Remove FlaggedRevs from ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) [12:31:45] (03CR) 10Zoranzoki21: [WIP] Remove FlaggedRevs from ptwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) (owner: 10Zoranzoki21) [12:34:22] (03PS3) 10Zoranzoki21: Remove FlaggedRevs from ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) [13:46:44] !log rsync of xmldata private dir from screen as ariel on dumpsdata1004 to dumpsdata1005, no bandwidth cap [13:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:02:35] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:07:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:08:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:20:23] (03PS1) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900763 
(https://phabricator.wikimedia.org/T332028) [14:21:54] (03Abandoned) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900763 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [14:26:46] !log rsync of xmldata public dir from screen as ariel on dumpsdata1004 to dumpsdata1005, no bandwidth cap [14:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:55] (03CR) 10Albertoleoncio: Remove FlaggedRevs from ptwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) (owner: 10Zoranzoki21) [17:42:45] (03PS4) 10Zoranzoki21: Remove FlaggedRevs from ptwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) [17:43:07] (03CR) 10Zoranzoki21: Remove FlaggedRevs from ptwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900696 (https://phabricator.wikimedia.org/T331762) (owner: 10Zoranzoki21) [17:55:49] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin1001 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [17:55:49] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [19:12:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page 
[19:14:40] apergos: is that your rsync? [19:16:03] doubtful that's the cause, I've run it other days and never had a problem [19:16:06] vgutierrez: [19:17:30] (Primary outbound port utilisation over 80% #page) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [19:36:48] apergos: Hi! I pinged you in another chan, dumps1004 is on asw1-a, and https://librenms.wikimedia.org/device/device=160/tab=port/port=21667/ shows a big bw usage [19:37:23] it matches with what we are seeing, nothing is exploding right now but if you could bw-cap the rsync it would be better [19:39:03] ok, I can do that [19:39:19] what's a good cap do you think? [19:40:26] vgutierrez: [19:40:29] er [19:40:46] it is running now at around 3Gbps afaics from librenms (Receiving data from 1005) [19:40:50] elukey: (sorry for the gratuitous ping v gutierrez) [19:41:05] we all love Valentin don't worry, more wikilove to him :D [19:41:10] lol [19:41:34] not super familiar with bw-cap for rsync, even something half to what it is now would be perfect [19:41:43] okey dokey [19:41:52] thanks! [19:46:00] retrying with bandwidth limit 100000 which should be a lot less than you were seeing, please ping me if I got that wrong, elukey [19:47:42] apergos: yep way better! 
I think that we are good [19:48:29] ok great, odd that this caused an alert, since I've run rsyncs over the past few weeks as preparation for setting up various hosts, and we haven't had this issue [19:48:39] but anyways, if it's better now, that will do [19:49:03] apergos: it probably tipped over because of other traffic, thanks a lot for the follow up! Have a nice evening :) [19:49:25] you too! [22:34:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:39:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:47:34] !log fab@deploy2002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided) [22:47:54] !log fab@deploy2002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 19s) [23:02:03] 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10doctaxon) @TheDJ thanks for your comment. These 429 errors "needs to be judged on case b...
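A rough sanity check on the rsync cap mentioned at 19:46:00, as a minimal Python sketch. It assumes the value 100000 was passed to rsync's --bwlimit option without a suffix, in which case rsync reads it in units of 1024 bytes per second; the log does not show the actual command line. The 3 Gbps figure is the one quoted from librenms at 19:40:46, and the function names are illustrative only.

```python
KIB = 1024  # bytes per unit for an unsuffixed rsync --bwlimit value (assumed usage)


def bwlimit_to_gbps(bwlimit_kib_per_s: int) -> float:
    """Convert an unsuffixed --bwlimit value (KiB/s) to gigabits per second."""
    bytes_per_s = bwlimit_kib_per_s * KIB
    return bytes_per_s * 8 / 1e9


def gbps_to_bwlimit(gbps: float) -> int:
    """Convert a target rate in Gbps to an unsuffixed --bwlimit value, rounded down."""
    return int(gbps * 1e9 / 8 / KIB)


observed_gbps = 3.0                  # rate reported in the chat before the cap
cap_gbps = bwlimit_to_gbps(100000)   # cap chosen in the chat

print(f"cap: {cap_gbps:.2f} Gbps ({cap_gbps / observed_gbps:.0%} of observed)")
print(f"--bwlimit for half the observed rate: {gbps_to_bwlimit(observed_gbps / 2)}")
```

Under those assumptions the cap works out to about 0.82 Gbps, roughly a quarter of the observed rate and well under the "half of what it is now" requested at 19:41:34.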