[00:00:37] (03PS2) 10Dzahn: wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092 [00:01:17] (03CR) 10jerkins-bot: [V: 04-1] wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092 (owner: 10Dzahn) [00:01:17] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:03] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:12:15] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6453/IPv6: Idle - Tata, AS6453/IPv4: Idle - Tata https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:13:23] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:13:43] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 61 probes of 710 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:19:45] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 7 probes of 710 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:25:23] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:27:06] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability, 10Sustainability (Incident Followup): Alert that should have paged did not reach VictorOps because of partial networking outage - https://phabricator.wikimedia.org/T294166 (10Legoktm) [00:30:22] 10SRE, 10ops-eqiad, 10Sustainability (Incident Followup): eqiad: patch 2nd Equinix IXP - https://phabricator.wikimedia.org/T293726 (10Legoktm) I'm tagging this with #wikimedia-incident-actionable because somehow the above cable addition caused https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-1... [00:31:39] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:47] 10SRE, 10Wikimedia-Mailing-lists: Multiple people with unsubscribe issues - https://phabricator.wikimedia.org/T294106 (10Legoktm) See also the thread https://lists.wikimedia.org/hyperkitty/list/listadmins@lists.wikimedia.org/thread/D43SYRD3FUUJZWXAMH3KVRF46FABEWZ2/#D43SYRD3FUUJZWXAMH3KVRF46FABEWZ2 which discus... [00:45:16] (03PS3) 10Dzahn: wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092 [00:46:45] (03CR) 10jerkins-bot: [V: 04-1] wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092 (owner: 10Dzahn) [00:51:35] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 73, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:52:43] (03PS4) 10Dzahn: wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092 [00:52:45] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:53:13] Telia fixed the fiber [00:53:16] it looks [00:54:11] (03CR) 10jerkins-bot: [V: 04-1] wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092 (owner: 10Dzahn) [00:56:35] (03PS5) 10Dzahn: wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092 [00:58:03] (03CR) 10jerkins-bot: [V: 04-1] wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092 (owner: 10Dzahn) [01:12:40] signing off, have a good weekend [01:14:09] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:25:29] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:38:37] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:40:19] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:46:31] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:33:27] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:43:11] 10SRE, 10Wikimedia-Mailing-lists: Poor link parsing in HyperKitty (Mailman 3) web archive - https://phabricator.wikimedia.org/T283909 (10Krinkle) This seems especially important because the archive displays emails in plaintext, and the default way that HTML emails are converted to plaintext (by Google and Fast... [02:55:15] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:33] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:40:29] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:53:41] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:55:13] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:01:29] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:41:19] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:41:31] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:41:31] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:55:21] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:08:33] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:10:19] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:16:35] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:10:39] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:16:55] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:25:15] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:31:29] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:17] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:23:25] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:30:45] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:03] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:46:21] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [08:56:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [09:01:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [09:40:27] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:41] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [09:53:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [09:55:03] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:19] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [10:13:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [10:18:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [10:25:15] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:28:46] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [10:29:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [10:33:46] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [10:36:48] 10SRE, 10Performance Issue: High loading times on no.wikipedia - https://phabricator.wikimedia.org/T292762 (10Tholme) No, haven't noticed this lately either. Agree that this probably can be closed. [10:38:25] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:00:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [11:05:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [11:10:07] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [11:16:21] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [11:32:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [11:37:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [11:42:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [11:46:37] (03PS2) 10Majavah: toolforge::cronrunner: disable cron on non-active hosts [puppet] - 10https://gerrit.wikimedia.org/r/732986 (https://phabricator.wikimedia.org/T284767) [11:48:00] 10SRE, 10Infrastructure-Foundations, 10netops: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org - https://phabricator.wikimedia.org/T275234 (10Perryprog) >>! In T275234#7361977, @Perryprog wrote: > This issue has been reported on Znuny on Ticket#2021091710000804, so it's safe to a... [11:51:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [11:56:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [12:01:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [12:02:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [12:07:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [12:12:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [12:22:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [12:25:11] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:27] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [12:40:13] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:41:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [12:51:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [12:53:23] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:56:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [13:11:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [13:16:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [13:25:31] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [13:31:47] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [13:39:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [13:40:05] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [13:46:21] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [13:56:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [14:06:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [14:41:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [14:44:14] 10SRE, 10Infrastructure-Foundations, 10netops: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org - https://phabricator.wikimedia.org/T275234 (10CDanis) Per NEL data it looks like that this issue was fixed on Sept 17th! {F34707922} These are the reported TCP resets for Tata Sky Br... [14:52:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [14:55:05] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:07] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:01:21] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:15] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:14:34] (03PS1) 10Gerrit maintenance bot: Add pwn to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/733140 (https://phabricator.wikimedia.org/T292415) [15:14:45] (03PS1) 10Gerrit maintenance bot: Add ami to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/733141 (https://phabricator.wikimedia.org/T292414) [15:15:32] (03CR) 10Urbanecm: [C: 03+1] Add pwn to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/733140 (https://phabricator.wikimedia.org/T292415) (owner: 10Gerrit maintenance bot) [15:15:35] (03CR) 10Urbanecm: [C: 03+1] Add ami to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/733141 (https://phabricator.wikimedia.org/T292414) (owner: 10Gerrit maintenance bot) [15:24:07] (03PS1) 10Urbanecm: Initial configuration for pwnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/733142 (https://phabricator.wikimedia.org/T292415) [15:27:33] (03PS1) 10Urbanecm: Initial configuration for amiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/733143 (https://phabricator.wikimedia.org/T292414) [15:36:01] (03PS1) 10Urbanecm: Initial configuration for lmowiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/733145 (https://phabricator.wikimedia.org/T291390) [15:45:16] !log Start server-side upload for 1 video file (T289781), testing whether T291137 is still an issue [15:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:24] T291137: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 [15:45:24] T289781: Server side upload for PantheraLeo1359531 - https://phabricator.wikimedia.org/T289781 [15:55:23] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:37] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:30] 10SRE-swift-storage, 10TimedMediaHandler-Transcode: Intermittent transcode failure 'An unknown error occurred in storage backend "local-swift-codfw".' - https://phabricator.wikimedia.org/T201090 (10Urbanecm) Just got this: ` [urbanecm@mwmaint1002 ~/uploads]$ mwscript importImages.php --wiki=commonswiki --comm... [16:40:23] !log restarting blazegraph on wdqs1004 and wdqs1006 (free allocators alert) [16:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [17:10:07] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:10:27] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:43] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:15] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:10:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:14:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:25:25] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:37] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:25:13] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:38:25] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:40:23] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:39] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:23] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:01:39] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:13] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:53:09] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:55:21] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:01:19] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:19] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:17] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:55:19] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:55:39] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:59:39] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [23:01:21] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:22] 10SRE, 10Performance Issue: High loading times on no.wikipedia - https://phabricator.wikimedia.org/T292762 (10jhsoby) 05Open→03Resolved a:03jhsoby [23:10:09] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:23:07] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers