[00:04:55] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [00:37:35] (03PS1) 10BCornwall: slo_definitions: Switch to using varnish_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606) [00:39:00] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965572 [00:39:06] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965572 (owner: 10TrainBranchBot) [00:46:29] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:41] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965572 (owner: 10TrainBranchBot) [00:54:03] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965572 (owner: 10TrainBranchBot) [01:36:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T343198)', diff saved to https://phabricator.wikimedia.org/P52939 and previous config saved to /var/cache/conftool/dbconfig/20231014-013648-arnaudb.json [01:36:54] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [01:51:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P52940 and previous config saved to /var/cache/conftool/dbconfig/20231014-015154-arnaudb.json [02:07:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P52941 and previous config saved to /var/cache/conftool/dbconfig/20231014-020701-arnaudb.json [02:19:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T343198)', diff saved to https://phabricator.wikimedia.org/P52942 and previous config saved to /var/cache/conftool/dbconfig/20231014-022208-arnaudb.json [02:22:11] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [02:22:13] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [02:22:24] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [02:29:46] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [02:30:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:34:43] PROBLEM - cassandra-a service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:35:03] PROBLEM - cassandra-c service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:35:57] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:38:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:33:59] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - No response from remote host 185.15.59.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:17:02] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T348903 (10phaultfinder) [07:02:29] (Temperature) firing: Inlet Temp issue on cp5029:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=cp5029 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [07:03:29] (Temperature) firing: Inlet Temp issue on cp5024:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=cp5024 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [07:03:34] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:07:29] (Temperature) firing: (8) Inlet Temp issue on cp5025:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://alerts.wikimedia.org/?q=alertname%3DTemperature [07:08:29] (Temperature) firing: (7) Inlet Temp issue on cp5018:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://alerts.wikimedia.org/?q=alertname%3DTemperature [07:13:29] (Temperature) firing: (8) Inlet Temp issue on cp5017:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://alerts.wikimedia.org/?q=alertname%3DTemperature [07:23:29] (Temperature) firing: Inlet Temp issue on lvs5004:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=lvs5004 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [07:25:30] (Temperature) firing: (3) Inlet Temp issue on ganeti5004:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://alerts.wikimedia.org/?q=alertname%3DTemperature [07:27:29] (Temperature) firing: Inlet Temp issue on dns5004:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=dns5004 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [07:28:29] (Temperature) firing: (3) Inlet Temp issue on lvs5004:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://alerts.wikimedia.org/?q=alertname%3DTemperature [07:30:29] (Temperature) firing: (4) Inlet Temp issue on ganeti5004:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://alerts.wikimedia.org/?q=alertname%3DTemperature [07:32:29] (Temperature) firing: (2) Inlet Temp issue on dns5003:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://alerts.wikimedia.org/?q=alertname%3DTemperature [08:05:30] (Temperature) resolved: (4) Inlet Temp issue on ganeti5004:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://alerts.wikimedia.org/?q=alertname%3DTemperature [08:07:29] (Temperature) resolved: (2) Inlet Temp issue on dns5003:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://alerts.wikimedia.org/?q=alertname%3DTemperature [08:07:29] (Temperature) resolved: (8) Inlet Temp issue on cp5025:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://alerts.wikimedia.org/?q=alertname%3DTemperature [08:08:30] (Temperature) resolved: (3) Inlet Temp issue on lvs5004:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://alerts.wikimedia.org/?q=alertname%3DTemperature [08:08:30] (Temperature) resolved: (8) Inlet Temp issue on cp5017:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://alerts.wikimedia.org/?q=alertname%3DTemperature [08:27:59] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:28:05] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:28:57] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:43:17] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:44:12] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:59:12] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:59:25] RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:27] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:15:22] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [09:15:36] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [09:15:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T343198)', diff saved to https://phabricator.wikimedia.org/P52943 and previous config saved to /var/cache/conftool/dbconfig/20231014-091542-arnaudb.json [09:15:47] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [09:15:55] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:19:42] (SystemdUnitFailed) firing: user@499.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:21:33] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: user@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:03] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:33:23] RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:42] (SystemdUnitFailed) resolved: user@499.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:03:34] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:47:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:48:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:19:02] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10Unstewarded-production-error, 10Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (10Beao) Am I experiencing the same problem? I'm getting this error using the... [13:19:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:47:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.573933116887395s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:53:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:42:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 4.272967359846409s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:10:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 4.194747538184551s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [16:15:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 4.087417148240994s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:41:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [17:08:59] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:09:59] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:17:25] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:32:04] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [17:32:31] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:33:23] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:33:49] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:34:08] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:34:45] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:36:33] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (10DDeSouza) Ignore my request. Resolved. [17:36:44] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (10DDeSouza) 05Open→03Resolved [17:59:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T343198)', diff saved to https://phabricator.wikimedia.org/P52944 and previous config saved to /var/cache/conftool/dbconfig/20231014-175936-arnaudb.json [17:59:41] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [18:12:00] (NodeTextfileStale) firing: Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:14:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P52945 and previous config saved to /var/cache/conftool/dbconfig/20231014-181442-arnaudb.json [18:29:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P52946 and previous config saved to /var/cache/conftool/dbconfig/20231014-182949-arnaudb.json [18:30:58] !log starting Cassandra decommission of restbase1016-a — T328490 [18:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:03] T328490: restbase cluster: decommission end-of-life hosts - https://phabricator.wikimedia.org/T328490 [18:42:00] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:44:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T343198)', diff saved to https://phabricator.wikimedia.org/P52947 and previous config saved to /var/cache/conftool/dbconfig/20231014-184455-arnaudb.json [18:44:57] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [18:45:04] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [18:45:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [18:45:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T343198)', diff saved to https://phabricator.wikimedia.org/P52948 and previous config saved to /var/cache/conftool/dbconfig/20231014-184517-arnaudb.json [18:53:34] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:47:27] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:56:23] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:46:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 4.544195340921269s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:23:08] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (10Pppery) [22:16:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.645615822643996s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:42:01] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:53:34] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable