[00:17:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 4.753924627959494s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:37:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 4.882763023229455s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:38:46] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 4.647351074955542s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:38:47] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/965573
[00:38:53] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/965573 (owner: TrainBranchBot)
[00:42:51] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:40] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/965573 (owner: TrainBranchBot)
[01:02:31] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 5.4888519858194575s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:24:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:26:13] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:15:03] SRE-swift-storage, Commons, MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (Kizule) See {T348688} as well.
[02:29:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:31:11] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:38:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:01] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:03:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T343198)', diff saved to https://phabricator.wikimedia.org/P52949 and previous config saved to /var/cache/conftool/dbconfig/20231015-030828-arnaudb.json
[03:08:34] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[03:23:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P52950 and previous config saved to /var/cache/conftool/dbconfig/20231015-032335-arnaudb.json
[03:38:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P52951 and previous config saved to /var/cache/conftool/dbconfig/20231015-033841-arnaudb.json
[03:53:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T343198)', diff saved to https://phabricator.wikimedia.org/P52952 and previous config saved to /var/cache/conftool/dbconfig/20231015-035347-arnaudb.json
[03:53:50] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance
[03:53:52] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[03:54:14] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance
[03:54:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T343198)', diff saved to https://phabricator.wikimedia.org/P52953 and previous config saved to /var/cache/conftool/dbconfig/20231015-035420-arnaudb.json
[04:57:51] PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:51] RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:21:56] (PS1) Gergő Tisza: [beta] Make temp user config SUL-friendly [mediawiki-config] - https://gerrit.wikimedia.org/r/965879 (https://phabricator.wikimedia.org/T342475)
[06:18:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 2.9471948942024726s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:23:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.0401057043859936s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:42:01] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231015T0700)
[07:03:34] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:34:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.9157141057122975s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:39:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.0145329805301717s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:29:17] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:08:53] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:35:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:57:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.1827684619855s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:07:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.1577447527022526s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:21:57] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:42:01] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:03:16] !log hashar@deploy2002 Started deploy [integration/docroot@096f637]: (no justification provided)
[11:03:21] !log hashar@deploy2002 Finished deploy [integration/docroot@096f637]: (no justification provided) (duration: 00m 05s)
[11:03:34] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:14:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T343198)', diff saved to https://phabricator.wikimedia.org/P52954 and previous config saved to /var/cache/conftool/dbconfig/20231015-121446-arnaudb.json
[12:14:52] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[12:15:13] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[12:29:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P52955 and previous config saved to /var/cache/conftool/dbconfig/20231015-122953-arnaudb.json
[12:32:03] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 234.88 ms
[12:45:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P52956 and previous config saved to /var/cache/conftool/dbconfig/20231015-124459-arnaudb.json
[13:00:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T343198)', diff saved to https://phabricator.wikimedia.org/P52957 and previous config saved to /var/cache/conftool/dbconfig/20231015-130005-arnaudb.json
[13:00:08] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[13:00:21] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[13:00:26] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[13:00:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T343198)', diff saved to https://phabricator.wikimedia.org/P52958 and previous config saved to /var/cache/conftool/dbconfig/20231015-130027-arnaudb.json
[14:30:37] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[14:30:39] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[14:31:51] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[14:31:56] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[14:31:59] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[14:32:01] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[14:34:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:25] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[14:38:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:42:01] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:45:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:53:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:03:28] SRE-swift-storage, Commons, MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (Bugreporter)
[15:05:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 5.402354941304805s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:05:20] SRE-swift-storage: [Epic] Determine a strategy to store files between 5 and 100 GB - https://phabricator.wikimedia.org/T191802 (Bugreporter)
[15:30:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 5.0223453889088265s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:30:46] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 4.910577901905123s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:35:31] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.3609046602497177s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:58:58] SRE, Maps: Allow Wikimedia Maps usage on QGIS - https://phabricator.wikimedia.org/T348929 (Tokolazt13)
[16:00:27] SRE-swift-storage, Commons, MediaWiki-Uploading, Unstewarded-production-error, Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (Kizule) See {T348586} as well.
[16:04:17] SRE, Maps: Allow Wikimedia Maps usage on QGIS - https://phabricator.wikimedia.org/T348929 (RhinosF1) What is QGIS? How does it link to Wikimedia?
[17:03:09] SRE-swift-storage, Epic: [Epic] Determine a strategy to store files between 5 and 100 GB - https://phabricator.wikimedia.org/T191802 (Frostly)
[17:43:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:44:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:58:41] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (Mike_Peel) I'm doing a bulk upload from LIghtroom this eve and am...
[17:59:41] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:13] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:42:16] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:53:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:55:44] SRE, Maps: Allow Wikimedia Maps usage on QGIS - https://phabricator.wikimedia.org/T348929 (Aklapper) Open→Invalid Hi @Tokolazt13, thanks for taking the time to report this. The three fields above are not filled out, so for now I am going to decline this ticket. Please see https://wikitech.wikime...
[18:55:45] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:03:59] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:10:42] !log starting Cassandra decommission of restbase1016-b — T328490
[19:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:47] T328490: restbase cluster: decommission end-of-life hosts - https://phabricator.wikimedia.org/T328490
[19:44:15] (Abandoned) Paladox: wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[19:46:27] (CR) Paladox: ircecho: Migrate from OptionParser to ArgumentParser (1 comment) [puppet] - https://gerrit.wikimedia.org/r/480760 (owner: Paladox)
[20:27:33] SRE-swift-storage, Commons: Some or all of the undeletion failed - https://phabricator.wikimedia.org/T348937 (Aklapper)
[20:32:20] SRE-swift-storage, Commons, MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (Beao) As I mentioned in T341007, get this error too and have collected them here: https://commons...
[21:38:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T343198)', diff saved to https://phabricator.wikimedia.org/P52959 and previous config saved to /var/cache/conftool/dbconfig/20231015-213855-arnaudb.json
[21:39:07] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[21:40:56] SRE-swift-storage, Commons: Some or all of the undeletion failed - https://phabricator.wikimedia.org/T348937 (TheresNoTime) ==== Error ==== * mwversion: 1.41.0-wmf.30 * reqId: 6d4ac30d-328c-4adf-99c7-74ca434ab1df * [[ https://logstash.wikimedia.org/app/dashboards#/view/AXFV7JE83bOlOASGccsT?_g=(time:(fr...
[21:43:14] SRE-swift-storage, Commons, MediaWiki-File-management: Some or all of the undeletion failed - https://phabricator.wikimedia.org/T348937 (TheresNoTime)
[21:43:53] is it just me, or have there been a spike in Swift problems lately..?
[21:44:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:45:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:48:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:50:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:54:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P52960 and previous config saved to /var/cache/conftool/dbconfig/20231015-215401-arnaudb.json
[22:00:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:01:17] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:02:37] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:03:29] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:09:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P52961 and previous config saved to /var/cache/conftool/dbconfig/20231015-220907-arnaudb.json
[22:15:01] SRE-swift-storage, MinervaNeue: Image thumbs on Persian Wikipedia are broken when you click on them - https://phabricator.wikimedia.org/T348939 (TheresNoTime)
[22:21:21] SRE-swift-storage, MinervaNeue: Image thumbs on Persian Wikipedia are broken when you click on them - https://phabricator.wikimedia.org/T348939 (TheresNoTime) nb. https://fa.m.wikipedia.org/wiki/%DA%AF%D8%B1%D8%A8%D9%87#/media/File:Cat_poster_1.jpg?uselang=fa does not work, but https://fa.m.wikipedia.org...
[22:24:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T343198)', diff saved to https://phabricator.wikimedia.org/P52962 and previous config saved to /var/cache/conftool/dbconfig/20231015-222414-arnaudb.json
[22:24:16] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[22:24:19] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[22:24:29] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[22:24:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T343198)', diff saved to https://phabricator.wikimedia.org/P52963 and previous config saved to /var/cache/conftool/dbconfig/20231015-222435-arnaudb.json
[22:30:53] SRE-swift-storage, MinervaNeue, Thumbor: Image thumbs on Persian Wikipedia are broken when you click on them - https://phabricator.wikimedia.org/T348939 (TheresNoTime) Looks to fail around here, so tagging #thumbor * mwversion: 1.41.0-wmf.30 * reqId: 9ab3256d-5f15-4483-9596-e06981716ce2 * [[ https:...
[22:42:16] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:53:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:09:22] SRE-swift-storage, Commons, MediaWiki-Uploading, Unstewarded-production-error, Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (Yodin) I've tried several times to upload [[ https://commons.wikimedia.org/w...