[00:11:17] (open) ahecht: Draft: Rewrite to use uWSGI instead of deprecated CGI/Python [toolforge-repos/afdstats] - https://gitlab.wikimedia.org/toolforge-repos/afdstats/-/merge_requests/2
[00:11:40] (approved) ahecht: Draft: Rewrite to use uWSGI instead of deprecated CGI/Python [toolforge-repos/afdstats] - https://gitlab.wikimedia.org/toolforge-repos/afdstats/-/merge_requests/2
[00:12:14] (update) ahecht: Draft: Rewrite to use uWSGI instead of deprecated CGI/Python [toolforge-repos/afdstats] - https://gitlab.wikimedia.org/toolforge-repos/afdstats/-/merge_requests/2
[00:13:48] (update) ahecht: Rewrite to use uWSGI instead of deprecated CGI/Python [toolforge-repos/afdstats] - https://gitlab.wikimedia.org/toolforge-repos/afdstats/-/merge_requests/2
[00:14:02] (merge) ahecht: Rewrite to use uWSGI instead of deprecated CGI/Python [toolforge-repos/afdstats] - https://gitlab.wikimedia.org/toolforge-repos/afdstats/-/merge_requests/2
[00:18:28] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/ranker] - https://gitlab.wikimedia.org/toolforge-repos/ranker/-/merge_requests/3 (owner: l10n-bot)
[00:18:31] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/ranker] - https://gitlab.wikimedia.org/toolforge-repos/ranker/-/merge_requests/3 (owner: l10n-bot)
[00:26:02] cloud-services-team, Tool-ranker, Toolforge: New tool pods failing to start due to: whoami: cannot find name for user ID 54606 - https://phabricator.wikimedia.org/T385847 (LucasWerkmeister) NEW
[00:31:48] cloud-services-team, Tool-ranker, Toolforge: New tool pods failing to start due to: whoami: cannot find name for user ID 54606 - https://phabricator.wikimedia.org/T385847#10530888 (LucasWerkmeister) Same thing happens in a second tool: `lang=shell-session,counterexample tools.lucaswerkmeister-test@t...
[00:33:15] cloud-services-team, Tool-ranker, Toolforge: New (Python?) tool pods failing to start due to: whoami: cannot find name for user ID 54606 - https://phabricator.wikimedia.org/T385847#10530889 (LucasWerkmeister)
[00:33:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1039.eqiad.wmnet}'
[00:34:44] cloud-services-team, Tool-ranker, Toolforge: New (Python?) tool pods failing to start due to: whoami: cannot find name for user ID 54606 - https://phabricator.wikimedia.org/T385847#10530893 (LucasWerkmeister) (I’m about to go to bed, but if anyone wants to start looking into this, feel free to muck a...
[00:53:26] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1039.eqiad.wmnet}'
[00:53:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1038.eqiad.wmnet}'
[01:02:14] cloud-services-team, Tool-ranker, Toolforge: New (Python?) tool pods failing to start due to: whoami: cannot find name for user ID 54606 - https://phabricator.wikimedia.org/T385847#10530966 (bd808) Something is funky with the python3.11 container: `lang=shell-session tools.lucaswerkmeister-test@tools...
[01:05:54] cloud-services-team, Tool-ranker, Toolforge: New (Python?) tool pods failing to start due to: whoami: cannot find name for user ID 54606 - https://phabricator.wikimedia.org/T385847#10530981 (bd808) I'm starting to think the issue is that NSS is busted on some of the Kubernetes nodes: `lang=shell-sess...
[01:12:45] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1038.eqiad.wmnet}'
[01:12:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1037.eqiad.wmnet}'
[01:20:26] Toolforge (Toolforge iteration 17): [gitlab-ci] twine 6.1.0 breaks pypi deploy - https://phabricator.wikimedia.org/T385853 (Raymond_Ndibe) NEW
[01:21:26] Toolforge (Toolforge iteration 17): [gitlab-ci] twine 6.1.0 breaks pypi deploy - https://phabricator.wikimedia.org/T385853#10531009 (Raymond_Ndibe)
[01:21:50] Toolforge (Toolforge iteration 17): [gitlab-ci] twine 6.1.0 breaks pypi deploy - https://phabricator.wikimedia.org/T385853#10531010 (Raymond_Ndibe)
[01:22:28] Toolforge (Toolforge iteration 17), Cloud-Services-Origin-Team: [gitlab-ci] twine 6.1.0 breaks pypi deploy - https://phabricator.wikimedia.org/T385853#10531013 (Raymond_Ndibe)
[01:22:59] cloud-services-team, Toolforge (Toolforge iteration 17): [gitlab-ci] twine 6.1.0 breaks pypi deploy - https://phabricator.wikimedia.org/T385853#10531014 (Raymond_Ndibe)
[01:26:13] cloud-services-team, Tool-ranker, Toolforge: New (Python?) tool pods failing to start due to: whoami: cannot find name for user ID 54606 - https://phabricator.wikimedia.org/T385847#10531015 (bd808) `lang=shell-session bd808@tools-cumin-1:~$ sudo cumin --force --timeout 30 'name:tools-k8s-worker-*' 'i...
[01:26:59] (open) raymond-ndibe: [gitlab-ci] use twine 6.0.1 for now [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/48 (https://phabricator.wikimedia.org/T385853)
[01:27:09] (update) raymond-ndibe: [gitlab-ci] use twine 6.0.1 for now [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/48 (https://phabricator.wikimedia.org/T385853)
[01:27:26] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-07
[01:27:30] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-07
[01:28:29] (update) raymond-ndibe: [gitlab-ci] use twine 6.0.1 for now [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/48 (https://phabricator.wikimedia.org/T385853)
[01:28:31] (approved) raymond-ndibe: [gitlab-ci] use twine 6.0.1 for now [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/48 (https://phabricator.wikimedia.org/T385853)
[01:28:35] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-07
[01:28:36] (merge) raymond-ndibe: [gitlab-ci] use twine 6.0.1 for now [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/48 (https://phabricator.wikimedia.org/T385853)
[01:28:39] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-07
[01:28:58] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-7
[01:31:44] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1037.eqiad.wmnet}'
[01:31:45] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1036.eqiad.wmnet}'
[01:32:56] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-7
[01:35:36] (update) raymond-ndibe: [jobs-api] replace load with diff_job runtime method [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T359804)
[01:36:53] (update) raymond-ndibe: [jobs-api] create seperate api.py and move flask things there [repos/cloud/toolforge/jobs-api] (diff_job_runtime_method) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804)
[01:37:18] cloud-services-team, Tool-ranker, Toolforge: New (Python?) tool pods failing to start due to: whoami: cannot find name for user ID 54606 - https://phabricator.wikimedia.org/T385847#10531036 (bd808) @Andrew did the needful to reboot tools-k8s-worker-nfs-7.tools.eqiad1.wikimedia.cloud. `id 55130` is wo...
[01:39:33] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1036.eqiad.wmnet}'
[01:39:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1041.eqiad.wmnet}'
[01:40:01] cloud-services-team, Tool-ranker, Toolforge: New (Python?) tool pods failing to start due to: whoami: cannot find name for user ID 54606 - https://phabricator.wikimedia.org/T385847#10531039 (bd808) I think that may have fixed things for you @LucasWerkmeister `lang=shell-session tools.lucaswerkmeiste...
[01:40:15] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt1041.eqiad.wmnet}'
[01:40:47] (update) raymond-ndibe: [jobs-api] replace load with diff_job runtime method [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T359804)
[01:41:36] (update) raymond-ndibe: [jobs-api] create seperate api.py and move flask things there [repos/cloud/toolforge/jobs-api] (diff_job_runtime_method) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804)
[01:47:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1041.eqiad.wmnet}'
[01:48:36] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=97) on hosts matched by 'D{cloudvirt1041.eqiad.wmnet}'
[01:53:42] cloud-services-team, Toolforge (Toolforge iteration 17), Patch-For-Review: [gitlab-ci] twine 6.1.0 breaks pypi deploy - https://phabricator.wikimedia.org/T385853#10531056 (Raymond_Ndibe) Open→In progress
[01:56:04] (update) raymond-ndibe: [jobs-api] replace load with diff_job runtime method [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T359804)
[01:57:49] (update) raymond-ndibe: [jobs-api] create seperate api.py and move flask things there [repos/cloud/toolforge/jobs-api] (diff_job_runtime_method) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804)
[02:00:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[02:00:12] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[02:06:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:46:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:29:18] (update) raymond-ndibe: [jobs-api] replace load with diff_job runtime method [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T359804)
[03:42:27] (update) raymond-ndibe: [jobs-api] replace load with diff_job runtime method [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T359804)
[03:50:17] (update) raymond-ndibe: [jobs-api] replace load with diff_job runtime method [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T359804)
[03:51:07] (update) raymond-ndibe: [jobs-api] create seperate api.py and move flask things there [repos/cloud/toolforge/jobs-api] (diff_job_runtime_method) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804)
[07:06:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:16:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:25:16] FIRING: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:30:16] FIRING: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:45:16] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:47:02] cloud-services-team, Tool-ranker, Toolforge: New (Python?) tool pods failing to start due to: whoami: cannot find name for user ID 54606 - https://phabricator.wikimedia.org/T385847#10531336 (LucasWerkmeister) Open→Resolved a:bd808 Thanks, it looks good to me now \o/ Also TIL `webserv...
[11:06:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:10:16] FIRING: [3x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:15:16] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:16:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:34:25] Cloud-Services, Wikidocumentaries: Wikidocumentaries is down - https://phabricator.wikimedia.org/T385871 (Susannaanas) NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more spec...
[12:05:04] cloud-services-team, Cloud-VPS, Wikidocumentaries: Wikidocumentaries is down - https://phabricator.wikimedia.org/T385871#10531570 (Susannaanas)
[12:08:22] cloud-services-team, Cloud-VPS, Wikidocumentaries: Wikidocumentaries is down - https://phabricator.wikimedia.org/T385871#10531572 (Susannaanas)
[12:10:07] cloud-services-team, Cloud-VPS, Wikidocumentaries: Wikidocumentaries is down - https://phabricator.wikimedia.org/T385871#10531574 (fnegri) The instance `hupu2.wikidocumentaries.eqiad1.wikimedia.cloud` was restarted 9 days ago. It's possible the web service must be restarted manually on that host, but...
[12:17:56] cloud-services-team, Cloud-VPS, Wikidocumentaries: Wikidocumentaries is down - https://phabricator.wikimedia.org/T385871#10531577 (fnegri) cc @TuukkaH, I see you have an account in the VM so maybe you know how to restart the web service.
[12:26:30] cloud-services-team, Cloud-VPS, Wikidocumentaries: Wikidocumentaries is down - https://phabricator.wikimedia.org/T385871#10531581 (Andrew) The / volume on this host was 100% full. I freed up a small bit of space by removing some old log files and running 'docker container prune'. I then rebooted the...
[12:36:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:46:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:47:58] !log root@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudnet.reboot_node for host cloudnet1005.eqiad.wmnet (T384946)
[13:49:38] PROBLEM - Host cloudnet1005 is DOWN: PING CRITICAL - Packet loss = 100%
[13:51:52] RECOVERY - Host cloudnet1005 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[13:52:23] !log root@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudnet.reboot_node (exit_code=0) for host cloudnet1005.eqiad.wmnet (T384946)
[14:57:45] cloud-services-team, Toolforge: [toolsdb] Remove apt pinning and upgrade to latest version - https://phabricator.wikimedia.org/T385885 (fnegri) NEW
[14:58:01] cloud-services-team, Toolforge: [toolsdb] Remove apt pinning and upgrade to latest version - https://phabricator.wikimedia.org/T385885#10531926 (fnegri) p:Triage→Medium
[17:30:39] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:35:31] FIRING: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-5 is lagging behind the primary, the current lag is 3668 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[17:35:39] RESOLVED: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:39:46] (update) raymond-ndibe: [jobs-api] replace load with diff_job runtime method [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T359804)
[17:41:31] FIRING: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[17:47:45] (update) raymond-ndibe: [jobs-api] replace load with diff_job runtime method [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T359804)
[18:01:31] RESOLVED: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[18:25:45] FIRING: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_toolserver_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:30:45] RESOLVED: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_toolserver_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:00:45] FIRING: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_toolserver_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:05:45] RESOLVED: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_toolserver_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:07:34] FIRING: DiskSpace: Disk space cloudbackup1002-dev:9100:/srv/cinder-backups 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:08:12] Toolforge (Toolforge iteration 17): [jobs-api] move jobs.toolforge.org/* labels to annotations - https://phabricator.wikimedia.org/T385904 (Raymond_Ndibe) NEW
[19:08:56] FIRING: SystemdUnitDown: The service unit postgresql@15-main.service is in failed status on host cloudbackup1002-dev. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:09:15] Toolforge (Toolforge iteration 17): [jobs-api] move jobs.toolforge.org/* labels to annotations - https://phabricator.wikimedia.org/T385904#10532568 (Raymond_Ndibe)
[19:09:19] cloud-services-team, Toolforge, Patch-For-Review: [jobs-api] Refactor before webservice support - https://phabricator.wikimedia.org/T359804#10532569 (Raymond_Ndibe)
[19:09:57] cloud-services-team, Toolforge (Toolforge iteration 17): [jobs-api] move jobs.toolforge.org/* labels to annotations - https://phabricator.wikimedia.org/T385904#10532570 (Raymond_Ndibe) a:Raymond_Ndibe
[19:24:55] PROBLEM - Disk space on cloudbackup1002-dev is CRITICAL: DISK CRITICAL - free space: /srv/cinder-backups 0MiB (0% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudbackup1002-dev&var-datasource=eqiad+prometheus/ops
[19:42:34] RESOLVED: DiskSpace: Disk space cloudbackup1002-dev:9100:/srv/cinder-backups 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:44:55] RECOVERY - Disk space on cloudbackup1002-dev is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudbackup1002-dev&var-datasource=eqiad+prometheus/ops
[19:55:26] RESOLVED: SystemdUnitDown: The service unit postgresql@15-main.service is in failed status on host cloudbackup1002-dev. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[20:11:11] cloud-services-team, Toolforge (Toolforge iteration 17): [jobs-api] move jobs.toolforge.org/* labels to annotations - https://phabricator.wikimedia.org/T385904#10532676 (Raymond_Ndibe)
[20:16:06] cloud-services-team, DC-Ops, Ganeti, Infrastructure-Foundations, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10532690 (MoritzMuehlenhoff) >>! In T383723#10530006, @VRiley-WMF wrote: > Thanks! Yeah, we wouldn't need much downtime fo...
[20:21:39] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:26:39] RESOLVED: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:48:31] cloud-services-team, Toolforge: toolforge-legacy-redirector: constant failed probes by prometheus - https://phabricator.wikimedia.org/T385908 (aborrero) NEW
[20:52:12] cloud-services-team, Toolforge: toolforge-legacy-redirector: constant failed probes by prometheus - https://phabricator.wikimedia.org/T385908#10532749 (aborrero) p:Triage→Low
[21:01:16] FIRING: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[21:06:16] RESOLVED: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[22:14:39] cloud-services-team, Toolforge: [toolsdb] tools-db-4 switched to read-only - https://phabricator.wikimedia.org/T385900#10532943 (fnegri) Looks like the mariadb process crashed and was restarted automatically by systemctl. When it restarts, mariadb is set to "read_only" for extra safety. Running `SET GLO...
[22:18:21] cloud-services-team, Toolforge: [toolsdb] tools-db-4 switched to read-only - https://phabricator.wikimedia.org/T385900#10532951 (fnegri) > The replag alert is still present currently. It's now catching up: {F58374443} Link to the Grafana dashboard: https://grafana.wmcloud.org/d/PTtEnEyVk/toolsdb-maria...
[22:20:31] RESOLVED: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-5 is lagging behind the primary, the current lag is 4205 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[22:20:38] cloud-services-team, Toolforge: [toolsdb] tools-db-4 switched to read-only - https://phabricator.wikimedia.org/T385900#10532963 (fnegri) aaaand it's back in sync {F58374454}
[22:34:45] FIRING: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_toolserver_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[22:39:45] RESOLVED: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_toolserver_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[23:00:31] FIRING: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[23:14:20] cloud-services-team, Toolforge: [toolsdb] tools-db-4 switched to read-only - https://phabricator.wikimedia.org/T385900#10533117 (Andrew) this just happened again -- DB is in read-only state and I switched it back to R/W
[23:15:31] cloud-services-team, Toolforge: [toolsdb] tools-db-4 switched to read-only - https://phabricator.wikimedia.org/T385900#10533118 (Andrew) ` Feb 07 20:25:26 tools-db-4 mysqld[73550]: 2025-02-07 20:25:26 250665 [Note] InnoDB: Cannot close file ./s51434__mixnmatch_p/entry.ibd because of 1 pending operations...
[23:15:31] RESOLVED: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[23:32:39] cloud-services-team, Toolforge: [toolsdb] tools-db-4 switched to read-only - https://phabricator.wikimedia.org/T385900#10533156 (Andrew) I'm thinking the next thing to try is an upgrade to see if we can dodge whatever bug this is. @fnegri if you're free to work on that on a Saturday let me know and I'll...