[00:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:54:13] Tool-lexeme-forms, translatewiki.net, LPL Essential (LPL Essential 2024 Nov-Dec), Patch-For-Review, Unplanned-Sprint-Work: translatewiki export for Wikidata Lexeme Forms tries to remove sh-latn translations - https://phabricator.wikimedia.org/T379188#10389348 (Nikerabbit)
[07:54:22] Tool-lexeme-forms, translatewiki.net, LPL Essential (LPL Essential 2024 Nov-Dec), Patch-For-Review, Unplanned-Sprint-Work: translatewiki export for Wikidata Lexeme Forms tries to remove sh-latn translations - https://phabricator.wikimedia.org/T379188#10389349 (Nikerabbit)
[08:27:20] cloud-services-team, Data-Services, Data-Persistence: pt-heartbeat updates table even if read_only=1 - https://phabricator.wikimedia.org/T381690#10389362 (fnegri) p:Triage→Low @Marostegui thanks for the info! For ToolsDB, I'm not sure we want to remove `READ_ONLY ADMIN` for root, but having...
[09:09:27] cloud-services-team, Horizon, Toolforge: Trove DB full - https://phabricator.wikimedia.org/T381745 (Magnus) NEW
[09:11:15] cloud-services-team, Cloud-VPS (Quota-requests): Trove DB full - https://phabricator.wikimedia.org/T381745#10389471 (JJMC89)
[09:47:50] FIRING: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:52:50] RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:22:34] FIRING: TooManyCloudcontrolsDown: No availability for CloudVPS codfw - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TooManyCloudcontrolsDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DTooManyCloudcontrolsDown
[10:22:34] FIRING: TooManyCloudgwsDown: No availability for CloudVPS codfw - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TooManyCloudgwsDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DTooManyCloudgwsDown
[10:22:34] FIRING: TooManyCloudnetsDown: No availability for CloudVPS codfw - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TooManyCloudnetsDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DTooManyCloudnetsDown
[10:24:50] FIRING: TooManyCloudcontrolsDown: #page No availability for CloudVPS eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TooManyCloudcontrolsDown - https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?orgId=1&refresh=15m - https://alerts.wikimedia.org/?q=alertname%3DTooManyCloudcontrolsDown
[10:24:50] FIRING: TooManyCloudnetsDown: #page No availability for CloudVPS eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TooManyCloudnetsDown - https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?orgId=1&refresh=15m - https://alerts.wikimedia.org/?q=alertname%3DTooManyCloudnetsDown
[10:24:50] FIRING: TooManyCloudgwsDown: #page No availability for CloudVPS eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TooManyCloudgwsDown - https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?orgId=1&refresh=15m - https://alerts.wikimedia.org/?q=alertname%3DTooManyCloudgwsDown
[10:24:58] cloud-services-team: TooManyCloudcontrolsDown # page No availability for CloudVPS eqiad - https://phabricator.wikimedia.org/T381760 (phaultfinder) NEW
[10:24:59] cloud-services-team: TooManyCloudnetsDown # page No availability for CloudVPS eqiad - https://phabricator.wikimedia.org/T381761 (phaultfinder) NEW
[10:25:01] cloud-services-team: TooManyCloudgwsDown # page No availability for CloudVPS eqiad - https://phabricator.wikimedia.org/T381762 (phaultfinder) NEW
[10:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:23:29] (PS1) Btullis: Add hadoop/HTTP keytabs for labs hadoop workers [labs/private] - https://gerrit.wikimedia.org/r/1101490 (https://phabricator.wikimedia.org/T381087)
[11:24:15] (CR) Btullis: [V:+2 C:+2] Add hadoop/HTTP keytabs for labs hadoop workers [labs/private] - https://gerrit.wikimedia.org/r/1101490 (https://phabricator.wikimedia.org/T381087) (owner: Btullis)
[11:44:26] (approved) raymond-ndibe: [toolforge-weld] refactor parse_quantity [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/64 (https://phabricator.wikimedia.org/T361120)
[11:44:32] (merge) raymond-ndibe: [toolforge-weld] refactor parse_quantity [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/64 (https://phabricator.wikimedia.org/T361120)
[13:46:10] (PS1) Btullis: Add a truststore password for the hadoopcluster in labs [labs/private] - https://gerrit.wikimedia.org/r/1101528 (https://phabricator.wikimedia.org/T381087)
[13:46:34] (CR) Btullis: [V:+2 C:+2] Add a truststore password for the hadoopcluster in labs [labs/private] - https://gerrit.wikimedia.org/r/1101528 (https://phabricator.wikimedia.org/T381087) (owner: Btullis)
[14:04:34] (PS1) Btullis: Add hadoop keystore_keypassword [labs/private] - https://gerrit.wikimedia.org/r/1101530 (https://phabricator.wikimedia.org/T381087)
[14:05:24] (PS2) Btullis: Add hadoop keystore_keypassword [labs/private] - https://gerrit.wikimedia.org/r/1101530 (https://phabricator.wikimedia.org/T381087)
[14:05:40] (CR) Btullis: [V:+2 C:+2] Add hadoop keystore_keypassword [labs/private] - https://gerrit.wikimedia.org/r/1101530 (https://phabricator.wikimedia.org/T381087) (owner: Btullis)
[14:14:40] (PS1) Btullis: Remove hadoop_clusters_secrets for labs from common.yaml [labs/private] - https://gerrit.wikimedia.org/r/1101532 (https://phabricator.wikimedia.org/T381087)
[14:20:10] (CR) Btullis: [V:+2 C:+2] Remove hadoop_clusters_secrets for labs from common.yaml [labs/private] - https://gerrit.wikimedia.org/r/1101532 (https://phabricator.wikimedia.org/T381087) (owner: Btullis)
[14:26:28] FIRING: InstanceDown: Project tools instance tools-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:31:28] RESOLVED: InstanceDown: Project tools instance tools-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:35:26] (PS1) Btullis: Add HTTP keytabs to hadoop masters in labs [labs/private] - https://gerrit.wikimedia.org/r/1101534 (https://phabricator.wikimedia.org/T381087)
[14:35:42] (CR) Btullis: [V:+2 C:+2] Add HTTP keytabs to hadoop masters in labs [labs/private] - https://gerrit.wikimedia.org/r/1101534 (https://phabricator.wikimedia.org/T381087) (owner: Btullis)
[14:42:19] (PS1) Btullis: Revert "Remove hadoop_clusters_secrets for labs from common.yaml" [labs/private] - https://gerrit.wikimedia.org/r/1101535
[14:42:27] (CR) Btullis: [V:+2 C:+2] Revert "Remove hadoop_clusters_secrets for labs from common.yaml" [labs/private] - https://gerrit.wikimedia.org/r/1101535 (owner: Btullis)
[15:02:41] Tool-letaxobot: Upgrade selectors - https://phabricator.wikimedia.org/T381776 (LD) NEW
[15:05:15] Tool-letaxobot: Epic : Upgrade selectors - https://phabricator.wikimedia.org/T381776#10390389 (LD) p:Triage→Medium
[15:08:15] cloud-services-team, Toolforge (Quota-requests): Increase kubernetes quota for tools.multichill - https://phabricator.wikimedia.org/T380902#10390395 (taavi) +1
[15:10:14] Tool-letaxobot: Refine category selection - https://phabricator.wikimedia.org/T381778 (LD) NEW
[15:10:25] Tool-letaxobot: Refine category selection - https://phabricator.wikimedia.org/T381778#10390422 (LD) Open→In progress
[15:10:54] Tool-letaxobot: Refine category selection - https://phabricator.wikimedia.org/T381778#10390424 (LD)
[15:10:55] Tool-letaxobot: Epic : Upgrade selectors - https://phabricator.wikimedia.org/T381776#10390425 (LD)
[15:11:44] Tool-letaxobot: Refine category selection - https://phabricator.wikimedia.org/T381778#10390427 (LD)
[15:12:14] cloud-services-team, Cloud-VPS (Quota-requests): Trove DB full - https://phabricator.wikimedia.org/T381745#10390429 (rook) If I'm understanding this correctly, you have a trove db in a cloud vps project. I believe you should be able to resize the db in horizon by going to databases > instances > select t...
[15:22:26] cloud-services-team, Cloud-VPS (Quota-requests): Trove DB full - https://phabricator.wikimedia.org/T381745#10390464 (Magnus) It does not, unfortunately: ` Error: Unable to resize volume. Details Quota exceeded for resources: ['volumes']. (HTTP 413) `
[15:35:26] cloud-services-team, Cloud-VPS, Infrastructure-Foundations, Puppet-Core: Facter 4 upgrade removed 'mountpoints' fact, breaking cinderutils::ensure - https://phabricator.wikimedia.org/T381639#10390502 (jhathaway) p:Triage→High
[15:55:08] cloud-services-team, DC-Ops, ops-codfw, SRE: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T380479#10390547 (Jhancock.wm) @Andrew we wanna swap the power supplies. It looks like all three happened on PSU2. We need to shut it off to s...
[16:06:13] cloud-services-team, Cloud-VPS (Quota-requests): Trove DB full - https://phabricator.wikimedia.org/T381745#10390596 (rook) Sounds good. What's the project name (I don't see a glamtools project)? And what size would you like the db volume? I believe the default is 20 gigabyte
[16:11:54] cloud-services-team, Toolforge (Quota-requests): Increase kubernetes quota for tools.multichill - https://phabricator.wikimedia.org/T380902#10390601 (rook) Memory limit updated to 16Gi ` rook@tools-bastion-13:~$ kubectl sudo edit quota -n tool-multichill resourcequota/tool-multichill edited rook@tools-ba...
[16:25:23] cloud-services-team, Toolforge (Quota-requests): Increase kubernetes quota for tools.multichill - https://phabricator.wikimedia.org/T380902#10390648 (rook) Open→Resolved a:rook
[17:32:01] cloud-services-team, Cloud-VPS, Infrastructure-Foundations, Puppet-Core: Facter 4 upgrade removed 'mountpoints' fact, breaking cinderutils::ensure - https://phabricator.wikimedia.org/T381639#10390870 (taavi) Open→Resolved a:jhathaway
[18:21:57] Tool-wikiqanda, Future-Audiences: [Bug] Investigate issues from internal testing - https://phabricator.wikimedia.org/T380799#10391102 (Maryana) Open→Resolved
[18:22:00] Tool-wikiqanda, Future-Audiences: Add Slack Support to Bot - https://phabricator.wikimedia.org/T379786#10391104 (Maryana) Open→Resolved
[18:22:06] Tool-video-answer-tool, Future-Audiences: Make video narration sped up in preview - https://phabricator.wikimedia.org/T379665#10391107 (Maryana) Open→Resolved a:Maryana
[18:22:10] Tool-wikiqanda, Future-Audiences: Upgrade to llama3.3 - https://phabricator.wikimedia.org/T381705#10391110 (Maryana) Open→Resolved a:Maryana
[18:26:24] Tool-wikiqanda, Future-Audiences: Review data logging practices for past experiments - https://phabricator.wikimedia.org/T380655#10391132 (Maryana) Open→Resolved a:Maryana
[18:27:25] Tool-wikiqanda, Future-Audiences: Refactoring btw Slack & Discord - https://phabricator.wikimedia.org/T381795 (Maryana) NEW
[19:57:43] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10391423 (NPRB) I just did but I don't receive an email
[20:14:33] cloud-services-team: Alert for "TooManyCloud*Down" are firing when they should not - https://phabricator.wikimedia.org/T381807 (fnegri) NEW
[20:16:34] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: Alert for "TooManyCloud*Down" are firing when they should not - https://phabricator.wikimedia.org/T381807#10391466 (fnegri) Open→In progress p:Triage→High a:fnegri A first fix was attempted in https://gerrit.wikimedia.org/r/c/operati...
[20:23:08] cloud-services-team: TooManyCloudgwsDown # page No availability for CloudVPS eqiad - https://phabricator.wikimedia.org/T381762#10391483 (fnegri) Open→Resolved a:fnegri False alarm, caused by {T381807}.
[20:23:10] cloud-services-team: TooManyCloudnetsDown # page No availability for CloudVPS eqiad - https://phabricator.wikimedia.org/T381761#10391490 (fnegri) False alarm, caused by {T381807}.
[20:23:17] cloud-services-team: TooManyCloudcontrolsDown # page No availability for CloudVPS eqiad - https://phabricator.wikimedia.org/T381760#10391494 (fnegri) False alarm, caused by {T381807}.
[20:36:03] cloud-services-team: TooManyCloudnetsDown # page No availability for CloudVPS eqiad - https://phabricator.wikimedia.org/T381761#10391532 (fnegri) Open→Resolved a:fnegri
[20:36:16] cloud-services-team: TooManyCloudcontrolsDown # page No availability for CloudVPS eqiad - https://phabricator.wikimedia.org/T381760#10391534 (fnegri) Open→Resolved a:fnegri
[20:38:58] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge: ToolsDB: simplify volume chain - https://phabricator.wikimedia.org/T335593#10391536 (fnegri) In progress→Resolved > tools-db-basesnapshot1 is currently stuck in Deleting This deletion completed successfully over the weekend, and after that...
[23:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks