[00:01:02] !log raymondndibe@wmf3402 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[00:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[00:06:49] !log raymondndibe@wmf3402 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli
[00:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[00:07:15] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[00:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:07:55] (approved) raymond-ndibe: d/changelog: bump to 1.6.3 [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/58
[00:07:58] (merge) raymond-ndibe: d/changelog: bump to 1.6.3 [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/58
[00:14:27] !log raymondndibe@wmf3402 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli
[00:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:20:29] (approved) raymond-ndibe: d/changelog: bump to 16.1.5 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/73
[00:20:43] (merge) raymond-ndibe: d/changelog: bump to 16.1.5 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/73
[00:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[01:24:25] (open) samwilson: Switch to codex-design-tokens [toolforge-repos/wdlocator] - https://gitlab.wikimedia.org/toolforge-repos/wdlocator/-/merge_requests/30
[01:28:58] (merge) samwilson: Switch to codex-design-tokens [toolforge-repos/wdlocator] - https://gitlab.wikimedia.org/toolforge-repos/wdlocator/-/merge_requests/30
[04:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:25:28] wikitech.wikimedia.org, Wikimedia-Site-requests: fold contentadmin group to sysop in Wikitech - https://phabricator.wikimedia.org/T375950 (Bugreporter) NEW
[04:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:20:20] wikitech.wikimedia.org, Wikimedia-Site-requests: fold contentadmin group to sysop in Wikitech - https://phabricator.wikimedia.org/T375950#10185296 (Peachey88) Has this been raised on wiki at all for consensus?
[09:20:21] wikitech.wikimedia.org, Wikimedia-Site-requests: fold contentadmin group to sysop in Wikitech - https://phabricator.wikimedia.org/T375950#10185323 (taavi) >>! In T375950#10185296, @Peachey88 wrote: > Has this been raised on wiki or discussed with anyone at all for consensus? For Wikitech these kinds of t...
[09:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:16:53] Toolforge-standards-committee, WMF-NDA-Requests: Volunteer NDA for Lucas Werkmeister - https://phabricator.wikimedia.org/T375001#10185342 (LucasWerkmeister) Oh wait, I totally forgot I signed an NDA for T314527 as well. (And indeed L3 tells me I’ve signed it.) That probably means this task is done?
[11:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:32:40] cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops, SRE: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10185344 (taavi)
[11:32:47] Cloud-VPS, Infrastructure-Foundations, netops, SRE: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#10185345 (taavi)
[11:33:08] Cloud-VPS, Infrastructure-Foundations, netops, SRE: netbox: create IPv6 entries for Cloud VPS - https://phabricator.wikimedia.org/T374712#10185346 (taavi)
[11:33:19] cloud-services-team, Cloud-VPS, SRE: openstack: verify security groups settings for IPv6 - https://phabricator.wikimedia.org/T374714#10185347 (taavi)
[11:33:29] cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops, SRE: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10185348 (taavi)
[11:33:34] cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops, SRE: cloudgw: add support and enable IPv6 - https://phabricator.wikimedia.org/T374716#10185349 (taavi)
[11:33:41] cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops, SRE: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10185350 (taavi)
[11:33:47] cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops, SRE: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10185351 (taavi)
[12:21:47] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Patch-For-Review: cloudgw: add cloud-private subnet support - https://phabricator.wikimedia.org/T338334#10185391 (taavi)
[12:21:56] cloud-services-team, Cloud-VPS: cloudgw improvements - https://phabricator.wikimedia.org/T347469#10185392 (taavi)
[12:22:06] cloud-services-team, Cloud-VPS, Patch-For-Review: cloudgw: replace keepalived with BGP - https://phabricator.wikimedia.org/T347687#10185393 (taavi)
[12:22:36] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Infrastructure-Foundations, netops: cloud: edge network suffers downtime if one cloudsw is down - https://phabricator.wikimedia.org/T375259#10185394 (taavi)
[12:22:45] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [cloudceph] Improve downtime when a switch goes down - https://phabricator.wikimedia.org/T375204#10185395 (taavi)
[12:22:53] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10185396 (taavi)
[12:23:00] cloud-services-team (Kanban), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: Setup cloudcephosd10[25-34] into the ceph eqiad cluster - https://phabricator.wikimedia.org/T314870#10185397 (taavi)
[12:23:54] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Toolforge (Toolforge iteration 14), Cloud-Services-Origin-Team, and 3 others: [maintain-dbusers] Generate prometheus metrics - https://phabricator.wikimedia.org/T332955#10185398 (taavi)
[12:24:48] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - https://phabricator.wikimedia.org/T374005#10185399 (taavi)
[12:25:26] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: 2024-09-10: hardware error on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374467#10185400 (taavi)
[12:25:27] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [cloudinfra] Upgrade cloudinfra-idp-* to bookworm - https://phabricator.wikimedia.org/T373840#10185401 (taavi)
[12:26:07] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [nova-api,cloudrabbit] Connectivity issues from all cloudcontrols to all cloudrabbit nodes - https://phabricator.wikimedia.org/T356621#10185402 (taavi)
[12:26:23] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned: [puppet] Remove expired and unused certs from modules/profile/files/ssl/ and modules/base/files/ca - https://phabricator.wikimedia.org/T354295#10185406 (taavi)
[12:27:00] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned: [puppet] Remove expired and unused certs from modules/profile/files/ssl/ and modules/base/files/ca - https://phabricator.wikimedia.org/T354294#10185404 (taavi) →Duplicate dup:T354295
[12:27:01] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [puppetmaster-02.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud] puppet failing to run - https://phabricator.wikimedia.org/T353048#10185407 (taavi)
[12:27:06] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [openstack] cloudcontrols getting out of space due to nova-api.log message 'XXX lineno: 104, opcode: 120' - https://phabricator.wikimedia.org/T352635#10185408 (taavi)
[12:28:32] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Cloud-Services-Origin-User, Cloud-Services-Worktype-Maintenance: [webservice shell] Allow a user to delete/stop all running shell pods - https://phabricator.wikimedia.org/T349733#10185414 (taavi)
[12:28:43] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned: [ceph] Enable disk failure prediciton - https://phabricator.wikimedia.org/T349694#10185415 (taavi)
[12:28:45] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [openstack] cloudcontrols getting out of space due to nova-api.log message 'XXX lineno: 104, opcode: 120' - https://phabricator.wikimedia.org/T352635#10185410 (taavi) Almost a year...
[12:28:48] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [puppetmaster-02.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud] puppet failing to run - https://phabricator.wikimedia.org/T353048#10185412 (taavi) Open→Resolved Pleas...
[12:29:34] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Infrastructure-Foundations: Remove wmcs-admin access from production cumin hosts - https://phabricator.wikimedia.org/T347979#10185417 (taavi) What's the future of this now that {T347977} was declined?
[12:32:27] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, Patch-For-Review: [promethus,haproxy] Move to haproxy internal metrics from haproxy_exporter - https://phabricator.wikimedia.org/T343885#10185421 (taavi)
[12:36:51] cloud-services-team (FY2024/2025-Q1-Q2), Horizon, Cloud-Services-Origin-User, Cloud-Services-Worktype-Maintenance: [horizon] Log in timing out due to nutcracker being stopped - https://phabricator.wikimedia.org/T333561#10185426 (taavi)
[12:39:55] cloud-services-team (FY2024/2025-Q1-Q2), Horizon, Cloud-Services-Origin-User, Cloud-Services-Worktype-Maintenance: [horizon] Log in timing out due to nutcracker being stopped - https://phabricator.wikimedia.org/T333561#10185428 (taavi) Open→Resolved a:dcaro Closing since this hasn...
[12:39:59] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned: [ceph] Investigate if there's a way to degrade instead of failing when jumbo frames are being dropped in the network - https://phabricator.wikimedia.org/T329778#10185432 (taavi)
[12:40:04] cloud-services-team, Cloud-VPS: codfw1dev: rabbitmq is not working because some auth failures - https://phabricator.wikimedia.org/T374002#10185433 (taavi)
[12:40:12] cloud-services-team, Cloud-VPS: Puppet fails on cloudcontrol when updating /srv/tofu-infra - https://phabricator.wikimedia.org/T373815#10185434 (taavi)
[12:40:22] cloud-services-team, Cloud-VPS: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T373632#10185435 (taavi)
[12:41:02] cloud-services-team, Cloud-VPS, SRE Observability (FY2024/2025-Q2): Remove librenms -> graphite integration, replace with gnmi - https://phabricator.wikimedia.org/T372457#10185436 (taavi)
[12:45:24] cloud-services-team, Data-Services, Infrastructure-Foundations, netops: clouddb: evaluate moving them into cloud-private - https://phabricator.wikimedia.org/T357543#10185443 (taavi)
[12:45:45] cloud-services-team, Cloud-VPS, Wikidata, Wikidata Analytics: Delete the wmdeanalytics Cloud VPS project - https://phabricator.wikimedia.org/T371696#10185439 (taavi) Please delete all resources (especially the VMs) from the project first before requesting that the project be deleted.
[12:45:46] cloud-services-team, Cloud-VPS, Wikidata, Wikidata Analytics: Delete the wmdeanalytics Cloud VPS project - https://phabricator.wikimedia.org/T371696#10185441 (taavi)
[12:45:51] cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: [ceph] export number of bad sectors per-disk - https://phabricator.wikimedia.org/T348716#10185444 (taavi)
[12:46:12] cloud-services-team, Cloud-VPS: openstack: consider automating DB grants - https://phabricator.wikimedia.org/T346619#10185445 (taavi)
[12:47:51] cloud-services-team: WMCS: hundred of phabricator tickets were created for some alerts - https://phabricator.wikimedia.org/T333315#10185446 (taavi) Open→Resolved Boldly closing this rather old task.
[12:48:29] cloud-services-team, Cloud-VPS: codfw1dev: evaluate if extra floating IPs are required today - https://phabricator.wikimedia.org/T329041#10185450 (taavi)
[12:49:49] cloud-services-team, wikitech.wikimedia.org: update labtestwiki user and password - https://phabricator.wikimedia.org/T328289#10185456 (taavi)
[12:50:26] cloud-services-team, Cloud-VPS: Move ceph keyring/caps management into puppet if possible - https://phabricator.wikimedia.org/T275556#10185457 (taavi)
[12:50:56] cloud-services-team, Cloud-VPS: Allow encrypting connection from the shared Cloud VPS web proxy to the backend - https://phabricator.wikimedia.org/T274386#10185458 (taavi)
[12:51:13] cloud-services-team, Cloud-VPS: [backups] Periodically cleanup non-handled backups - https://phabricator.wikimedia.org/T273723#10185459 (taavi)
[12:51:17] cloud-services-team, Cloud-VPS: [ceph][rbd] Periodically cleanup dangling snapshots - https://phabricator.wikimedia.org/T273720#10185460 (taavi)
[12:51:31] cloud-services-team, Cloud-VPS: [wmcs][prometheus] Spread the stats gathering through time - https://phabricator.wikimedia.org/T271793#10185461 (taavi)
[12:51:35] cloud-services-team, Cloud-VPS: OpenStack services should use system users to talk to Keystone - https://phabricator.wikimedia.org/T273150#10185462 (taavi)
[12:51:47] cloud-services-team, Cloud-VPS: codfw1dev: evaluate if extra floating IPs are required today - https://phabricator.wikimedia.org/T329041#10185452 (taavi) Open→Resolved a:taavi https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007566 consolidated the Puppet code so boldly closing.
[12:52:32] cloud-services-team, Cloud-VPS: CamelCase vs. VPS instance naming - https://phabricator.wikimedia.org/T176757#10185467 (taavi)
[12:52:46] cloud-services-team, Cloud-VPS: DNS labsaliaser (mostly) no longer needed on Neutron - https://phabricator.wikimedia.org/T207859#10185468 (taavi)
[12:53:00] cloud-services-team: cloud NFS: try out stretch <-> buster DRBD replication and other migration stuff - https://phabricator.wikimedia.org/T291068#10185464 (taavi) Stalled→Invalid This is obsolete on the NFS-on-VMs world.
[12:54:27] cloud-services-team, Cloud-VPS, DC-Ops: puppet: autoinstall: partman/raid1-lvm-xfs-nova.cfg requires d-i interaction - https://phabricator.wikimedia.org/T215505#10185470 (taavi) Open→Invalid This recipe no longer exists so closing.
[12:54:28] cloud-services-team, Data-Services: labstore: Re-evaluate traffic shaping settings - https://phabricator.wikimedia.org/T218338#10185473 (taavi)
[12:54:30] cloud-services-team, wikitech.wikimedia.org: Check potential legacy classes on labweb - https://phabricator.wikimedia.org/T192382#10185474 (taavi)
[12:58:19] cloud-services-team: IP header IN errors on cloud networks - https://phabricator.wikimedia.org/T219952#10185483 (taavi) Open→Resolved Looking at that panel this doesn't seem to be an issue anymore.
[12:58:29] cloud-services-team, Cloud-VPS: OpenStack should track project metadata - https://phabricator.wikimedia.org/T200618#10185476 (taavi) Openstack-browser now shows the project description which tends to have some details for newer projects at least. Is that good enough to declare this task done? cc @aborre...
[12:58:31] cloud-services-team, Cloud-VPS, Patch-For-Review: Collect access metrics from cloud-vps web proxy - https://phabricator.wikimedia.org/T371382#10185480 (taavi) Dupe of {T103726} (from 2015)?
[12:59:43] cloud-services-team, Cloud-VPS: nova-fullstack: add cleanup checking - https://phabricator.wikimedia.org/T235129#10185488 (taavi)
[13:00:20] cloud-services-team: debian installer prompts in cloudvirt servers partman configuration - https://phabricator.wikimedia.org/T212855#10185486 (taavi) Open→Resolved Boldly assuming this is no longer an issue.
[13:02:21] cloud-services-team, Cloud-VPS: nova-fullstack: add cleanup checking - https://phabricator.wikimedia.org/T235129#10185490 (taavi) Open→Resolved a:Andrew This looks complete? Please re-open if not.
[13:02:51] cloud-services-team, Data-Services: Document the new NFS setup on cloudstore1008/9 - https://phabricator.wikimedia.org/T224510#10185494 (taavi) Open→Invalid Assuming this is obsolete in the NFS-on-VMs world.
[13:16:34] cloud-services-team, Cloud-VPS: bad failure cases for wmcs custom puppet enc - https://phabricator.wikimedia.org/T262350#10185505 (taavi)
[13:16:45] cloud-services-team, Cloud-VPS: Learn about Barbican - https://phabricator.wikimedia.org/T263680#10185506 (taavi)
[13:17:35] cloud-services-team, Tools, Privacy Engineering, Privacy: fontcdn.toolforge.org loads assets for detail views directly from google - https://phabricator.wikimedia.org/T258232#10185501 (taavi) Open→Resolved a:taavi Fixed with https://github.com/toolforge/fontcdn/commit/ddb5c1527d29...
[13:17:47] cloud-services-team, Cloud-VPS, Epic: Build Prometheus service for use by all Cloud VPS projects and their instances - https://phabricator.wikimedia.org/T266050#10185508 (taavi) Open→Resolved No, please create new tasks to track future work. This task was for the initial buildout that is...
[13:18:33] cloud-services-team, Cloud-VPS, Epic: Build Prometheus service for use by all Cloud VPS projects and their instances - https://phabricator.wikimedia.org/T266050#10185512 (taavi)
[13:18:33] Cloud-VPS: metricsinfra: Build out default alert rules - https://phabricator.wikimedia.org/T288168#10185513 (taavi)
[13:18:34] Cloud-VPS: Deploy prometheus pushgateway to metricsinfra - https://phabricator.wikimedia.org/T287349#10185515 (taavi)
[13:18:36] Cloud-VPS: Enable self-service Prometheus configuration management for project administrators - https://phabricator.wikimedia.org/T284993#10185516 (taavi)
[13:18:37] cloud-services-team, Cloud-VPS, SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup): Add external meta-monitoring for metricsinfra - https://phabricator.wikimedia.org/T288053#10185514 (taavi)
[13:18:40] cloud-services-team, Cloud-VPS, Toolforge: Create alerts for bastion hosts - Usage and latency - https://phabricator.wikimedia.org/T186552#10185517 (taavi)
[13:19:59] cloud-services-team, Data-Services, Tools: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's - https://phabricator.wikimedia.org/T267992#10185523 (taavi)
[13:20:01] cloud-services-team, Cloud-VPS: [ceph][grafana] Add a graph on misplaced objects/recovery rate with estimations on recovery time - https://phabricator.wikimedia.org/T268729#10185524 (taavi)
[13:20:13] cloud-services-team, Cloud-VPS: [ceph][icinga] Use health details when checking the alarm and include the info in icinga alerts - https://phabricator.wikimedia.org/T268885#10185525 (taavi)
[13:20:43] cloud-services-team, Cloud-VPS: [ceph] Find a way to use config management for ceph config commands - https://phabricator.wikimedia.org/T273862#10185528 (taavi)
[13:21:20] cloud-services-team, Cloud-VPS: ceph: check time sync setup - https://phabricator.wikimedia.org/T275860#10185530 (taavi)
[13:21:22] cloud-services-team, Cloud-VPS: Investigate cinder volume multi-attach - https://phabricator.wikimedia.org/T275083#10185531 (taavi)
[13:21:35] cloud-services-team, Cloud-VPS: Update OpenStack nova API endpoint - https://phabricator.wikimedia.org/T275084#10185532 (taavi)
[13:21:56] cloud-services-team, Toolforge: toolschecker: naming refresh - https://phabricator.wikimedia.org/T277542#10185533 (taavi)
[13:22:12] cloud-services-team, Cloud-VPS: neutron: investigate using IRC conntrack helpers to improve IRC bots connectiviy - https://phabricator.wikimedia.org/T277549#10185534 (taavi)
[13:22:23] cloud-services-team, Cloud-VPS: cloud: neutron l3 agent: improve failover handling - https://phabricator.wikimedia.org/T268335#10185535 (taavi)
[13:22:26] cloud-services-team, Cloud-VPS: [ceph][icinga] Use health details when checking the alarm and include the info in icinga alerts - https://phabricator.wikimedia.org/T268885#10185527 (taavi) Is this relevant in the Alertmanager world?
[13:22:56] cloud-services-team, Cloud-VPS, Toolforge: [toolsbeta] Make sure that /etc/security/access.conf is managed by puppet on all machines - https://phabricator.wikimedia.org/T267138#10185519 (taavi) Open→Resolved a:taavi https://gerrit.wikimedia.org/r/c/operations/puppet/+/965461 should ha...
[13:23:18] cloud-services-team, Cloud-VPS: ceph.automation: Add non-host bound icinga downtimes to the upgrade scripts - https://phabricator.wikimedia.org/T281336#10185541 (taavi)
[13:25:26] cloud-services-team, Toolforge: tools.grafana: We are getting CORS errors when using the prometheus data source - https://phabricator.wikimedia.org/T280512#10185536 (taavi) Open→Resolved a:taavi
[13:25:27] cloud-services-team, Cloud-VPS: ceph.operationalization: Refactor the current puppet code to allow per-cluster configurations (as opposed to per-DC as it does currently) - https://phabricator.wikimedia.org/T281250#10185539 (taavi)
[13:25:28] cloud-services-team, Cloud-VPS: ceph: device health metrics are failing to get scraped - https://phabricator.wikimedia.org/T281350#10185544 (taavi)
[13:25:31] cloud-services-team, Data-Services: Clean up references to labsdb* hosts in puppet repo - https://phabricator.wikimedia.org/T282662#10185545 (taavi)
[13:25:32] cloud-services-team, Cloud-VPS: [cloud-ceph-performance-tests] allow linking to a specific report - https://phabricator.wikimedia.org/T283610#10185546 (taavi)
[13:25:34] cloud-services-team, Cloud-VPS: openstack: alert for cloudvirts without aggregate or with unexpected set of them - https://phabricator.wikimedia.org/T284747#10185547 (taavi)
[13:25:57] cloud-services-team: Create a standup bot for wikimedia-cloud-private - https://phabricator.wikimedia.org/T288251#10185549 (taavi) still relevant with the -cloud-daily channel?
[13:25:58] cloud-services-team, Cloud-VPS: ceph.automation: Add non-host bound icinga downtimes to the upgrade scripts - https://phabricator.wikimedia.org/T281336#10185543 (taavi) Is this still relevant with most things in Alertmanager now?
[13:26:06] cloud-services-team, Cloud-VPS: cloudvirt1025-1030 overheating issues - https://phabricator.wikimedia.org/T289159#10185553 (taavi)
[13:26:06] cloud-services-team, Cloud-VPS: Figure out how to delete Glance images - https://phabricator.wikimedia.org/T289502#10185554 (taavi)
[13:26:19] cloud-services-team, Cloud-VPS: Error parsing "/var/lib/prometheus/node.d/node_cloudvirt_libvirt_stats.prom" - https://phabricator.wikimedia.org/T289563#10185555 (taavi)
[13:26:25] cloud-services-team, Cloud-VPS: We almost never actually free up space from deleted VMs - https://phabricator.wikimedia.org/T289623#10185556 (taavi)
[13:27:11] cloud-services-team, Toolforge: [toolsbeta] Rebuild servers to learn how to take down the services without downtime (and use affinities) - https://phabricator.wikimedia.org/T267140#10185551 (taavi) Anything left to do here? I think most toolforge things have cookbooks now.
[13:28:29] cloud-services-team, Cloud-VPS: Offer additional images to choose from: in particular, our own immediate wish is for Ubuntu 20.04 - https://phabricator.wikimedia.org/T289495#10185558 (taavi) Open→Declined Boldly closing, this can be revisited if someone comes up for an actual reason why they...
[13:28:59] cloud-services-team: The labs-monitoring dashboard does not exist anymore, replace/remove the alert pointing to it - https://phabricator.wikimedia.org/T290306#10185561 (taavi) Open→Resolved a:dcaro Fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/916158 it seems.
[13:29:15] cloud-services-team, Cloud-VPS: monitor ldap functionality from within cloud-vps - https://phabricator.wikimedia.org/T292205#10185564 (taavi)
[13:30:09] cloud-services-team, Data-Services: NFS-on-ceph: monitoring - https://phabricator.wikimedia.org/T301279#10185569 (taavi)
[13:30:23] cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: cloud: review lldp setup on hypervisors and VMs - https://phabricator.wikimedia.org/T304504#10185570 (taavi)
[13:30:52] cloud-services-team, Cloud-VPS, wikitech.wikimedia.org: consider eliminating labweb/cloudweb hardware servers - https://phabricator.wikimedia.org/T305233#10185571 (taavi)
[13:30:53] cloud-services-team, Cloud-VPS: cloudswift @ codfw1dev need to be reworked to match eqiad1 setup - https://phabricator.wikimedia.org/T292427#10185566 (taavi) Open→Invalid cloudswift ended up becoming cloudlb and this is done with them.
[13:31:06] cloud-services-team, Cloud-VPS, wikitech.wikimedia.org: consider eliminating labweb/cloudweb hardware servers - https://phabricator.wikimedia.org/T305233#10185573 (taavi)
[13:31:10] cloud-services-team, wikitech.wikimedia.org: Move Wikitech onto the production MW cluster - https://phabricator.wikimedia.org/T237773#10185575 (taavi)
[13:31:10] wikitech.wikimedia.org, MW-on-K8s, serviceops: ☂ Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707#10185574 (taavi)
[13:31:31] cloud-services-team, Horizon, Striker, wikitech.wikimedia.org: consider eliminating labweb/cloudweb hardware servers - https://phabricator.wikimedia.org/T305233#10185576 (taavi)
[13:32:50] cloud-services-team, Tools: Webservices broken on buster grid - https://phabricator.wikimedia.org/T310097#10185578 (taavi) In progress→Invalid Not seeing anything actionable here other than poking maintainers, which would just mean this task would stay open until $END_OF_TIME.
[13:33:44] cloud-services-team, Cloud-VPS: Extend Tofu provider to allow configuration via environment variables - https://phabricator.wikimedia.org/T321250#10185583 (taavi)
[13:35:04] cloud-services-team, Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned: [registry-admission-webhook] Investigate why helm did not override the selector on the service on deployment - https://phabricator.wikimedia.org/T320665#10185581 (taavi) Is this still relevant?
[13:35:06] cloud-services-team, Cloud-VPS, Epic, Python3-Porting: WMCS: migrate python2 scripts to python3 - https://phabricator.wikimedia.org/T229920#10185587 (taavi)
[13:35:22] cloud-services-team, Cloud-VPS, Epic: Openstack: consider introducing log aggregation - https://phabricator.wikimedia.org/T255990#10185588 (taavi)
[13:38:37] cloud-services-team, Cloud-VPS, Epic: Openstack: consider introducing log aggregation - https://phabricator.wikimedia.org/T255990#10185590 (taavi) Open→Resolved I think this is done with logs now ending up in logstash.
[13:38:43] cloud-services-team, Cloud-VPS, Epic: Cloud: reduce NAT exceptions from cloud to production - https://phabricator.wikimedia.org/T272395#10185592 (taavi)
[13:38:46] cloud-services-team, Cloud-VPS: Cloud VPS: drop wmflabs names from profile::resolving::domain_search - https://phabricator.wikimedia.org/T305834#10185594 (taavi) Stalled→Open
[13:38:48] cloud-services-team, Cloud-VPS, Epic, Patch-For-Review: Cloud: reduce NAT exceptions from production to cloud - https://phabricator.wikimedia.org/T272585#10185593 (taavi)
[13:38:49] cloud-services-team, Toolforge: toolforge: consider relocating core k8s components out of puppet into its own repository - https://phabricator.wikimedia.org/T328539#10185595 (taavi)
[13:38:51] cloud-services-team, Cloud-VPS: [openstack] Review HAproxy setup to achieve load balancing or at least automatic failover - https://phabricator.wikimedia.org/T269687#10185597 (taavi)
[13:38:52] cloud-services-team, Cloud-VPS: cloudinfra hosts switching between 2 puppet changes / changes on every puppet run - https://phabricator.wikimedia.org/T263790#10185596 (taavi)
[13:38:54] cloud-services-team, Cloud-VPS: designate: database having deadlock problems - https://phabricator.wikimedia.org/T270762#10185598 (taavi)
[13:38:55] cloud-services-team, Cloud-VPS: Investigate and enable jumbo frames in cloudvirt nodes - https://phabricator.wikimedia.org/T273596#10185599 (taavi)
[13:40:01] cloud-services-team, VPS-Projects: cloudvirt-wdqs1001 getting out of space due to huge VM - https://phabricator.wikimedia.org/T273579#10185601 (taavi) Open→Invalid These cloudvirts were decom'd recently.
[13:42:50] cloud-services-team, Cloud-VPS, SRE: ceph: test and decide 1 network interface setup - https://phabricator.wikimedia.org/T325531#10185615 (taavi)
[13:43:06] cloud-services-team, Cloud-VPS: [cloudvirt] Enable and test jumbo frames to ceph osds - https://phabricator.wikimedia.org/T273792#10185611 (taavi) Dupe of {T273596}?
[13:43:27] Cloud-VPS, Infrastructure-Foundations, netbox, Puppet-Core: Make netbox the source of truth for cloudceph networks - https://phabricator.wikimedia.org/T338329#10185616 (taavi)
[13:44:13] cloud-services-team, Cloud-VPS, Toolforge, Sustainability (Incident Followup): cloud: monitor/alert on health of TLS certs used on shared front proxy setup - https://phabricator.wikimedia.org/T273959#10185605 (taavi) This should be relatively trivial with Prometheus + blackbox (if we haven't done...
[22:43:55] RECOVERY - Host cloudcephosd1025 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[22:47:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwv-builder-03.mediawiki-vagrant.eqiad.wmflabs is about to expire in 26d 23h 58m 34s - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetCertificateAboutToExpire - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:50:19] PROBLEM - Host cloudcephosd1025 is DOWN: PING CRITICAL - Packet loss = 100%