[03:11:41] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[04:11:41] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[04:16:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:58:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[08:16:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:39:20] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Upgrade Apereo CAS to version 7.2 - https://phabricator.wikimedia.org/T406455#11310702 (SLyngshede-WMF)
[08:58:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[09:33:31] netops, Infrastructure-Foundations, SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11310845 (fgiunchedi) We have successfully put in service cloudcephosd1050 and cloudcephosd1051 in {T405478} with single-nic, I haven't seen any problem whatsoever with...
[09:49:42] netops, Infrastructure-Foundations, SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11310972 (cmooney) >>! In T399180#11310845, @fgiunchedi wrote: > @taavi @Andrew @cmooney what do you think of the above? The plan sounds good. We need to audit and ma...
[10:05:48] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378 (cmooney) NEW p:Triage→Medium
[10:13:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[10:16:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:34:54] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Upgrade Apereo CAS to version 7.2 - https://phabricator.wikimedia.org/T406455#11311248 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by slyngshede@cumin1003 for hosts: `idp1004.wikimedia.org` - idp1004.wikimedia.org (**PASS**)...
[11:00:08] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Upgrade Apereo CAS to version 7.2 - https://phabricator.wikimedia.org/T406455#11311329 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by slyngshede@cumin1003 for hosts: `idp2004.wikimedia.org` - idp2004.wikimedia.org (**PASS**)...
[11:47:07] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378#11311458 (cmooney) Ok well I fixed the obvious error but the alerts still aren't firing :(
[12:18:02] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Upgrade Apereo CAS to version 7.2 - https://phabricator.wikimedia.org/T406455#11311558 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by slyngshede@cumin1003 for hosts: `idp-test1004.wikimedia.org` - idp-test1004.wikimedia.org (**...
[14:40:04] netops, Infrastructure-Foundations, Documentation: The links under "Test IP fragmentation issues" on `wikitech:Reporting a connectivity issue` no longer appear to work - https://phabricator.wikimedia.org/T407505#11312080 (LSobanski) Open→Resolved a:LSobanski I removed the section as t...
[14:40:45] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407833#11312085 (LSobanski) a:cmooney
[18:29:08] how do you create a VM in the POPs that have routed ganeti (like magru or drmrs)? the makevm cookbook does not recognize these as valid clusters
[18:35:00] mutante: what happens if you pass storage type to plain
[18:35:19] parser.add_argument('--storage_type', choices=STORAGE_TYPES, default='drbd',
[18:35:22] help='the storage type of the VM. One of %(choices)s, default to drbd')
[18:35:34] drbd won't work in routed so per Spicerack's API, try plain instead
[18:35:47] this seems to be confirmed by
[18:35:48] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1151678
[18:35:52] > For the migration of 2x2 clusters to routed Ganeti we need to be able to
[18:35:55] create VMs on an initial one node cluster (where DRBD won't work).
[18:36:04] I am not sure though why the help message doesn't say that but probably because we missed it (that is, if this even works)
[18:37:41] sukhe: thank you! I will try this right after the current run is finished. (just because it doesn't seem a good idea to run multiple at once)
[18:40:23] well, I tested if it starts. but passing --storage_type does not change that it gets "The request failed with code 400 Bad Request: {'group': ['Select a valid choice. drmrs is not one of the available choices.']}"
[18:40:38] mutante: out of curiosity, what's the full comand
[18:40:39] *command
[18:41:06] sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 2 --storage_type plain --disk 20 --cluster drmrs -t T408064 --os trixie tcp-proxy6001
[18:41:15] T408064: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064
[18:41:52] mutante: we have to specify the group as well?
[18:42:16] not in POPs that have only one group
[18:42:28] unlike eqiad/codfw where I do pick one of the groups
[18:42:29] ah true, drmrs
[18:42:39] so weird why this fails
[18:42:52] i was looking for something like "list all the clusters"
[18:42:52] "select a valid choice"
[18:43:14] mutante: er wait no
[18:43:17] ah, it links to https://netbox.wikimedia.org/virtualization/cluster-groups/
[18:43:25] drmrs should have B12andB13
[18:43:30] drmrs01 !
[18:43:31] that's how I did durum and doh
[18:43:36] yeah that yep
[18:43:39] esams01 :p
[18:44:07] ok, just needed to see the netbox link. so it seems with routed ganeti the "group" is part of the actual cluster name
[18:44:42] now I wonder if I should also use "codfw02" and routed ganeti for the VM in codfw
[18:44:52] good question, not sure
[18:45:07] I can just try it out. wanted to debug more anyways
[18:45:10] thanks sukhe
[18:45:54] mutante: sorry not being helpful but yeah, Moritz can answer that one
[18:46:27] still was!:) like rubberduck debugging
[19:18:32] DRBD works fine in routed Ganeti; this was only a limitation when the new clusters were being set up and only had a single node
[19:18:55] ah!
[19:19:07] adds up to what you told me ok
[20:06:59] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11313615 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host sretest1006.eqiad.wm...
[20:07:28] mutante: yep thanks!
[20:19:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[20:53:29] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11313740 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet...
[21:30:47] Puppet, Infrastructure-Foundations, SRE: Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#11313878 (Krinkle)
[21:33:27] Puppet, SRE: Plan Puppet 5 upgrade - https://phabricator.wikimedia.org/T184564#11313893 (Krinkle)
[21:42:55] Puppet, Infrastructure-Foundations, SRE: Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#11314006 (Krinkle)
[22:17:00] netops, Infrastructure-Foundations, SRE: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11314270 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1d57495a-c8c9-4142-bb4a-68c98114d4d1) set by cmooney@cumin1003 for 3 d...
[22:29:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag