[07:29:40] SRE-tools, Infrastructure-Foundations, Orchestrator: Add database host removal from Orchestrator to sre.hosts.decommission cookbook - https://phabricator.wikimedia.org/T287954 (Marostegui) Open→Declined I think we can probably decline this. Orchestrator removes the host itself after 14 days,...
[09:04:19] thx for fixing the cumin aliases moritz
[09:05:02] what's up with aqs1013?
[09:06:22] broken disk
[09:06:35] I'm wondering why no task was auto-created for it?
[09:07:05] https://phabricator.wikimedia.org/T352344 ?
[09:07:06] and these alerts should really only go towards the owners and maybe -operations
[09:07:27] if a task is auto-created it shouldn't alert on irc
[09:07:32] ah, right, it was already closed then
[09:07:42] few things here
[09:07:59] 1) this is a new alert on alertmanager and not icinga, we currently have both IIRC
[09:08:03] md recovery is ongoing for it
[09:08:25] for md1/md2, md0 already completed
[09:08:27] see https://phabricator.wikimedia.org/T350694#9358223
[09:09:37] it has been alerting since before the weekend too
[09:10:47] volans: eh, I guess you said it all on that comment :)
[09:11:02] :D
[09:24:49] Maybe we don't actually need to alert on md failures, when a task is created anyway. We still need to work out what to do with the scripts that create the tasks, if the plan is to be able to remove Icinga at some point
[09:25:39] I didn't see any auto-generated task for this one and I'm not sure if anyone had a look at why
[09:27:20] I'll put it on the list :-)
[09:29:14] netbox, Infrastructure-Foundations: Graph reports status in Prometheus - https://phabricator.wikimedia.org/T262898 (joanna_borun) Open→Declined
[09:34:03] SRE-tools, Dumps-Generation, Infrastructure-Foundations, serviceops, and 2 others: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (Volans) Another datapoint for the mw*/parse* clusters, they will be migrated to be k8s hosts, that are su...
[09:47:52] SRE-tools, Infrastructure-Foundations: spicerack.dnsdisc.Discovery should expose TTL - https://phabricator.wikimedia.org/T259875 (JMeybohm) I don't exactly recall tbh. but I would imagine I wanted something like this in one of the pool/depool/service-route cookbooks to store the TTL, lower it, change wha...
[09:48:44] SRE-tools, Infrastructure-Foundations: spicerack.dnsdisc.Discovery should expose TTL - https://phabricator.wikimedia.org/T259875 (Volans) Ack, thanks for the info
[10:12:58] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: BGP status (instance cr2-eqdfw) - https://phabricator.wikimedia.org/T351083 (ayounsi) Open→Resolved Deleted.
[12:10:44] netops, Infrastructure-Foundations, SRE, Traffic, ops-codfw: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (cmooney)
[12:11:05] netops, Infrastructure-Foundations, SRE, Traffic, ops-codfw: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (cmooney) p:Triage→Medium
[12:54:20] FYI the Lumen link between esams and eqiad is down, that's why we're getting the alert about overusage on the backup GTT link https://librenms.wikimedia.org/bill/bill_id=17/
[13:03:53] :(
[13:09:04] (SystemdUnitFailed) firing: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:14:04] (SystemdUnitFailed) resolved: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:16:40] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[13:32:21] Netbox v3.7-beta1 is out
[13:32:34] netops, Infrastructure-Foundations, SRE, Traffic, ops-codfw: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (cmooney)
[13:35:26] the good thing is that we're not impacted by breaking changes of features released after 3.3 :)
[13:37:48] so many great features
[13:38:22] "A new PROTECTION_RULES configuration parameter is now available. [...] This enables an administrator to prevent, for example, the deletion of a site which has a status of "active.""
[13:38:50] that would save us from deleting a prod device by accident
[13:46:24] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.7.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[13:46:31] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.7.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[13:50:07] (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:05] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:36:46] (NTPNoSynced) firing: NTP not synced - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.monitoring.wmflabs.org/?q=alertname%3DNTPNoSynced
[14:40:26] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[14:50:08] netops, Infrastructure-Foundations, SRE, Traffic, and 2 others: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (cmooney)
[15:54:12] netops, Infrastructure-Foundations, SRE, Traffic, ops-codfw: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 (cmooney)
[16:08:44] netops, Infrastructure-Foundations, SRE, Traffic, and 2 others: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 (cmooney)
[18:36:46] (NTPNoSynced) firing: NTP not synced - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.monitoring.wmflabs.org/?q=alertname%3DNTPNoSynced
[22:36:46] (NTPNoSynced) firing: NTP not synced - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.monitoring.wmflabs.org/?q=alertname%3DNTPNoSynced
[23:49:05] (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed