[00:45:44] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[00:50:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[00:55:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[05:03:57] Mail, Infrastructure-Foundations, SRE: Message sizes exceeding limits after migrating from Exim to Postfix - https://phabricator.wikimedia.org/T383271#10460979 (Aklapper)
[08:41:31] Puppet, Data-Engineering-Radar, SRE: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104#10461173 (fgiunchedi) Open→Invalid Manifest doesn't contain unreachable code anymore ` define udp2log::instance::monitoring( $log_dir...
[11:47:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:57:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:59:49] netops, Infrastructure-Foundations, SRE: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10461757 (cmooney) >>! In T382518#10455949, @VRiley-WMF wrote: > This has been rebooted > > @cmooney would you be able to check this when you have a chance? Thanks for doing th...
[14:31:11] hello, I am spooked by a cr1-eqiad homer diff again! besides the 4 hosts I renamed, there seems to be a whole lot of changes I don't understand, https://www.irccloud.com/pastebin/Y9hUafl3/ (cc topranks)
[14:35:55] kamila_: wait for the authoritative answer but the ! lines should just be moved lines, so the order changed
[14:36:06] TIL
[14:36:17] why is that I don't know we should always generate the config in an ordered way
[14:36:30] so probably some sorted() was missed somewhere
[14:36:38] oh, ok, thanks volans
[14:36:54] volans: sorted() missing on homer side you mean?
[14:37:12] on homer's public or the plugin
[14:37:23] okok
[14:37:24] elukey: I know only because I wrote homer :D https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/homer/+/refs/heads/master/homer/transports/junos.py#235
[14:38:12] makes sense ;d
[14:38:29] :-D
[14:44:08] yeah....
not sure what's happening there
[14:44:27] we don't sort it cos Juniper will sort them however it does and they need to stay that way
[14:44:35] no idea why it thinks they've all been re-ordered however
[14:45:22] but either way the above is safe to proceed with
[14:45:33] okay, thank you topranks!
[14:45:43] there is no diff at all in it in fact, but I assume the paste is missing the last lines
[14:45:46] np!
[14:46:05] (yes, it's only the part that spooked me)
[14:46:20] yeah I can well imagine when you see a huge "diff" like that
[14:50:55] yeah
[14:51:00] topranks: for my own knowledge - a diff like that, even if it is scary at first, is safe since we don't really care about the BGP neighbors ordering (and the config of the single neighbors will stay the same)
[14:51:13] did I get it right?
[14:51:26] yeah, and line starting with an exclamation mark - ! - can be safely ignored in all cases
[14:51:31] *any
[14:51:46] ah ok it is never a concern?
[14:51:56] I thought that maybe somewhere ordering would count etc..
[14:51:58] if there is a change in the ordering of something (like an acl or routing policy), where the order does matter, it will show as a diff with +/- characters instead
[14:52:00] if it is not the case good
[14:52:02] ! is always safe
[14:52:02] Sorry, you are not authorized to perform this
[14:52:09] ! is always safe
[14:52:10] ooh ok
[14:52:14] thanks :)
[14:52:17] ...so should we be even showing them if that's the case?
[14:52:36] kamila_: we need to blame volans probably
[14:53:02] the router returns the diff to us. maybe we could try to filter them out but they can sometimes be useful to indicate where another change is in the overall config
[14:53:22] I would say if it's just the occasional head-scratcher like today it's not a major issue, if we start getting a lot of them we may need to do something
[14:53:43] ok, fair
[14:53:56] but in general we should not get reordering diffs no?
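A minimal sketch of the filtering idea mentioned above, assuming the router diff arrives as plain text in which reorder-only lines start with `!` and real changes start with `+` or `-`. The function name and the sample diff are hypothetical and not part of Homer. Instead of dropping the `!` lines it keeps them in a separate list, since they can still help locate another change in the overall config.

```python
def split_junos_diff(diff: str) -> tuple[list[str], list[str]]:
    """Split a diff into real changes (+/-) and reorder-only lines (!)."""
    changes: list[str] = []
    reordered: list[str] = []
    for line in diff.splitlines():
        if line.startswith("!"):
            # Reorder marker: the statement moved, its content is unchanged.
            reordered.append(line)
        elif line.startswith(("+", "-")):
            # Actual addition or removal that needs review.
            changes.append(line)
        # Anything else (context lines, [edit ...] headers) is ignored here.
    return changes, reordered


# Hypothetical sample, NOT taken from the paste linked in the log.
SAMPLE_DIFF = """\
[edit protocols bgp group example]
!     neighbor 192.0.2.10 { ... }
!     neighbor 192.0.2.11 { ... }
+     neighbor 192.0.2.12 {
+         description new-host;
+     }
-     neighbor 192.0.2.13;
"""

changes, reordered = split_junos_diff(SAMPLE_DIFF)
print(f"{len(changes)} changed lines, {len(reordered)} reordered lines")
```

A reviewer could then read `changes` first and only glance at `reordered` when the count looks surprising.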
[14:54:11] no
[14:54:26] sometimes I see them and they can be expected
[14:54:38] if say something is added manually during a maintenance or fault
[14:54:49] and then the automation pushes the same config later but it's in a different place
[14:55:18] but that's the only case I'd expect to see it
[14:55:30] SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, serviceops, Sustainability (Incident Followup): Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677#10462482 (elukey) a:elukey→None
[14:56:02] we could also experiment with sorting things in Homer. I know some elements the router will decide what order to insert them in the running config, others are left in the order they are pushed by homer
[14:57:43] we do use a lot of "| sort()" in the jinja files
[14:57:50] hence my maybe we're missing one
[14:59:00] indeed yeah
[14:59:22] but seems to be there for the k8s hosts that changed in this case
[14:59:24] includes/bgp/k8s.conf: {% for hostname, ips in bgp_neighbors.items() | sort() %}
[14:59:56] although I wonder, is that sorting based on hostname?
[15:00:48] perhaps sorting on IP would be better, is the work happening here renaming hosts but keeping their IPs?
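A rough Python sketch of sorting neighbors by IP instead of hostname, so that a rename which keeps the IP would not move the neighbor in the generated config. It assumes `bgp_neighbors` maps each hostname to a dict keyed by IP version (4 and/or 6) holding address strings; the helper names and the sample data are hypothetical, and this is not how the Homer templates currently work.

```python
import ipaddress


def neighbor_sort_key(item):
    """Sort key for a (hostname, ips) pair: lowest IP version first, then address."""
    _hostname, ips = item
    version = min(ips)  # prefer the IPv4 address when both families exist
    # Compare addresses numerically; as strings "192.0.2.10" sorts before "192.0.2.9".
    return (version, int(ipaddress.ip_address(ips[version])))


def sort_neighbors_by_ip(bgp_neighbors):
    """Return (hostname, ips) pairs ordered by IP address, ignoring the hostname."""
    return sorted(bgp_neighbors.items(), key=neighbor_sort_key)


# Hypothetical data using documentation address ranges.
NEIGHBORS = {
    "kubernetes1002": {4: "192.0.2.10", 6: "2001:db8::10"},
    "wikikube-worker1001": {4: "192.0.2.9", 6: "2001:db8::9"},
}

for hostname, ips in sort_neighbors_by_ip(NEIGHBORS):
    print(hostname, ips[4])
```

In a Jinja template the same ordering would need a custom filter or pre-sorted data passed in, because the built-in `sort` applied to `items()` compares the hostname first.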
[15:00:52] I guess so
[15:02:08] "ips" is also a dict like {4: , 6: ) so sorting directly on that won't work either
[15:04:51] I'll bear in mind with the homer work coming up, there needs to be some refactor of those templates, as currently they embed info (such as the group ASN) which needs to be abstracted so that we can produce the equivalent config for Juniper or Nokia
[15:06:28] k
[15:06:38] yes, quite a few hosts are getting renamed and only renamed with no other network changes right now
[15:07:08] but it's a somewhat one-off thing
[15:07:41] but there still are something on the order of 100s left
[15:07:57] Maybe a bit less, surely many 10s
[15:08:08] Mail, Infrastructure-Foundations, SRE: Message sizes exceeding limits after migrating from Exim to Postfix - https://phabricator.wikimedia.org/T383271#10462544 (DSeyfert_WMF) Thank you @jhathaway - initial tests from end users look good, thank you for your help resolving this!
[18:04:40] moritzm: akosiaris: slyngs: is there any reason why we use profile::prometheus::squid_exporter on the install squids, but we don't on the urldownloader squids?
[18:05:30] SRE-tools, Infrastructure-Foundations, SRE: exception raised for "sre.dns.admin show" - https://phabricator.wikimedia.org/T378039#10463812 (ssingh) Open→Resolved a:ssingh This has now been fixed, thanks to @Volans! ` sukhe@cumin1002:~$ sudo cookbook sre.dns.admin show => CURRENT STAT...
[18:19:10] the same request that the exporter sends to squid on install1003 works fine in a manual test on urldownloader1003
[18:19:16] so I'm going to enable it also
[18:24:28] cdanis: no, I don't think there's any reason, just an oversight
[18:28:19] SRE-tools, Infrastructure-Foundations: Add an ownership field to cookbooks. - https://phabricator.wikimedia.org/T379258#10463938 (Volans) Open→Resolved This feature is now live and cookbook ownership can be clearly seen when listing cookbooks (`cookbook -l` or `cookbook -lv`) and at the botto...
[18:29:54] SRE-tools, Infrastructure-Foundations, Spicerack: Spicerack: allow cookbooks to abort execution from __init__ - https://phabricator.wikimedia.org/T365454#10463940 (Volans) Open→Resolved This is now live, see the related documentation in https://doc.wikimedia.org/spicerack/master/api/spice...
[18:30:32] SRE-tools, Infrastructure-Foundations, Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655#10463945 (Volans) Open→Resolved This is now live, see the related documentation in https://doc.wikimedia.org/spicerack/master/api/spicerack.cookboo...