[08:06:55] Acme-chief, Traffic: Provide second acmechief server configured for Puppet 7 in eqiad - https://phabricator.wikimedia.org/T352242 (MoritzMuehlenhoff)
[08:49:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on ncredir2002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[09:18:14] Traffic: ipip-multiqueue-optimizer should unload eBPF programs on service stop - https://phabricator.wikimedia.org/T352249 (Vgutierrez)
[09:18:43] Traffic: ipip-multiqueue-optimizer should unload eBPF programs on service stop - https://phabricator.wikimedia.org/T352249 (Vgutierrez) p:Triage→Medium
[09:42:35] Traffic, Patch-For-Review: ipip-multiqueue-optimizer should unload eBPF programs on service stop - https://phabricator.wikimedia.org/T352249 (CodeReviewBot) vgutierrez opened https://gitlab.wikimedia.org/repos/sre/ipip-multiqueue-optimizer/-/merge_requests/5 Catch SIGTERM signal
[09:53:59] jbond: ^^ any recent change on ferm for ncredir?
[09:54:28] hmm it's only happening on ncredir2002
[09:55:05] vgutierrez: not to my knowledge
[09:56:06] ferm.service is getting started by puppet on every run apparently
[09:56:08] vgutierrez: it could be something strange with the ferm-status script not knowing how to parse proto=4
[09:56:23] not sure why it only affects one host though
[09:56:33] jbond: then we should be seeing that on the whole ncredir cluster
[09:56:49] worth checking I guess
[09:57:03] yes thats what i would expect but off the top of my head this still seems like the most likely candidate
[09:59:02] ValueError: 172.16.0.0/10 has host bits set
[09:59:03] vgutierrez: i have a feeling the other machines will trigger that alerts shortly
[09:59:04] eh :)
[09:59:19] that should be 172.16.0.0/12
[09:59:54] is that a typo in the puppet config?
[10:00:05] typo in my ferm rule
[10:00:11] ahh cool
[10:02:32] jbond: https://gerrit.wikimedia.org/r/c/operations/puppet/+/978482/
[10:03:39] vgutierrez: +1
[10:04:29] should that ferm rule be in profile::lvs::realserver::ipip instead?
[10:04:59] (PuppetConstantChange) firing: (3) Puppet performing a change on every puppet run on ncredir1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:06:35] taavi: technically yes, but not all realservers are using ferm
[10:07:07] I guess I could wrap it with an if clause checking for ferm classes on the catalog
[10:07:08] ferm::service is a no-op if a particular node has no ferm installed
[10:07:17] can't use ferm::service
[10:07:42] dunno if that applies to ferm::rule as well
[10:08:28] it does. they're all doing an exported resource that then gets imported in the ferm class, instead of adding the file resource directly
[10:09:16] ack, I'll submit another CR to that effect
[10:09:59] (PuppetConstantChange) firing: (4) Puppet performing a change on every puppet run on ncredir1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:14:29] taavi: https://gerrit.wikimedia.org/r/c/operations/puppet/+/978484
[10:16:25] Traffic: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253 (Fabfur)
[10:18:49] jbond: BTW, ferm-status is happy with proto=4
[10:24:59] (PuppetConstantChange) firing: (4) Puppet performing a change on every puppet run on ncredir1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:36:40] Traffic, Patch-For-Review: ipip-multiqueue-optimizer should unload eBPF programs on service stop - https://phabricator.wikimedia.org/T352249 (CodeReviewBot) vgutierrez opened https://gitlab.wikimedia.org/repos/sre/tcp-mss-clamper/-/merge_requests/6 Clean up on SIGTERM
[10:36:50] vgutierrez: great news
[10:39:09] Traffic, Patch-For-Review: ipip-multiqueue-optimizer should unload eBPF programs on service stop - https://phabricator.wikimedia.org/T352249 (CodeReviewBot) vgutierrez merged https://gitlab.wikimedia.org/repos/sre/ipip-multiqueue-optimizer/-/merge_requests/5 Catch SIGTERM signal
[10:45:45] Traffic, Patch-For-Review: ipip-multiqueue-optimizer should unload eBPF programs on service stop - https://phabricator.wikimedia.org/T352249 (CodeReviewBot) vgutierrez opened https://gitlab.wikimedia.org/repos/sre/ipip-multiqueue-optimizer/-/merge_requests/6 Release 0.3+deb11u1
[10:51:21] Traffic, Patch-For-Review: ipip-multiqueue-optimizer should unload eBPF programs on service stop - https://phabricator.wikimedia.org/T352249 (CodeReviewBot) vgutierrez merged https://gitlab.wikimedia.org/repos/sre/ipip-multiqueue-optimizer/-/merge_requests/6 Release 0.3+deb11u1
[11:06:51] Traffic, Patch-For-Review: ipip-multiqueue-optimizer should unload eBPF programs on service stop - https://phabricator.wikimedia.org/T352249 (Vgutierrez) Open→Resolved `Nov 29 11:04:34 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:34 Attaching IPIP Multiqueue Optimizer on vlan1201...
[11:08:36] Traffic, Patch-For-Review: Consolidate hieradata for new eqiad cp hosts - https://phabricator.wikimedia.org/T352078 (Fabfur) Open→Resolved This is complete
[11:09:29] Traffic, Patch-For-Review: ipip-multiqueue-optimizer should unload eBPF programs on service stop - https://phabricator.wikimedia.org/T352249 (CodeReviewBot) vgutierrez merged https://gitlab.wikimedia.org/repos/sre/tcp-mss-clamper/-/merge_requests/6 Clean up on SIGTERM
[11:13:03] Traffic, Patch-For-Review: ipip-multiqueue-optimizer should unload eBPF programs on service stop - https://phabricator.wikimedia.org/T352249 (CodeReviewBot) vgutierrez opened https://gitlab.wikimedia.org/repos/sre/tcp-mss-clamper/-/merge_requests/7 Release 0.3+deb12u1
[11:14:19] Traffic, Patch-For-Review: ipip-multiqueue-optimizer should unload eBPF programs on service stop - https://phabricator.wikimedia.org/T352249 (CodeReviewBot) vgutierrez merged https://gitlab.wikimedia.org/repos/sre/tcp-mss-clamper/-/merge_requests/7 Release 0.3+deb12u1
[11:21:36] Traffic, DC-Ops, ops-eqiad: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253 (Fabfur)
[12:05:00] (PuppetConstantChange) firing: (3) Puppet performing a change on every puppet run on ncredir1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[12:05:47] Traffic, DC-Ops, SRE, ops-eqiad: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `cp[1075-1090].eqiad.wmnet` - cp1075.eqiad.wmnet (**PASS**) - Downtimed hos...
[12:09:00] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[12:09:42] (SystemdUnitFailed) firing: tcp-mss-clamper.service Failed on ncredir4001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:09:59] (PuppetConstantChange) resolved: (3) Puppet performing a change on every puppet run on ncredir1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[12:10:47] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur) Open→Resolved All activities for this task have been completed, refer to the other linked tasks for more details on decommissioning old hosts
[12:10:51] Traffic, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Fabfur)
[12:19:42] (SystemdUnitFailed) resolved: tcp-mss-clamper.service Failed on ncredir4001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:24:37] Traffic, SRE, Patch-For-Review: RP filtering drops requests incoming via IPIP tunnels on ncredir realservers - https://phabricator.wikimedia.org/T352160 (Vgutierrez) Open→Resolved
[12:24:41] Traffic, SRE, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez)
[12:24:56] Traffic, SRE, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez)
[12:26:18] Traffic, SRE, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez)
[12:34:48] Traffic, SRE, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez)
[14:06:05] godog: something is not happy with /srv/prometheus/ops/target/lvs_realserver_clamper_ulsfo.yaml
[14:06:24] no hosts listed there on prometheus4002
[14:07:04] I'll take a look vgutierrez
[14:07:49] introduced here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/975342/11/modules/profile/manifests/prometheus/ops.pp
[14:09:10] vgutierrez: godog: you're using ::resource_config but profile::lvs::realserver::ipip is a class not a resource type
[14:09:47] yes what taavi said
[14:10:10] hmm so profile::mjolnir::kafka_msearch_daemon_instance must be a define
[14:10:13] rather than a class
[14:10:27] yep
[14:11:56] Traffic: Provide better error pages for HAProxy - https://phabricator.wikimedia.org/T352291 (Fabfur)
[14:24:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/978608 that should fix it
[14:24:19] hmmm and PCC made me realize we never set cluster: ncredir for ncredir hosts :/
[14:27:38] hah ncredir reminds me of this task re: its metrics https://phabricator.wikimedia.org/T351934
[14:27:56] who would be the best person to talk to about it ?
[14:28:19] me
[14:28:26] let's ditch vhost :)
[14:29:20] vgutierrez: haha ok! simple enough, I'll send the patch
[14:29:38] after meetings that is
[14:29:39] hmm setting the cluster to ncredir on hiera makes puppet to fail on prometheus4002
[14:30:02] actually what fails is puppet on ncredir4001
[14:30:03] :_)
[14:30:23] oh right
[15:19:52] Traffic: Create metrics/monitoring of fifo-log-demux - https://phabricator.wikimedia.org/T345939 (BBlack) I went on a different tangent with this problem, and tried to figure out //why// we're having ATS fail writes to the notpurge log pipe in the first place. After some hours of digging around this problem...
[15:33:42] Traffic: Create metrics/monitoring of fifo-log-demux - https://phabricator.wikimedia.org/T345939 (BBlack) Followup: did a 3-minute test of the same pair of parameter changes on cp3066 for a higher-traffic case. No write failures detected via strace in this case (we don't have the error log outputs to go by...
[16:29:46] godog: BTW.. I'm guessing that new checks/alerts should be added to prometheus node_exporter + alertmanager, right?
[16:31:03] vgutierrez: that's correct yes
[16:31:13] oh.. prometheus-sysctl is already there
[16:31:17] that's convenient :)
[16:31:52] sigh.. talked too soon
[16:32:04] that's quite a naive implementation :_)
[16:37:49] Traffic, Observability-Metrics: Label value spam in ncredir_requests_total metric - https://phabricator.wikimedia.org/T351934 (fgiunchedi) Open→Resolved a:fgiunchedi This is done, `job="ncredir"` metrics are now two orders of magnitude less {F41546365}
[16:41:22] godog: hmmm prometheus-sysctl mentions that node exporter doesn't expose some sysctls
[16:41:40] godog: but looking on grafana I don't see any node_sysctl metric at all available
[16:42:03] we need to enable explicitly?
[16:42:13] vgutierrez: in a meeting, will answer later
[16:42:15] ack
[16:57:50] Traffic, Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (Volans) @ssingh what's your timeline to switch to use this new method to get what DNS hosts are pooled? As you know we need to adjust spice...
[17:00:07] Traffic, Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (ssingh) >>! In T347054#9368046, @Volans wrote: > @ssingh what's your timeline to switch to use this new method to get what DNS hosts are po...
[17:14:45] godog: for some context I want to alert on rp_filter being enabled on lvs::realserver::ipip instances
[17:21:23] vgutierrez: ack, yeah the general idea of prometheus-sysctl is sound I'd say, for sure it could use a little configurability
[17:22:28] I can dump sysctl -a | grep -F .rp_filter
[17:22:49] But I'm guessing that I need to add the type comment per sysctl, right?
[17:24:26] yeah that's possible, I can't remember if help/type are compulsory rn
[17:27:23] gotta go
[17:35:42] (SystemdUnitFailed) firing: haproxy.service Failed on cp2029:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:35:45] (HAProxyRestarted) firing: HAProxy server restarted on cp2029:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2029&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[17:35:55] ^ this is cdanis
[17:36:08] so don't worry, known
[17:36:12] ah sorry :)
[17:36:28] cdanis: all good, just for traffic, we have some trauma with this alert :P
[17:36:46] https://phabricator.wikimedia.org/T334448 specifically
[17:39:19] I put in an alertmanager silence for cp2029, I think
[17:51:42] Traffic, Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (ssingh) For awareness that on `dns6001`, we have rolled out setting the ferm rules for authdns-update via the confd-managed file and have t...
[20:41:41] Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (ssingh) Summary of changes today: - List of servers in `authdns-update` is also now managed by confd, for just dns6001. - On all DNS hosts, the `/etc/resolv.con...