[01:13:00] (HAProxyRestarted) firing: HAProxy server restarted on cp3054:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3054&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[05:13:00] (HAProxyRestarted) firing: HAProxy server restarted on cp3054:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3054&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[07:52:45] (HAProxyRestarted) resolved: HAProxy server restarted on cp3054:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3054&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[09:53:45] (HAProxyRestarted) firing: HAProxy server restarted on cp2033:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2033&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[09:59:57] (that was me with a kill -11)
[10:03:45] (HAProxyRestarted) resolved: HAProxy server restarted on cp2033:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2033&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[10:27:45] (HAProxyRestarted) firing: HAProxy server restarted on cp2033:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2033&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[10:28:19] (again my doing)
[10:29:29] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (Volans)
[10:32:45] (HAProxyRestarted) resolved: HAProxy server restarted on cp2033:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2033&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[10:45:22] Traffic, netops, DBA, Data-Engineering, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (BTullis)
[11:11:25] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[11:16:01] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[11:53:22] Traffic, Commons, SRE: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (Ladsgroup)
[13:51:45] (HAProxyRestarted) firing: HAProxy server restarted on cp2033:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2033&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[13:51:51] (me again)
[13:52:33] sukhe, moritzm even with https://gerrit.wikimedia.org/r/c/operations/puppet/+/909209 merged, a kill -11 doesn't trigger a core for haproxy on /var/tmp/core :/
[13:53:19] vgutierrez@cp2033:/var/tmp/core$ sudo -i systemctl show haproxy.service -p LimitCORE
[13:53:19] LimitCORE=infinity
[13:53:19] vgutierrez@cp2033:/var/tmp/core$ sudo -i systemctl show haproxy.service -p LimitCORESoft
[13:53:19] LimitCORESoft=infinity
[13:53:19] vgutierrez@cp2033:/var/tmp/core$ sudo -i systemctl show haproxy.service -p ReadWritePaths
[13:53:20] ReadWritePaths=/run/haproxy /var/lib/haproxy /var/cache/ocsp /var/tmp/core
[13:53:20] vgutierrez@cp2033:/var/tmp/core$ sudo -i systemctl show haproxy.service -p PrivateTmp
[13:53:20] PrivateTmp=no
[13:54:18] kernel.core_pattern = /var/tmp/core/core.%h.%e.%p.%t
[13:54:23] looks OK too
[13:54:34] yep
[13:55:03] http://docs.haproxy.org/2.6/configuration.html#set-dumpable
[13:55:19] might be /proc/sys/fs/suid_dumpable
[13:55:49] is currently set to 0
[13:59:13] setting it to 2 doesn't make a difference though
[13:59:37] LimitFSize is already infinity...
[13:59:46] so I'm running out of ideas
[13:59:57] strange, how about we test systemd-coredump on a handful of hosts? it should at least give us better control/visibility. lots of development happened between Stretch->Bullseye and https://phabricator.wikimedia.org/T236253 might be moot these days
[14:00:37] yeah we had that patch on Friday but abandoned it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/908934/
[14:00:50] we thought we don't need it as we already set kernel.core_pattern
[14:01:14] vgutierrez: should we add "set-dumpable" to the global options in haproxy.cfg? or is that not required?
[14:01:41] on an existing ticket they seem to be setting that explicitly, not sure if it's even related to ulimits (we are good on those anyway) https://github.com/haproxy/haproxy/issues/669
[14:02:04] it can't hurt
[14:02:10] let me test it on cp2033 (currently depooled)
[14:03:41] vgutierrez@cp2033:/var/tmp/core$ sudo ls
[14:03:41] core.cp2033.haproxy.2742121.1681740208
[14:03:52] set-dumpable did the trick
[14:04:22] oh great then
[14:04:58] I'll add a CR with that as well
[14:09:22] lot of ways to stop a coredump from happening :)
[14:12:50] sukhe: https://gerrit.wikimedia.org/r/c/operations/puppet/+/909287
[14:13:12] template gets rendered on ::haproxy context, so we cannot use enable_coredumps there
[14:23:16] validated in cp3054
[14:23:17] -rw------- 1 haproxy haproxy 417M Apr 17 14:22 core.cp3054.haproxy.2178432.1681741376
[14:23:30] we get a core dump after running puppet and restarting haproxy
[14:23:37] (and triggering it with a kill -11 of course)
[14:26:44] (VarnishHighThreadCount) firing: (3) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:26:45] (HAProxyRestarted) firing: (2) HAProxy server restarted on cp2033:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[14:31:44] (VarnishHighThreadCount) firing: (11) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:36:44] (VarnishHighThreadCount) firing: (11) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:36:45] (HAProxyRestarted) resolved: (2) HAProxy server restarted on cp2033:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[14:51:44] (VarnishHighThreadCount) firing: (10) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:16:44] (VarnishHighThreadCount) firing: (16) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:36:44] (VarnishHighThreadCount) resolved: (8) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
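
Note on the core dump debugging between 13:52 and 14:03: the settings inspected there (the systemd limits, kernel.core_pattern, fs.suid_dumpable, and finally HAProxy's set-dumpable) can be gathered in one pass. What follows is a minimal read-only sketch, not the tooling used in the log. The helper names `core_pattern_dir` and `describe_suid_dumpable`, the `UNIT` and `HAPROXY_CFG` variables, and the default `/etc/haproxy/haproxy.cfg` path are all assumptions made for illustration.

```shell
#!/bin/sh
# Read-only sketch: collect the settings checked in the log when a daemon
# leaves no core file. Nothing here changes system state.
# Assumptions: UNIT and HAPROXY_CFG are hypothetical overrides; the config
# path below is a guess at a common default, not taken from the log.
UNIT="${UNIT:-haproxy.service}"
HAPROXY_CFG="${HAPROXY_CFG:-/etc/haproxy/haproxy.cfg}"

# Map a fs.suid_dumpable value to its kernel-documented meaning.
describe_suid_dumpable() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "debug" ;;
        2) echo "suidsafe" ;;
        *) echo "unknown" ;;
    esac
}

# Print where a core_pattern writes: "pipe" for a helper program,
# "." for a bare file name (relative to the process cwd), else the directory.
core_pattern_dir() {
    case "$1" in
        "|"*) echo "pipe" ;;
        */*)  dir="${1%/*}"; echo "${dir:-/}" ;;
        *)    echo "." ;;
    esac
}

if [ -r /proc/sys/kernel/core_pattern ]; then
    pattern=$(cat /proc/sys/kernel/core_pattern)
    echo "core_pattern: $pattern (target: $(core_pattern_dir "$pattern"))"
fi

if [ -r /proc/sys/fs/suid_dumpable ]; then
    value=$(cat /proc/sys/fs/suid_dumpable)
    echo "suid_dumpable: $value ($(describe_suid_dumpable "$value"))"
fi

# Same unit properties that were queried by hand in the log.
if command -v systemctl >/dev/null 2>&1; then
    for prop in LimitCORE LimitCORESoft LimitFSIZE ReadWritePaths PrivateTmp; do
        systemctl show "$UNIT" -p "$prop" 2>/dev/null || true
    done
fi

# The setting that made the difference in the log.
if [ -r "$HAPROXY_CFG" ]; then
    if grep -Eq '^[[:space:]]*set-dumpable' "$HAPROXY_CFG"; then
        echo "set-dumpable: present in $HAPROXY_CFG"
    else
        echo "set-dumpable: not found in $HAPROXY_CFG"
    fi
fi
```

In the log every kernel and systemd setting already looked correct, and only adding set-dumpable to the HAProxy global section produced a core file, so the last check is the one that mattered there. A likely reason, stated as an assumption rather than something confirmed in the log, is that the kernel clears a process's dumpable flag when it changes UID/GID, and set-dumpable asks HAProxy to re-enable it.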