[05:35:56] (VarnishTrafficDrop) firing: 66% GET drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=codfw - https://alerts.wikimedia.org
[06:00:56] (VarnishTrafficDrop) resolved: 68% GET drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=codfw - https://alerts.wikimedia.org
[09:02:35] Traffic, SRE, Toolhub, Patch-For-Review: Toolhub API requests with PATCH verbs blocked by CDN - https://phabricator.wikimedia.org/T293157 (ema) @bd808: looks like we're all set! ` $ curl -sv -X PATCH https://phabricator.wikimedia.org/T293157 2>&1 | grep '< HTTP/2' < HTTP/2 200 `
[09:03:24] * volans available for that reimage fwiw :)
[09:04:25] hey!
[09:04:31] yeah let's do it
[09:05:27] wanna do the honors?
[09:05:34] so, last time I encountered the error, I added --debug to the list of puppet args: https://phabricator.wikimedia.org/T201317#4481576
[09:06:15] shall we do that?
[09:07:30] how to do that with the new reimage cookbook? :)
[09:08:26] you would have to live patch spicerack... not ideal, do we really need that or can we see if it goes or gets stuck and then debug only if it gets stuck?
[09:08:40] but it's surely an option (live patch)
[09:09:53] yeah given that the process takes a while I'd rather try with debug enabled from the get go
[09:10:05] let me just check one thing
[09:10:07] that might be an issue
[09:10:37] ack
[09:12:16] so my question is... will the report get back to the puppetmaster or is the puppet run completely stuck? because the puppet output will not be on the console (too spammy) and if it doesn't make it to puppetmaster -> puppetdb -> puppetboard we'll not be able to see it
[09:12:39] I can live patch that too to see the puppet output on the console
[09:13:19] back in 2018 I live patched wmf_auto_reimage_lib and I did get the output somewhere
[09:13:57] I don't remember where it was though :)
[09:15:06] it was in a file but that was thanks to a hack, now we're telling cumin to not print it
[09:15:14] and I can flip that switch
[09:15:24] let me do that on cumin2002 (less people using it)
[09:15:28] ack
[09:16:32] ok live patch present on cumin2002
[09:16:49] make sure to have a tmux/screen with infinite scroll though
[09:17:50] alright, depooling is done automatically right?
[09:18:03] if you pass the conftool option yes
[09:18:05] like the old one
[09:18:17] excellent
[09:18:22] https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#Reimage_of_a_generic_host_behind_LVS
[09:18:29] uses inactive by default
[09:18:39] if you have to pass 'no' instead there is the related option
[09:19:53] sudo cookbook sre.hosts.reimage --os buster --conftool -t T201317 cp4021.ulsfo.wmnet
[09:19:53] T201317: wmf-auto-reimage: 'execution expired' on first puppet run - https://phabricator.wikimedia.org/T201317
[09:19:59] does that look good?
[09:20:07] vgutierrez: o/ do you have 10 mins this afternoon to sanity check https://gerrit.wikimedia.org/r/c/operations/puppet/+/719239 ? If you are around later on I'll also apply the changes to the eqiad pybals
[09:20:37] ema: hostname, not FQDN, apart from that LGTM
[09:20:46] elukey: between 16:00 and 18:00 CEST works for you?
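For context on the conftool handling discussed above: the cookbook's --conftool flag takes care of depooling and repooling, but a rough manual equivalent is sketched below. The confctl selector syntax is an assumption from the Conftool documentation, not taken from this log, so verify it before running.

```shell
# Rough manual equivalent of what --conftool does around a reimage
# (selector syntax is an assumption; check the Conftool docs):
sudo confctl select 'name=cp4021.ulsfo.wmnet' get                  # inspect current state
sudo confctl select 'name=cp4021.ulsfo.wmnet' set/pooled=inactive  # depool before the reimage
sudo confctl select 'name=cp4021.ulsfo.wmnet' set/pooled=yes       # repool once the host looks healthy
```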
[09:20:54] * volans note to self: add some logic to the reimage to better tell the user that
[09:22:08] volans: ack, doing!
[09:22:41] ack, following along
[09:24:53] Traffic, SRE, User-ema: Create runbook for VarnishTrafficDrop alert, change dashboard link - https://phabricator.wikimedia.org/T292820 (ema) Open→Resolved a:ema Runbook and updated dashboard link are shown correctly. Closing. ` 06:00 < jinxer-wm> (VarnishTrafficDrop) resolved: 68% GET dr...
[09:26:23] vgutierrez: yes perfect thanks!
[09:27:47] volans: potential improvements: (1) accept FQDN too (2) pass --debug to puppet
[09:28:41] ideally the latter should also do whatever is necessary in spicerack to get the output back to the terminal
[09:29:12] that will be very spammy though
[09:30:37] I think it's fine, you only want to enable it if things are messed up
[09:32:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp4021:9331 is unreachable - https://alerts.wikimedia.org
[09:33:19] interesting use case.... do you have alerts outside of icinga?
[09:33:20] ah, I suppose that alertmanager alerts don't get downtimed
[09:33:44] eh... not yet; is there any doc on how to do that?
[09:34:20] here are our two AM alerts: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-traffic/traffic.yaml
[09:35:33] volans: the manual procedure is here https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_&_acknowledgements
[09:35:48] not sure how to do that programmatically though
[09:35:56] ack, I'll ask o11y for the equivalent APIs
[09:40:11] ema / vgutierrez any idea why if I make a request like the following: curl https://www.mediawiki.org/static/current/skins/Timeless/resources/print.css -I -H 'X-Wikimedia-Debug: backend=k8s-experimental' it gets translated to http://en.wikipedia.org/static/current/skins/Timeless/resources/print.css in the request to the backend?
[09:40:37] joe: yeah, that's done via VCL to increase hitrate
[09:40:40] If I make the same request directly to the k8s backend, it works
[09:40:56] the assumption is that static files don't differ between Host req header values
[09:41:02] lol
[09:41:11] our docroots would like to have a word
[09:41:36] I mean, it isn't unreasonable
[09:42:03] but what I see in operations/mediawiki-config/docroot doesn't always match that expectation
[09:42:10] ema: where is that in the vcl?
[09:42:13] modules/varnish/templates/text-frontend.inc.vcl.erb:223: if (req.url ~ "^/static/") { set req.http.host = "<%= @vcl_config.fetch("static_host") %>"; }
[09:42:33] yeah so, /static/current *is* indeed common
[09:54:22] ema: still ongoing? I just attached to your tee file
[09:55:44] I think it might have hit the same issue
[09:56:18] although it doesn't explicitly say closing connection
[10:03:32] not sure if you did anything, seems to have resumed
[10:31:38] hey, nope
[10:31:41] I haven't touched it
[10:32:13] at what point did it seem to get stuck?
[10:32:48] search for ip token set ::10:128:0:121
[10:32:50] it would be useful to prefix the log entries with the time, as in `whatever | ts`
[10:33:06] gotta go for lunch, bbl!
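On the programmatic-silence question above (09:33-09:35): Alertmanager accepts silences over its v2 HTTP API with a plain POST. A minimal sketch follows; the alerts.wikimedia.org endpoint path and any authentication in front of it are assumptions, while the JSON body follows the upstream Alertmanager v2 silences schema.

```shell
# Minimal sketch: silence VarnishPrometheusExporterDown for cp4021 for two hours.
# Endpoint and auth are assumptions; the JSON shape is the upstream v2 API.
curl -s -X POST https://alerts.wikimedia.org/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
        "matchers": [
          {"name": "alertname", "value": "VarnishPrometheusExporterDown", "isRegex": false},
          {"name": "instance", "value": "cp4021:9331", "isRegex": false}
        ],
        "startsAt": "2021-10-13T09:30:00Z",
        "endsAt": "2021-10-13T11:30:00Z",
        "createdBy": "ema",
        "comment": "cp4021 reimage"
      }'
```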
[10:33:11] for me it was stuck a few lines after
[10:33:17] at
[10:33:18] Debug: /Stage[main]/Etcd::Client::Globalconfig/Etcd::Client::Config[/etc/etcd/etcdrc]/File[/etc/etcd/etcdrc]: The container Etcd::Client::Config[/etc/etcd/etcdrc] will propagate my refresh event
[10:43:01] and now it's stuck at installing libvmod-re2 because of debmonitor-client, interesting
[10:44:20] and while debugging it completed (I did attach myself to the PID so maybe that "solved" the issue)
[10:49:50] ema: all log entries have timestamps in the logfile, the puppet output doesn't though, and yes we could pass it through ts
[10:52:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp4021:9331 is unreachable - https://alerts.wikimedia.org
[10:56:23] unrelated: puppet is failing on the first run after the reboot too, varnishmtail can't start, see
[10:56:26] https://puppetboard.wikimedia.org/report/cp4021.ulsfo.wmnet/77fa5503ed43732bf2ee339ba02351a822939061
[11:02:56] only at the third run (1st puppet run, @reboot run, 3rd run) was puppet successful
[11:08:36] ema: reimage cookbook completed, icinga was not optimal so the downtime was not removed, see the final message in the console or the update in the task
[11:09:13] I'm going to get lunch in a few minutes, ttyl
[11:15:55] volans: yeah, we know that the puppetization does not converge at the 1st run and kinda concluded that it's a problem not really worth fixing in the grand scheme of things
[11:16:57] so the interesting thing here is that the reimage worked without external intervention
[11:32:33] trying again, this time with ts
[11:35:53] Traffic, SRE, Toolhub, User-ema: Toolhub API requests with PATCH verbs blocked by CDN - https://phabricator.wikimedia.org/T293157 (ema)
[11:43:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp4021:9331 is unreachable - https://alerts.wikimedia.org
[11:45:42] known ^
[11:48:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp4021:9331 is unreachable - https://alerts.wikimedia.org
[12:07:03] oh, doh
[12:07:21] I don't see any debugging info from puppet, so I assume volans patched spicerack back
[12:19:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp4021:9331 is unreachable - https://alerts.wikimedia.org
[12:45:43] ema: yeah sorry I did undo my patches that were in a stopped vim before logging out
[12:45:56] but I guess we'll get the same info
[12:46:11] more or less
[12:46:30] sure, I wanted to get the timings though :)
[12:46:32] as for ts... do you happen to know if it's available on a base d-i system in all of stretch/buster/bullseye?
[12:47:09] ts is in the moreutils package, which I doubt is in the base system
[12:47:30] yep, I meant we'll see that the first big time gap should be around the ip set and we can check the previous run to see
[12:47:37] yeah it needs to be specifically installed (it's Priority: optional)
[12:48:12] ema: actually https://puppetboard.wikimedia.org/report/cp4021.ulsfo.wmnet/138283b48c83d68707577e71683624b1ca45b779 already has the timestamps
[12:51:35] volans: ah nice
[12:51:59] what would be great is "duration" though
[12:52:24] eh...
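On the "duration" wish right above: moreutils ts can also print incremental timestamps with -i, i.e. the time elapsed since the previous line, which gives per-step durations directly instead of having to diff absolute timestamps afterwards. A small sketch, where the command and the log filename are only placeholders:

```shell
# Prefix every output line with the time spent since the previous line,
# so slow puppet resources / apt calls stand out immediately.
long_running_command 2>&1 | ts -i '%H:%M:%S' | tee /tmp/reimage-cp4021.log

# Then show only the steps that took four minutes or more
# (string comparison works because the field is fixed-width HH:MM:SS):
awk '$1 >= "00:04:00"' /tmp/reimage-cp4021.log
```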
[12:56:35] meeting soon, back in 1h
[12:57:36] I'm extracting them from the puppetboard html and doing a delta, seems pretty easy
[12:59:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp4021:9331 is unreachable - https://alerts.wikimedia.org
[13:04:13] ema: interestingly enough we have 13 "stuck" places and *all* of them take between 240 and 253 seconds, with most of them in the 242-243 seconds range
[13:05:46] putting them in a file and updating the task
[13:09:09] ok, all of them are apt-get update or install, and I bet it is related to debmonitor's client based on what I've seen on the host
[13:11:51] according to the logs the set ip command didn't take any time
[13:23:02] so AFAICT it's the apt-get install of debmonitor-client and all subsequent apt-get update/install
[13:23:50] I've told both mor.itz and jo.hn about this, in case it rings a bell, not sure if it might be related to some recent change (like the cfssl cert)
[13:42:38] ema, jbond: https://phabricator.wikimedia.org/P17470
[13:43:12] I can't find any logs on the debmonitor side, neither apache nor debmonitor itself, so I'm guessing it's a resolution/connection/routing/firewall issue at this point
[13:43:22] out of curiosity have we tried to just reimage to see if it was a transient error or not?
[13:43:23] the timeout seems to be 4 minutes (240s)
[13:43:36] ema is reimaging right now so we can double check
[13:43:54] grabbing the data from the new run as we speak
[13:44:39] oh which host is this data from, i was looking at cp4012
[13:44:57] cp4021
[13:45:13] reimage finished already
[13:45:15] oh ok then im looking at the most recent data i guess.
[13:45:29] from what i can see from syslog and the apt log only debmonitor-client took ~4 mins
[13:49:35] jbond: from the last run
[13:49:56] apt-get update at 12:19:37.206 ended at 12:23:41.422
[13:49:57] and so on
[13:50:01] Traffic, SRE, Toolhub, User-bd808, User-ema: Toolhub API requests with PATCH verbs blocked by CDN - https://phabricator.wikimedia.org/T293157 (bd808) Open→Resolved a:bd808 Confirmed to be working, Thank you for the quick attention @ema.
[13:50:52] with other packages
[13:51:07] (logs are slightly different due to missing the --debug option in puppet on this last one)
[13:51:12] volans: it looks like the first puppet run happened at around 11:51
[13:51:26] yes, started at that time
[13:51:34] 2021-10-13 11:55:00.052000+00:00 2021-10-13 12:52:53.926000+00:00
[13:51:41] so the times above would be the second or third puppet run
[13:51:42] https://puppetboard.wikimedia.org/report/cp4021.ulsfo.wmnet/87716f7f2d133430393a062208c29bee2430828c
[13:52:00] no, same run, I took the logs from puppetboard
[13:52:20] oh i see, the entire run takes an hour
[13:52:49] fyi the exec timeout for apt-get update is 240 seconds
[13:53:48] but if you run it now it takes the usual few secs
[13:54:06] I'm not sure if this might be related to the ip set that was causing the original issue
[13:54:49] https://phabricator.wikimedia.org/T201317#4481576
[13:55:03] i mean it could be, however if there was a general network issue puppet agent wouldn't be able to fetch `sources` from the puppet master
[13:55:06] at this time the reimage didn't need any manual intervention, it does complete on its own
[13:55:35] but apparently we have debmonitor stuck for 4 minutes each time an apt operation is done after it's installed
[13:56:49] ack let me try and recreate on sretest1001
[13:56:59] what timeout/TTL do we have at 4 minutes?
debmonitor-client sets the timeout on requests to 3s IIRC
[13:57:28] jbond: I didn't check this specifically but it might not happen there and be caused by some cp-specific network fine-tuning
[13:57:28] see above:
[13:57:34] fyi the exec timeout for apt-get update is 240 seconds
[13:57:48] on the debian side?
[13:57:53] no, in puppet
[13:57:58] ahhh sorry I thought you meant in the logs
[13:57:59] got it
[13:58:09] so puppet is interrupting it, got it
[13:58:16] i.e. puppet kills it if it runs longer than 240 seconds
[13:58:22] yes
[13:58:34] im not sure what the timeout is on the package resource
[13:59:00] jbond: no need to reimage
[13:59:05] we still have the last reimage run on puppetboard
[13:59:08] I can extract the data
[13:59:12] but the first run has a lot of redundant apt-get updates to deal with dependencies
[13:59:16] https://puppetboard.wikimedia.org/report/sretest1001.eqiad.wmnet/64f17d9417c774c08c60af4d36351e338cc73b90
[13:59:29] volans: i want to kill the puppet agent run halfway through and try to run commands manually
[13:59:29] give me 5
[13:59:50] yes, but it might be lost time if we can't repro there
[14:00:14] ack
[14:01:34] volans: debmonitor client too ~4 minutes on sretest
[14:01:38] *took
[14:01:49] (.. to install)
[14:02:07] there is only one event that took more than 30s
[14:02:12] and took only 125.726
[14:02:30] and was Exec[cretate chained cert /etc/debmonitor/ssl/debmonitor_ _sretest1001_eqiad_wmnet.chain.pem]/returns
[14:02:38] jbond: where did you see that?
[14:02:39] https://phabricator.wikimedia.org/P17471
[14:02:43] the apt log
[14:02:53] oh wait, it's 2 minutes
[14:02:57] vgutierrez: o/ lemme know when you have 10 mins :)
[14:03:00] that's 2 minutes
[14:03:09] the 125s
[14:03:10] elukey: right now
[14:04:26] volans: i guess it's possible that the first submission takes a long time as it has to submit a lot of data. i wonder if it: fails to submit for a genuine timeout; gets killed; recursion;
[14:04:38] vgutierrez: nice :) So I'd like to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/719239 and the next patch, it is a new LVS service called inference.discovery.wmnet (k8s backend) that is only in eqiad atm (we are waiting a little bit to bootstrap the codfw cluster).
[14:04:51] lemme know if the config makes sense
[14:04:55] * vgutierrez checking
[14:04:56] HTTPS, Traffic, SRE, Beta-Cluster-reproducible: The certificate for upload.wikimedia.beta.wmflabs.org expired on October 9, 2021. - https://phabricator.wikimedia.org/T293251 (AlexisJazz)
[14:05:16] volans: i guess we could in theory test that by removing a cp node from debmonitor and then doing the initial submission again?
[14:05:35] if so my understanding is that I'd need to merge the two, run puppet on the lvses and then restart pybal on lvs1016 and lvs1015 (secondary and primary low trafic)
[14:05:38] *traffic
[14:05:53] HTTPS, Traffic, SRE, Beta-Cluster-reproducible: The certificate for upload.wikimedia.beta.wmflabs.org expired on October 9, 2021. - https://phabricator.wikimedia.org/T293251 (Majavah)
[14:05:59] jbond: but how to explain that I don't see access logs on apache?
[14:06:13] elukey: 30443 looks like a TLS-enabled port
[14:06:14] because the POST never completes
[14:06:44] shouldn't it be logged somewhere?
[14:06:48] access/error
[14:06:56] elukey: could it make sense to check that a TLS handhsake can be performed instead of a plain check_tcp?
[14:07:01] *handshake
[14:07:13] volans: i don't think so, although ema/vgutierrez may have a better response
[14:07:42] however i think if puppet kills the connection before the client has completed sending the entire POST, apache won't log anything
[14:07:56] because it hasn't even completed the tcp stream at that point
[14:08:03] I doubt it takes so long
[14:08:16] like i said, easy to test and rule out
[14:08:22] would be really weird but sure, go ahead and test it
[14:08:54] vgutierrez: good point, didn't think about it, yes I think it could be a good alternative (it is indeed a TLS port)
[14:09:44] elukey: so before running puppet on the lvs to let pybal get the new config, the realservers should get the service IP
[14:11:20] vgutierrez: I thought to run it on the kubernetes workers before restarting pybal, isn't it sufficient?
[14:11:26] (asking to understand)
[14:11:34] yes, that's enough
[14:12:18] ack.. I was looking for the right check for pybal, is it check_ssl?
[14:13:38] or check_tcp_ssl
[14:15:22] check_tcp_ssl!inference.discovery.wmnet!30443
[14:15:58] (updated my patch)
[14:16:42] hmm
[14:17:23] check_tcp_ssl seems ok
[14:18:01] I see that the codfw IP is already reserved on netbox, nice <3
[14:23:50] thanksss!! I'll merge the first, run on kubernetes nodes, then the second, then puppet on lvses etc..
[14:24:06] they seem sane by my standards
[14:24:07] 1016 first, check ipvs, then 1015
[14:24:14] super thanks a lot :)
[14:24:16] yep, that's right
[14:24:21] will ping you if anything looks weird
[14:39:24] mmm after the pybal restart I see the new LVS IP but without any backends listed :(
[14:47:29] volans, jbond: I see that cp4021 is back alive and kicking
[14:47:34] can I repool it?
[14:47:51] or do you want to torture it some more?
[14:48:08] ema: I didn't do anything to it, just checked some logs
[14:48:39] I think john is able to partially repro on sretest1001 so I guess for now we could try to debug there
[14:49:08] alright! thanks :)
[14:49:14] yes i think we can use sretest to debug
[14:59:22] vgutierrez: all good! The inference.svc.eqiad.wmnet endpoint works nicely, but the .discovery.wmnet one doesn't, I think I am missing https://gerrit.wikimedia.org/r/c/operations/dns/+/730541
[14:59:59] elukey: indeed, but you also need to move the service to production first
[15:00:17] as in the service.yaml state definition
[15:00:30] vgutierrez: ahhh okok, I'll wait then, no rush for the moment (too many things in flux)
[15:00:34] thanks for the help!
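Since check_tcp_ssl boils down to completing a TLS handshake against the configured port, a by-hand equivalent against one of the realservers is easy to run before restarting pybal. The port and service name are taken from the discussion above; the realserver hostname below is only a placeholder.

```shell
# Verify a TLS handshake succeeds on the NodePort that pybal will probe.
# kubernetes1001.eqiad.wmnet is a placeholder realserver name.
echo | openssl s_client -connect kubernetes1001.eqiad.wmnet:30443 \
    -servername inference.discovery.wmnet 2>/dev/null \
    | grep -E 'Verify return code|subject='
```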
[15:03:44] basically cause discovery services are filtered like this: $discovery_services = wmflib::service::fetch().filter |$n, $svc| { 'discovery' in $svc and $svc['state'] == 'production' }
[15:06:35] makes sense yes, I forgot that I am not done with LVS, but I am happy to work with inference.svc.eqiad.wmnet for the moment
[15:06:56] (the ml-serve stack is not fully finalized so I'd like to avoid paging people for nothing :D)
[15:07:02] ack
[19:03:29] netops, DNS, Infrastructure-Foundations, ops-drmrs: setup drmrs mgmt prefix/range - https://phabricator.wikimedia.org/T293294 (RobH) p:Triage→High
[19:03:43] netops, DNS, Infrastructure-Foundations, ops-drmrs: setup drmrs mgmt prefix/range - https://phabricator.wikimedia.org/T293294 (RobH)
[19:05:24] netops, DNS, Infrastructure-Foundations, ops-drmrs: setup drmrs mgmt & private prefixs - question on switch status - https://phabricator.wikimedia.org/T293294 (RobH)
[19:06:00] netops, DNS, Infrastructure-Foundations, ops-drmrs: setup drmrs mgmt & private prefixs - question on switch status - https://phabricator.wikimedia.org/T293294 (RobH) I chatted with @cmooney about this in IRC and we cannot see if there is a set pattern to which ranges are used for mgmt, private1,...
[19:10:57] netops, DNS, Infrastructure-Foundations, ops-drmrs: setup drmrs mgmt & private prefixs - question on switch status - https://phabricator.wikimedia.org/T293294 (cmooney) Thanks Rob. Yeah not 100% sure what we should allocate. I'm thinking 10.136.0.0/16 for the site seems logical, with 10.136.12...
[19:16:00] netops, DNS, Infrastructure-Foundations, ops-drmrs: setup drmrs mgmt & private prefixs - question on switch status - https://phabricator.wikimedia.org/T293294 (BBlack) CC @MMandere as well once we have a decision on the IP prefixes here!
[23:11:57] (VarnishTrafficDrop) firing: 60% GET drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=eqsin - https://alerts.wikimedia.org
[23:16:57] (VarnishTrafficDrop) resolved: 60% GET drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=eqsin - https://alerts.wikimedia.org
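Related to the inference.discovery.wmnet thread above (the discovery record is only generated once the service's state in service.yaml is 'production'): a quick way to see the difference is to compare the per-datacenter and discovery names with dig. Hostnames are taken from the conversation; whether the discovery name returns NXDOMAIN or an empty answer before the state flip is an assumption.

```shell
# The per-datacenter service name should already resolve;
# the discovery name should stay unresolvable until the service is 'production'.
dig +short inference.svc.eqiad.wmnet
dig +short inference.discovery.wmnet
```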