[00:09:31] yeah, I heard about that ... we'll see.
[03:06:21] It seems alertmanager is firing alarms that haven't actually gone off in Grafana
[03:06:25] https://grafana.wikimedia.org/d/000000402/resourceloader-alerts?orgId=1
[03:06:42] The "number of minify calls" alert is currently in `pending` state
[03:06:54] it will go to firing if it stays this way for 4 hours
[03:06:58] (using the "for" option)
[03:07:13] I'm guessing the plugin we use to bridge these is confusing the pending state for the firing state?
[03:07:24] I
[03:07:39] I'll file a task, but checking here in case someone's run into this or it's something obvious
[05:12:55] kormat: I've briefly stopped and re-started the purge, this time with a tee to a file you can check when you're up, so that you don't have to wait for me to be back and ack it being done.
[05:13:51] kormat: check /home/krinkle/pc1007-purge-out.txt on mwmaint1002; its last lines should show it dealing with table pc255. when that's the case, it is done and my tmp.php process is no longer running. Should be done in 1-2 hours or so, I estimate
[05:13:54] * Krinkle is now afk
[05:17:33] many are "0 rows", which is fine.
[07:13:26] arturo, andrewbogott - o/ cloudvirt200[1,2] have a tiny /boot partition and it is almost full; it should be easy to fix by dropping old kernels
[08:06:19] Krinkle: looks like it's finished 👍
[10:32:06] T2001 test
[10:32:07] T2001: [DO NOT USE] Documentation is out of date, incomplete (tracking) [superseded by #Documentation] - https://phabricator.wikimedia.org/T2001
[10:39:45] <_joe_> volans: I love the throwback reference
[10:41:53] :)
[10:43:18] There's an alert for HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string 'OK' not found on 'http://checker.tools.wmflabs.org:80/k8s/nodes/ready' - 177 bytes in 0.089 second response time - not sure who the owner is for that
[10:43:27] is that wmcs?
[10:43:43] toolschecker: All k8s worker nodes are healthy
[10:43:49] so it is a bit confusing XD
[11:22:43] <_joe_> marostegui: yes it's wmcs
[11:23:06] Thanks. arturo, dcaro: do you want me to create a task for your team to follow up with?
[11:23:50] I think we are on top of it already. Andrew was upgrading the k8s cluster yesterday, and there is some cleanup left
[11:24:07] ah ok, is it ok to ACK the alert then?
[11:24:13] yeah!
[11:24:16] ok, will do
[11:25:01] thanks!
[12:32:04] ooof, I just filed https://phabricator.wikimedia.org/T283714 and I can reproduce it locally on my bullseye workstation :|
[12:32:30] to rule out that I botched something on the cloud vps, that is
[12:34:19] godog: can confirm that it works for me on a buster VPS
[12:35:08] kormat: sigh! thanks for testing though, the plot thickens
[12:46:33] jbond: hmm. you were doing some stuff with cfssl and puppetdb recently, am i right?
[12:46:58] kormat: yes
[12:47:04] i have a puppetdb in pontoon which is sad, because the cfssl .pem file has zero bytes
[12:47:12] and i'm not sure what to do about that
[12:47:21] ah, one sec, i know what that is
[12:48:13] kormat: which server? may need to fix it manually
[12:48:23] `puppetdb.mariadb104-test.eqiad1.wikimedia.cloud`
[12:49:00] it's always a bit awkward when puppet breaks puppet, meaning you can't use puppet to fix it :)
[12:49:53] yes :)
[12:50:04] should be fixed now, sorry about that
[12:50:18] fyi i fixed it by just removing the puppetdb microsite (which is not needed in cloud)
[12:50:42] might have spoken too soon
[12:51:39] i tried running puppet, then it stopped working. might be related
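(Editor's note on the alertmanager thread at [03:06:21]–[03:07:39] above: the "for" option is what holds a Prometheus alert in `pending` before it transitions to `firing`, and only `firing` alerts should reach Alertmanager. A minimal sketch of such a rule follows; the alert name, metric, and threshold are invented for illustration and are not the actual ResourceLoader rule.)

```yaml
# Hypothetical rule, for illustration only -- the real ResourceLoader alert
# lives in the Wikimedia alerting config and uses different names/thresholds.
groups:
  - name: resourceloader
    rules:
      - alert: ResourceLoaderMinifyCallsHigh
        # expression and threshold are made up for this sketch
        expr: rate(mediawiki_resourceloader_minify_total[5m]) > 100
        # while the expression is true the alert sits in "pending"; it only
        # becomes "firing" (and is sent to Alertmanager) after a full 4h
        for: 4h
        labels:
          severity: warning
        annotations:
          summary: "Number of minify calls is elevated"
```

Any bridge between Prometheus/Grafana and another system has to distinguish those two states, which is what [03:07:13] suspects is going wrong.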
[12:52:59] no, it seems pontoon or this project has the following set somewhere
[12:53:00] profile::puppetdb::microservice::enabled: true
[12:55:46] i don't see anything in horizon
[12:56:16] could be this:
[12:56:17] `hieradata/common/profile/puppetdb/microservice.yaml:profile::puppetdb::microservice::enabled: true`
[12:56:37] ahh yes, pontoon uses the production hiera tree
[12:56:56] i'll make a test change in my branch
[12:57:13] /win 20
[12:57:14] /win 20
[12:57:17] oh come on
[12:57:19] i think i can override it in the pontoon.yaml file
[12:57:21] /lose 40
[12:57:57] kormat: to fix manually: `rm /etc/nginx/sites-enabled/puppetdb-microservice`, then start nginx
[12:59:09] kormat: FWIW the bug is https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=971530
[12:59:38] jbond: looks like this is working. i'll send a pontoon CR.
[12:59:52] kormat: hang fire, just submitted
[12:59:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/695274
[12:59:58] godog: kormat: ^
[13:01:23] jbond: LGTM, though I don't know yet what the puppetdb microsite is?
[13:01:47] jbond: wait
[13:02:07] i made a comment; i should have -1'd, as i see people are quick on the trigger today :)
[13:02:13] yes, i see
[13:02:24] ok :)
[13:03:03] ah yes, thank you kormat
[13:03:36] what's the microsite btw?
[13:04:49] godog: it's essentially a filter for the puppetdb api, to make sure we don't expose any endpoints which have secrets
[13:05:01] kormat: can you recheck
[13:05:37] godog: https://github.com/wikimedia/puppet/blob/production/modules/profile/files/puppetdb/puppetdb-microservice.py
[13:06:20] merged
[13:06:28] jbond: interesting, thank you
[13:07:23] but yeah, certainly there's no reason (other than the certificates, as kormat found out) why the microsite can't be enabled by default in pontoon too
[13:07:33] not a box I want to open now though, just pointing it out
[13:08:43] godog: yes, i agree, i shouldn't have broken the cloud compatibility of that class. however, as it's not really needed and most of cloud doesn't need puppetdb, i figured it was easier just to disable it there.
[13:09:22] however, for pontoon i think it would make sense to add pontoon as a pki client to the pki project
[13:09:25] https://wikitech.wikimedia.org/wiki/PKI/Cloud
[13:09:40] afaik only i have followed ^^ so there could be some holes
[13:10:49] * jbond also needs to create the discovery and debmonitor intermediates in cloud
[13:11:21] jbond: agreed, pontoon should be a pki client, I'll bug/ping you for sure when I attempt that
[13:11:35] ack
[13:15:23] jbond: i have another pontoon issue that's probably in your area :) the idp client profile is trying to connect to acmechief1001, and failing
[13:15:26] is that a new change?
[13:17:50] kormat: not a new change, that's because orchestrator is configured with an acme_chief certificate
[13:19:13] kormat: you probably need to redefine profile::idp::client::httpd::sites: with different values
[13:19:42] (if you just drop the acme_chief_cert param it will not try and fetch anything)
[13:20:45] also, i'm not sure how automatic service registration is in cloud. i think if you create a new proxied site for orchestrator.wmcloud.org it may well DTRT, but may need some tweaks
[13:21:20] also, the ldap DB for the cloud idp instance is limited, and you will need to create an account manually if you want to test
[13:21:42] feel free to ping me for more info on any of this :)
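(Editor's note: a minimal sketch of the override kormat describes at [12:57:19], assuming the Pontoon stack reads a `pontoon.yaml` hiera file layered over the production tree; the file path is an assumption — only the hiera key itself appears in the log above.)

```yaml
# hieradata/pontoon.yaml (path assumed -- adjust to wherever the Pontoon
# stack's hiera overrides actually live).
# Disables the puppetdb microservice so the profile does not pull in the
# nginx vhost and cfssl certificate that break puppetdb in Cloud VPS.
profile::puppetdb::microservice::enabled: false
```

The one-off manual cleanup jbond gives at [12:57:57] (removing the stale nginx vhost and restarting nginx) is still needed on hosts where the broken config was already applied.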
[14:03:47] shdubsh: hi! I'm setting up central logging for the WMCS ceph cluster, and as I saw your email about ECS, I wanted, if possible, to just start using that schema right away. do you have a few min to guide me on what's needed to adapt it?
[14:03:51] *adopt
[14:19:28] dcaro: sure! how can I help? :)
[14:23:00] shdubsh: okok, so I have this patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/695329 that (I think) enables shipping the logs, but it does not do anything to the "format" of the logs, so my guess is that the mapping from whatever that sends to ECS is done somewhere else?
[14:24:28] so the question is, where/how should I configure that? (I was looking around, and elastic seems to point to filebeat integration with ceph, but I think that's not how we do it, given it's using syslog)
[14:24:40] that's right. that patch will configure rsyslog to forward the logs and apply some basic formatting in another config file.
[14:27:46] hmm, it doesn't look like ceph can do structured logging?
[14:28:07] I can investigate, but it might be tricky, yep
[14:29:18] for now unstructured is all I have
[14:30:00] that's ok, it doesn't look like there's much extra info to extract from the logs anyway.
[14:30:51] so, that patch is all that is necessary to get the logs initially flowing. after that, we'll apply another step to get them into the right format.
[14:31:45] nice, I'll get that patch going. can you elaborate on what the next step is? (maybe I can start working on it while getting that patch in)
[14:33:01] once https://gerrit.wikimedia.org/r/c/operations/puppet/+/689160 is merged, the next step should be as simple as adding "ecs_170" to lookup_table_output.json
[14:33:36] * shdubsh is working on testing the patch
[14:34:30] awesome. note that not all the logs are enabled yet on the ceph side, so there might be samples missing (in case you are looking into any specific sample)
[14:49:03] jynus: FYI I tested the change in the reimage script and it seems to work fine
[14:49:07] 14:39:43 | sretest1002.eqiad.wmnet | The host rebooted into the Debian installer
[14:49:10] 14:45:30 | sretest1002.eqiad.wmnet | The host rebooted into the newly installed Operating System
[14:49:19] cool :-)
[14:49:20] (with other steps in the middle ofc)
[14:49:21] thank you!
[16:24:34] hello folks, before I nerd-snipe myself into checking the appservers' codfw latency, is there anybody that has worked on it and/or opened a task? :D
[16:29:27] elukey: a hunch: the latency alerts need to be modified to take into account "high average latency is expected when the number of rps is very low"
[16:32:33] cdanis: hello! But in theory we have only health checks going through codfw; IIUC some of those requests take longer than expected under some circumstances
[16:32:59] never seen that alert firing before (it may have changed recently and I wasn't aware)
[16:38:23] I'm not sure what has changed either, but the codfw appserver latency alert has been flapping for some time now
[16:38:54] it seems a few days from what I saw, has it been ongoing for more?
[16:39:02] that sounds right
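(Editor's note: one possible reading of cdanis's hunch at [16:29:27], sketched as a Prometheus alerting rule that only fires when there is meaningful traffic. The metric names, label selectors, and thresholds are all invented for illustration; the real appserver latency alert is defined elsewhere and may look quite different.)

```yaml
groups:
  - name: appserver-latency
    rules:
      - alert: AppserverLatencyHigh
        # Both the latency threshold and the minimum request rate are
        # placeholder values; the point is the "and on()" guard, which
        # suppresses the alert when the cluster is serving almost no
        # traffic (e.g. codfw receiving only health checks).
        expr: |
          (
            sum(rate(appserver_request_duration_seconds_sum{cluster="api_appserver"}[5m]))
            /
            sum(rate(appserver_request_duration_seconds_count{cluster="api_appserver"}[5m]))
          ) > 0.5
          and on()
          sum(rate(appserver_request_duration_seconds_count{cluster="api_appserver"}[5m])) > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High mean appserver latency with non-trivial traffic"
```

The `and on()` guard drops the alert entirely when the request rate is below the floor, which matches the observation above that codfw only sees health-check traffic.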
[16:44:32] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-30d&orgId=1&to=now&var-cluster=api_appserver&var-datasource=codfw%20prometheus%2Fops&var-method=GET
[16:44:41] something did change in the last week, but also, it seems totally unconcerning
[16:47:35] probably it is; it would be nice to get to the bottom of it and ideally not alert if not needed
[16:50:41] I don't have time to dive into it myself, but I'll open a task
[16:55:52] filed https://phabricator.wikimedia.org/T283744
[17:00:07] <3 will try to follow up tomorrow
[17:35:38] <_joe_> cdanis: i think it's usually related to schema changes
[17:35:41] <_joe_> that slowness
[17:35:51] <_joe_> also
[17:35:54] <_joe_> do
[17:35:56] <_joe_> not
[17:35:58] <_joe_> look
[17:36:00] <_joe_> at
[17:36:02] <_joe_> dashboards
[17:36:04] <_joe_> :D
[17:37:42] <_joe_> (that was for elukey ofc)
[17:38:05] _joe_: sure, but it's causing icinga spam :)
[17:40:43] what do you mean, do not look at dashboards? :D
[17:42:54] elukey: I learned this as "Steve's Maxim" -- any time you look at a dashboard, you'll find something, so unless you're looking for something in particular, *don't look*
[17:44:06] yeah, so much that
[17:45:06] "I have an unknown problem, let me stare at 57 different graphs and find a correlation that will lead me out of the woods" is an anti-pattern. it's like reading tea leaves.
[17:46:02] nono, wait, in this case I just opened the latency graph that the alert points to; one needs to start from something
[17:46:17] I wasn't checking the appservers RED dashboard randomly
[17:46:22] sure
[17:46:36] I mean, the links are useful, because they give you the stats context on what was being alerted on.
[17:46:52] but in the general case, I just meant you have to come at the stats with a hypothesis in mind.
[17:47:55] yes, yes, exactly. I checked the last time on one host and noticed some health checks taking longer than expected, and stopped right there since I didn't have time. I asked the chan before taking another look, to know if anybody had already gone down the rabbit hole, that's it :D
[17:48:03] e.g. "I think, based on intuition and/or evidence, that probably X is going on. If X is going on, I'd probably observe phenomenon Y in graph Z, let's go look and see if it's there." As opposed to "I have no idea what's wrong, but something alerted. I will look at the graphs and try to divine a hypothesis from the stats data."
[17:50:25] sure, makes sense
[17:50:47] we struggled with this for a long time in the traffic team a few years back
[17:51:14] it's hard to avoid doing it when you're short on good hypotheses and long on infinite stats graphs to stare at