[07:10:01] herron: thanks! I'll have a look
[07:13:24] hi folks!
[07:13:36] the pki* errors are probably due to a change that I merged yesterday
[07:20:06] I am also seeing some errors on a newly reimaged host related to debmonitor ssl certs generation: https://phabricator.wikimedia.org/P18798
[07:20:11] should I create a task about this?
[07:20:59] marostegui: o/ lemme merge a change to see if it was me
[07:21:06] ok :)
[07:28:36] elukey: after your merge, it all went fine with a manual puppet run
[07:29:16] yeah I was about to say, I tested it on cp4034
[07:30:08] about the homer issue, we're hitting this upstream "wontfix" bug since the libs upgrade: https://github.com/paramiko/paramiko/issues/1961#issuecomment-1008119073
[07:30:32] marostegui: let's leave puppet to auto-resolve the warnings, everything seems ok-ish at this point
[07:37:16] sure
[08:00:30] <_joe_> elukey: so you solved the pki thing?
[08:00:39] <_joe_> because it broke all reimages yesterday
[08:01:18] <_joe_> I'm a bit worried we can end up with a PKI in a non-working state for almost a day
[08:01:38] <_joe_> on kubernetes we're relying on our PKI to issue certificates that have a 1-day duration
[08:05:05] <_joe_> I'm also worried that no one intervened after the breakage was noticed yesterday, nor called someone else if they didn't feel confident fixing the problem. All this to say: 1) we're not prohibited from touching someone else's work 2) if someone else's work is broken and you're not confident fixing it because it's non-obvious, you should feel free to page people
[08:07:20] _joe_ yes, the multiroot ca daemon was down, the config change was missing an "expiry" field. I recall that the puppet run on the pki* intermediate ca node ran fine, but I didn't check the logs of the daemons.
[08:08:10] <_joe_> elukey: ok, so my questions are: 1) why don't we page on multirootca being down 2) why did no one think it could be important.
[08:09:17] _joe_ 1) seems good, we were probably waiting to reach critical mass before proceeding, but it seems a good thing to do. For 2) not sure :)
[08:09:59] <_joe_> elukey: yeah :/
[08:50:14] for some reason irc pings are not working, so I missed the pki issue. I just took a quick look and things seem to be working now. I'll look further today at the root cause and make sure alerts like this are paging. However, for now I need to pop out for a dr's appointment, so I will check back in an hour or so
[09:40:58] <_joe_> jbond: yeah np
[09:41:17] <_joe_> I'm more worried about the second part of my comment
[10:38:16] continuing restbase reimages - there might be some flapping during the day about instance-data partitions filling up, but that's somewhat expected; I'll try to manage downtimes
[10:47:41] heads up, I'm starting to move graphite back to eqiad in T299383, starting with reads
[10:47:42] T299383: Move graphite back to eqiad - https://phabricator.wikimedia.org/T299383
[10:47:48] will be done before the backport window
[10:59:09] fyi, the Homer issue has been fixed.
[11:13:31] all done with graphite, standing by to check
[11:22:36] CI is broken?
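(Editor's note on the pki/multirootca breakage discussed earlier this morning: below is a minimal sketch of a cfssl signing config, with illustrative profile names and durations rather than the actual production values. cfssl expects the default signing block, and typically each profile, to carry an "expiry" duration; a config change that drops it can leave the multirootca daemon unable to start or sign, which would match the symptom described at 08:07:20.)

    {
      "signing": {
        "default": {
          "expiry": "8760h",
          "usages": ["signing", "key encipherment", "server auth", "client auth"]
        },
        "profiles": {
          "server": {
            "expiry": "8760h",
            "usages": ["signing", "key encipherment", "server auth"]
          }
        }
      }
    }

(multirootca points each root at a signing config like this via its roots file; the exact Wikimedia layout isn't shown in the log, so treat the snippet purely as a format reference.)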
[11:23:20] all the jenkins jobs for at least puppet fail with this error: https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/38088/console
[11:23:43] error during compilation: Could not find resource 'Labstore::Nfs_mount[project-on-labstore-secondary]' in parameter 'require' (file: /srv/workspace/puppet/modules/puppet_compiler/manifests/init.pp, line: 35) on node 62d850407768.integration.eqiad.wmflabs
[11:25:17] dcaro_pto: ^
[11:25:42] arturo then maybe, if dcaro is on PTO? :)
[11:26:01] XioNoX: jbond is going to fix that error soon
[11:26:10] it's related to a recent PCC change
[11:26:22] ok! as long as people are aware
[11:27:04] for the time being, if that's the only CI error, perhaps overriding jenkins-bot is OK
[11:29:33] ideally wait if it can wait, or at least use PCC (if it works) to reduce the risk of breaking stuff
[12:42:51] pcc seems to be working
[12:43:33] _joe_: BTW, yet another (downstream) timeout for envoy https://gerrit.wikimedia.org/r/c/operations/puppet/+/755338/1
[12:43:44] maybe it's time to come up with that CR gathering the timeouts
[12:43:54] will submit it later today
[13:00:48] wow, 3 hours in a dr's waiting room, a record for me I think. Anyway, back and looking at CI issues now
[13:16:53] XioNoX: arturo: vgutierrez: ebernhardson: fyi I have fixed CI and rebased your changes, let me know if things are still failing
[13:17:08] thanks!
[13:49:40] Hi, in the #wikimedia-analytics channel we noticed that neither systemd nor rsync expand glob patterns like `*`, and so one such command was broken. We're currently looking into fixing it there.
[13:49:48] Grepping through puppet/modules revealed two more usages of that: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/openstack/base/keystone/fernet_keys.pp#39 and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/microsites/os_reports.pp#19
[13:49:55] They might be broken as well?
[13:50:20] I created https://phabricator.wikimedia.org/T299519 and https://phabricator.wikimedia.org/T299520
[13:50:40] Though they are both just stubs
[18:18:16] XioNoX: your change to netops::monitoring might have broken puppet on alert1001
[18:18:27] which affects reimages
[18:18:35] hnowlan: ah?
[18:19:29] I'm seeing "parameter 'ipv6' expects a value of type Undef or String, got Tuple (file: /etc/puppet/modules/netops/manifests/monitoring.pp, line: 177)" which I think relates to the changes to common.yaml from string to list
[18:21:36] weird that it impacts alert1001 though
[18:24:24] (looking)
[18:27:29] I think I see the issue, trying to solve it, otherwise I'll roll back the changes
[18:27:48] thanks!
[18:33:13] hmm, I'll have to roll back, as solving that is above my puppet skills
[18:36:02] ahhh
[18:49:41] hnowlan: thanks to cdanis' help the problem is solved!
[18:50:02] great, thank you both!
[18:50:33] it's good that we made puppet failures much quieter in alerting, but for some hosts it really should still be noisy
[18:50:38] thanks for noticing hnowlan :)
[19:04:28] herron ok for me to merge 'logstash: set logstash-json-tcp monitoring to non-critical' ?
[19:04:50] andrewbogott: yes please do!
[19:05:05] done
[19:05:10] andrewbogott: thx
[21:11:16] _joe_: fyi, aaron and I are now trying to repro and narrow down the apparent bottleneck in php-apcu. If I recall correctly, benchmarks showed that after a certain threshold of traffic, things started to get locked up waiting for apcu to release read or write locks etc.
[21:11:37] in relation to mw-on-k8s benches and tuning etc.
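(Editor's note: a minimal, self-contained PHP sketch of the cache-aside pattern under discussion; the key name and the recomputeValue() helper are illustrative, not MediaWiki code. The point is that every cache miss ends in an apcu_store(), and APCu writes contend for a shared-memory lock, which is consistent with the benchmarks mentioned above showing things locking up once traffic passes a threshold.)

    <?php
    // Requires the apcu extension (apc.enable_cli=1 if run from the CLI).
    // 'example:parser-output' and recomputeValue() are made-up names.

    function recomputeValue(): string {
        return date('c'); // stand-in for expensive work done on a miss
    }

    $key = 'example:parser-output';
    $value = apcu_fetch($key, $hit);
    if (!$hit) {
        $value = recomputeValue();
        // apcu_store() takes APCu's shared-memory write lock; with many
        // workers missing at once, readers can end up waiting on it too.
        apcu_store($key, $value, 300); // 300-second TTL
    }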