[07:10:01] herron: thanks! I'll have a look
[07:13:24] hi folks!
[07:13:36] the pki* errors are probably due to a change that I merged yesterday
[07:20:06] I am also seeing some errors on a newly reimaged host related to debmonitor ssl certs generation: https://phabricator.wikimedia.org/P18798
[07:20:11] should I create a task about this?
[07:20:59] marostegui: o/ lemme merge a change to see if it was me
[07:21:06] ok :)
[07:28:36] elukey: after your merge, it all went fine with a manual puppet run
[07:29:16] yeah I was about to say, I tested it on cp4034
[07:30:08] about the homer issue, we're hitting this upstream "wontfix" bug since the libs upgrade: https://github.com/paramiko/paramiko/issues/1961#issuecomment-1008119073
[07:30:32] marostegui: let's leave puppet to auto-resolve the warnings, everything seems ok-ish at this point
[07:37:16] sure
[08:00:30] <_joe_> elukey: so you solved the pki thing?
[08:00:39] <_joe_> because it broke all reimages yesterday
[08:01:18] <_joe_> I'm a bit worried we can end up with a PKI in a non-working state for almost a day
[08:01:38] <_joe_> on kubernetes we're relying on our PKI to issue certificates that have a 1-day duration
[08:05:05] <_joe_> I'm also worried that no one intervened after the breakage was noticed yesterday, nor called someone else if they didn't feel confident fixing the problem. All this to say: 1) we're not prohibited from touching someone else's work 2) if someone else's work is broken and you're not confident fixing it because it's non-obvious, you should feel free to page people
[08:07:20] _joe_ yes, the multiroot ca daemon was down, the config change was missing an "expiry" field. I recall that the puppet run on the pki* intermediate ca node ran fine, but I didn't check the logs of the daemons.
[08:08:10] <_joe_> elukey: ok, so my questions are: 1) why don't we page on multirootca being down 2) why did no one think it could be important.
[08:09:17] _joe_ 1) seems good, we were probably waiting to reach critical mass before proceeding, but it seems a good thing to do. For 2) not sure :)
[08:09:59] <_joe_> elukey: yeah :/
[08:50:14] for some reason irc pings are not working, so I missed the pki issue. I just took a quick look and things seem to be working now. I'll look further today at the root cause and make sure alerts like this are paging. However, for now I need to pop out for a dr's appointment, so I will check back in an hour or so
[09:40:58] <_joe_> jbond: yeah np
[09:41:17] <_joe_> I'm more worried about the second part of my comment
[10:38:16] continuing restbase reimages - there might be some flapping during the day about instance-data partitions filling up, but that's somewhat expected; I'll try to manage downtimes
[10:47:41] heads up, I'm starting to move graphite back to eqiad in T299383, starting with reads
[10:47:42] T299383: Move graphite back to eqiad - https://phabricator.wikimedia.org/T299383
[10:47:48] will be done before the backport window
[10:59:09] fyi, the Homer issue has been fixed.
[11:13:31] all done with graphite, standing by to check
[11:22:36] CI is broken?
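(Editor's note on the pki/multirootca breakage discussed earlier this morning: below is a minimal sketch of a cfssl signing config, with illustrative profile names and durations rather than the actual production values. cfssl expects the default signing block, and typically each profile, to carry an "expiry" duration; a config change that drops it can leave the multirootca daemon unable to start or sign, which would match the symptom described at 08:07:20.)

    {
      "signing": {
        "default": {
          "expiry": "8760h",
          "usages": ["signing", "key encipherment", "server auth", "client auth"]
        },
        "profiles": {
          "server": {
            "expiry": "8760h",
            "usages": ["signing", "key encipherment", "server auth"]
          }
        }
      }
    }

(multirootca points each root at a signing config like this via its roots file; the exact Wikimedia layout isn't shown in the log, so treat the snippet purely as a format reference.)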
[11:23:20] all the jenkins jobs for at least puppet fail with this error: https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/38088/console
[11:23:43] error during compilation: Could not find resource 'Labstore::Nfs_mount[project-on-labstore-secondary]' in parameter 'require' (file: /srv/workspace/puppet/modules/puppet_compiler/manifests/init.pp, line: 35) on node 62d850407768.integration.eqiad.wmflabs
[11:25:17] dcaro_pto: ^
[11:25:42] arturo then maybe, if dcaro is on PTO? :)
[11:26:01] XioNoX: jbond is going to fix that error soon
[11:26:10] it's related to a recent PCC change
[11:26:22] ok! as long as people are aware
[11:27:04] for the time being, if that's the only CI error, perhaps overriding jenkins-bot is OK
[11:29:33] ideally wait if it can wait, or at least use PCC (if it works) to reduce the risk of breaking stuff
[12:42:51] pcc seems to be working
[12:43:33] _joe_: BTW, yet another (downstream) timeout for envoy https://gerrit.wikimedia.org/r/c/operations/puppet/+/755338/1
[12:43:44] maybe it's time to come up with that CR gathering the timeouts
[12:43:54] will submit it later today
[13:00:48] wow, 3 hours in a dr's waiting room, a record for me I think. Anyway, back and looking at CI issues now
[13:16:53] XioNoX: arturo: vgutierrez: ebernhardson: fyi I have fixed CI and rebased your changes, let me know if things are still failing
[13:17:08] thanks!
[13:49:40] Hi, in the #wikimedia-analytics channel we noticed that neither systemd nor rsync expand glob patterns like `*`, and so one such command was broken. We're currently looking into fixing it there.
[13:49:48] Grepping through puppet/modules revealed two more usages of that: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/openstack/base/keystone/fernet_keys.pp#39 and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/microsites/os_reports.pp#19
[13:49:55] They might be broken as well?
[13:50:20] I created https://phabricator.wikimedia.org/T299519 and https://phabricator.wikimedia.org/T299520
[13:50:40] Though they are both just stubs
[18:18:16] XioNoX: your change to netops::monitoring might have broken puppet on alert1001
[18:18:27] which affects reimages
[18:18:35] hnowlan: ah?
[18:19:29] I'm seeing "parameter 'ipv6' expects a value of type Undef or String, got Tuple (file: /etc/puppet/modules/netops/manifests/monitoring.pp, line: 177)" which I think relates to the changes to common.yaml from string to list
[18:21:36] weird that it impacts alert1001 though
[18:24:24] (looking)
[18:27:29] I think I see the issue, trying to solve it, otherwise I'll roll back the changes
[18:27:48] thanks!
[18:33:13] hmm, I'll have to roll back, as solving that is above my puppet skills
[18:36:02] ahhh
[18:49:41] hnowlan: thanks to cdanis' help the problem is solved!
[18:50:02] great, thank you both!
[18:50:33] it's good that we made puppet failures much quieter in alerting, but for some hosts it really should still be noisy
[18:50:38] thanks for noticing hnowlan :)
[19:04:28] herron ok for me to merge 'logstash: set logstash-json-tcp monitoring to non-critical' ?
[19:04:50] andrewbogott: yes please do!
[19:05:05] done
[19:05:10] andrewbogott: thx
[21:11:16] _joe_: fyi, aaron and I are now trying to repro and narrow down the apparent bottleneck in php-apcu. If I recall correctly, benchmarks showed that after a certain threshold of traffic, things started to get locked up waiting for apcu to release read or write locks etc.
[21:11:37] in relation to mw-on-k8s benches and tuning etc.
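(Editor's note: a minimal, self-contained PHP sketch of the cache-aside pattern under discussion; the key name and the recomputeValue() helper are illustrative, not MediaWiki code. The point is that every cache miss ends in an apcu_store(), and APCu writes contend for a shared-memory lock, which is consistent with the benchmarks mentioned above showing things locking up once traffic passes a threshold.)

    <?php
    // Requires the apcu extension (apc.enable_cli=1 if run from the CLI).
    // 'example:parser-output' and recomputeValue() are made-up names.

    function recomputeValue(): string {
        return date('c'); // stand-in for expensive work done on a miss
    }

    $key = 'example:parser-output';
    $value = apcu_fetch($key, $hit);
    if (!$hit) {
        $value = recomputeValue();
        // apcu_store() takes APCu's shared-memory write lock; with many
        // workers missing at once, readers can end up waiting on it too.
        apcu_store($key, $value, 300); // 300-second TTL
    }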