[06:07:23] <_joe_> CRITICAL: 1 puppet certs need to be renewed:
[06:07:25] <_joe_> crit: dbstore2001.codfw.wmnet
[06:07:52] <_joe_> can you take care of that? ^^
[06:12:29] _joe_: jbond mentioned it yesterday and was going to look. That server hasn't existed for a long time.
[06:12:56] About 14:20 UK time in here
[06:13:06] <_joe_> RhinosF1: I am aware, most probably when the server was decommissioned the cert was left over
[06:17:49] most likely. I believe John was going to clean it up and make sure no other old hosts remain. I assume he hasn't started work for today yet, though.
[07:32:04] ugh, C2 at 18:30 UTC today?
[07:32:37] it is a bit crazy, yes
[07:33:26] * Emperor will have to talk nicely to other half, that's going to bugger up our dinner plan
[07:33:28] I am sending an email now as I am starting my holidays today
[07:33:41] And there's no way I am going to be up that late
[07:33:47] I was already working till 18:30 UTC
[07:33:51] yesterday
[07:39:36] email sent
[07:42:23] * Emperor is also away on Friday
[09:10:49] jynus: db2099:3314 I haven't touched that one
[09:11:06] I know
[09:11:24] it's just that it doesn't make sense to have a notification if it is only me and 1 host :-D
[09:29:37] there has been an increase in access denied errors on most s6 core eqiad instances since around 8 UTC (very low rate, but constant)
[09:29:59] that could be: https://phabricator.wikimedia.org/T314528
[09:30:17] Do you have an example error?
[09:30:31] I was looking at graphs, I can look at logs now
[09:31:02] example on 1 host: https://grafana.wikimedia.org/goto/pYOFuPz4k?orgId=1
[09:31:49] It must be related I guess
[09:32:09] taavi: ^ are those hosts really gone? https://phabricator.wikimedia.org/T314528#8130359
[09:32:55] I got the link with the clients
[09:33:08] https://logstash.wikimedia.org/goto/7aa8e35d8d1d325fcd2b6315f56eee3e
[09:33:44] labweb1001, labweb1002, maybe mediawiki-pinkunicorn-*?
[09:34:05] it is difficult as some may be PDU-maintenance related, so do not trust my debugging
[09:34:05] But those hosts are supposed to be gone (per the above task)
[09:34:23] logsource
[09:34:23] labweb1001
[09:34:26] I trust they should be gone :-D
[09:34:27] So it is still trying to connect
[09:34:40] I am just sending metrics here 0:-)
[09:34:47] Let's see if taavi comes back with some ideas
[09:34:55] If not I will re-add the grants
[09:35:29] I saw them because there was an increase in global errors, although it was quite low (but at first I thought it was a general mw issue)
[09:36:42] I detected it at first looking at this on the general graph: https://grafana.wikimedia.org/goto/Cd9fXPz4z?orgId=1 but then realized the rate was too low to be an outage or something
[09:37:07] could be some monitoring or something, idk
[09:38:41] actually, I think I saw it on the App server RED monitoring first: https://grafana.wikimedia.org/goto/5b_8uPk4k?orgId=1
[09:40:52] marostegui: I'll look in a bit, please don't revert just yet
[09:40:57] sure
[09:41:28] because the rate looked so stable it's probably not user-caused, so personally not worried
[09:46:07] I think I am going to shut down db2099 and then move with my laptop somewhere else
[09:47:36] do you want me to do it?
[09:47:39] so you can move already?
[09:48:01] it is ok, don't worry
[09:48:46] did you disable notifications on it, or was it like that normally?
[09:49:10] I did, yes
[09:49:18] ok, cool
[09:49:35] https://gerrit.wikimedia.org/r/c/operations/puppet/+/820371/2/hieradata/hosts/db2099.yaml
[09:51:42] marostegui: I think I fixed it, those hosts still had a systemd timer running which was trying to connect to the database, so I stopped it
[09:52:12] ah good, let's see if the errors stop!
[09:52:14] thanks
[09:54:16] taavi: looks like they've stopped
[09:54:23] thanks
[09:54:27] great, sorry about that
[09:54:42] there are a few remaining ones from mediawiki-pinkunicorn-58f577f89c-qvvr4 to enwiki, but that looks unrelated
[09:54:53] but enwiki has been untouched
[09:55:38] yeah, it seems to be k8s codfw, probably a different issue to discuss with service ops, and it is monitoring, so not super-urgent
[10:04:25] the graphs are there to help you (us) debug, nothing to apologize for :-)
[10:06:13] db2099 down, will now prepare and move in some time
[10:49:20] I may have lunch on the way, be back in ~1h
[10:58:17] I extended downtime of backup2006, backup2009 24 more hours (c2)
[10:58:59] db2126 notified, but that is expected to also happen today (another expiration?)
[11:00:01] not too worried about the notifications, just trying to make sense of all the hosts in maintenance/communication/etc.
[11:00:48] I think I missed that notification
[11:00:51] Let me add it
[11:15:19] _joe_: there is now https://phabricator.wikimedia.org/T314564 for the puppet cert issue in case you haven't seen
[11:15:39] <_joe_> RhinosF1: thanks
[11:15:45] Np
[11:19:11] And fixed now
[11:19:14] 12:18:54 RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[17:30:24] Amir1: ok for me to start up db2102?
[17:30:47] jynus: which rack is it?
[17:30:51] it is a backup test db, but sometimes manuel handles it, so making sure 2 people don't handle it at the same time
[17:30:55] C5
[17:31:08] sure go ahead
[17:31:23] ok, doing
[17:31:34] I won't play with it, FYI I'm making a schema change on, I think, one of the backup sources of s4 right now
[17:31:44] yeah, this is s1
[17:31:44] which might take around ten hours ish
[17:31:56] won't ever touch anything that is core or misc
[17:32:01] unless emergency
[17:35:51] FYI I think in 1 or 2 Qs I will start recovering automatically to those test hosts to finish the backup testing procedures, they have alerts disabled all the time
[17:43:07] test-s1 codfw catching up
[17:43:44] I will focus now on making sure the backup hosts that are coming back up are ok
[19:50:59] OK, everything's back after the power work (aside from a bunch of sad hardware that is going to need fixing), so I'm off on holiday, back Tuesday.
[19:52:24] backups looking good, I think only A has some pending work tomorrow to start up more dbs
[19:52:42] there are some mgmt interfaces complaining, but we can live with that until tomorrow
[19:55:16] g nite
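
A note on the certificate alert that opens and closes the log: a leftover agent certificate from the decommissioned dbstore2001 kept tripping the "Puppet CA expired certs" check on puppetmaster1001 until it was cleaned up (T314564). Purely as an illustration of that kind of check, not the plugin actually in use, here is a minimal Python sketch; the signed-certs path and the 30-day warning window are assumptions.

```python
# Illustrative sketch only (not Wikimedia's actual monitoring plugin): scan a
# Puppet CA's signed-certificate directory and report agent certs that are
# expired or close to expiry, which is roughly what the "Puppet CA expired
# certs" check quoted in the log alerts on.
import datetime
import glob

from cryptography import x509

SIGNED_CERTS_GLOB = "/etc/puppetlabs/puppetserver/ca/signed/*.pem"  # assumed path
WARN_WINDOW = datetime.timedelta(days=30)                           # assumed threshold


def expiring_certs(pattern: str = SIGNED_CERTS_GLOB):
    """Yield (subject, not_after) for signed certs that are expired or expiring soon."""
    now = datetime.datetime.now(datetime.timezone.utc)
    for path in glob.glob(pattern):
        with open(path, "rb") as fh:
            cert = x509.load_pem_x509_certificate(fh.read())
        not_after = cert.not_valid_after.replace(tzinfo=datetime.timezone.utc)
        if not_after - now < WARN_WINDOW:
            yield cert.subject.rfc4514_string(), not_after


if __name__ == "__main__":
    for subject, when in expiring_certs():
        print(f"crit: {subject} expires {when:%Y-%m-%d}")
```

On a Puppet 6+ CA the leftover certificate itself is typically removed with `puppetserver ca clean --certname dbstore2001.codfw.wmnet`; the log does not say exactly how the cleanup in T314564 was done.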
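
The access-denied errors on s6, in turn, were traced to a systemd timer still running on the decommissioned labweb hosts and went away once taavi stopped it. As a generic illustration of how one might spot such leftover timers on a host that should no longer be doing any work (this assumes nothing about Wikimedia's own tooling), a small sketch:

```python
# Illustrative sketch only: enumerate the systemd timers still known on a host
# that is supposed to be decommissioned. In the log, leftover timers on the old
# labweb hosts kept trying to connect to the s6 databases until they were stopped.
import subprocess


def active_timers() -> list[str]:
    """Return the unit names of all systemd timers listed on this host."""
    result = subprocess.run(
        ["systemctl", "list-timers", "--all", "--no-legend", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    units = []
    for line in result.stdout.splitlines():
        # The unit name is the whitespace-separated field ending in ".timer";
        # the other columns (NEXT/LEFT/LAST/PASSED/ACTIVATES) do not carry it.
        unit = next((f for f in line.split() if f.endswith(".timer")), None)
        if unit:
            units.append(unit)
    return units


if __name__ == "__main__":
    for unit in active_timers():
        print(unit)
    # Stopping an offending timer would then be along the lines of:
    #   systemctl stop <unit>.timer && systemctl disable <unit>.timer
```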