[07:03:59] FYI; I'm rebooting apt1001 in a few minutes, this doesn't have practical impact on running systems, but package imports/syncs won't be possible until it's done
[07:15:19] it's back
[08:35:59] facter is always messing with me... it's fair to assume that @processorcount and facter -p processors.count is the same?
[08:36:40] usually the former is the legacy fact and the latter is the structured fact
[08:36:46] to be sure we need to double check i
[08:36:48] *it
[08:37:47] vgutierrez: https://www.puppet.com/docs/puppet/8/core_facts.html#processorcount and https://www.puppet.com/docs/puppet/8/core_facts.html#processors
[08:38:06] s/8/5/ on the links
[08:38:15] for the correct version, but we're moving to 7 too :D
[08:38:36] I'd say yes they are the same
[08:49:18] --show-legacy should fix it too
[08:49:40] https://www.irccloud.com/pastebin/XSAXy1I6/
[08:49:45] cheers volans
[08:52:16] yw :)
[09:02:27] I'd like to reboot cumin1001 in half an hour, would that work for everyone? I noticed there's a cumin run which restarts confd in batches, but it should be complete by then AFAICT
[09:02:40] in the interim, please use cumin2002 for cookbooks/cumin
[09:17:46] !incidents
[09:17:46] 4041 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[09:17:47] 4039 (RESOLVED) HaproxyUnavailable cache_text global sre ()
[09:17:47] 4038 (RESOLVED) VarnishUnavailable global sre (varnish-text)
[09:17:47] 4040 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad)
[09:17:47] 4037 (RESOLVED) [7x] ProbeDown sre (probes/service)
[09:17:47] 4036 (RESOLVED) db1128 (paged)/MariaDB Replica Lag: s1 (paged)
[09:17:48] 4035 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad)
[09:17:48] 4033 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet esams)
[09:38:40] going ahead in 5m
[10:00:52] cumin1001 can be used again
[10:02:38] !incidents
[10:02:39] 4042 (ACKED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service eqiad)
[10:02:39] 4041 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[10:02:39] 4039 (RESOLVED) HaproxyUnavailable cache_text global sre ()
[10:02:39] 4038 (RESOLVED) VarnishUnavailable global sre (varnish-text)
[10:02:39] 4040 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad)
[10:02:40] 4037 (RESOLVED) [7x] ProbeDown sre (probes/service)
[10:02:40] 4036 (RESOLVED) db1128 (paged)/MariaDB Replica Lag: s1 (paged)
[10:02:40] 4035 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad)
[10:02:40] 4033 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet esams)
[11:23:41] I'd like to deploy restbase to disable the wikifeeds alert - last time we did a restbase deploy it had a bit of a woopsie, but the issue of missing nodes etc is resolved so I think it'll be fine
[11:31:45] hnowlan: thx for the advance notice, I'll know who to ping if we get paged :-P
[11:35:00] volans: :D it's much more likely to spam rather than page, but it's restbase, there is a ~*world of possibility*~ in it
[11:36:31] :D
[11:37:24] !incidents
[11:37:24] 4043 (UNACKED) ProbeDown sre (10.2.1.17 ip4 restbase-https:7443 probes/service http_restbase-https_ip4 codfw)
[11:37:24] 4042 (RESOLVED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service eqiad)
[11:37:24] 4041 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[11:37:25] 4039 (RESOLVED) HaproxyUnavailable cache_text global sre ()
[11:37:25] 4038 (RESOLVED) VarnishUnavailable global sre (varnish-text)
[11:37:25] 4040 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad)
[11:37:25] 4037 (RESOLVED) [7x] ProbeDown sre (probes/service)
[11:37:25] 4036 (RESOLVED) db1128 (paged)/MariaDB Replica Lag: s1 (paged)
[11:37:26] 4035 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad)
[11:37:26] 4033 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet esams)
[11:37:31] !ack 4043
[11:37:31] 4043 (ACKED) ProbeDown sre (10.2.1.17 ip4 restbase-https:7443 probes/service http_restbase-https_ip4 codfw)
[11:37:47] hnowlan: how can I help?
[11:38:08] figuring out what's failing in restbase I guess
[11:38:16] this change disabled a minor monitoring check :/
[11:38:41] can we revert?
[11:38:44] yep
[11:41:41] the errors are being caused by envoy failures connecting to mobileapps
[11:41:45] I'm not sure reverting will fix this
[11:42:15] !incidents
[11:42:15] 4043 (ACKED) ProbeDown sre (10.2.1.17 ip4 restbase-https:7443 probes/service http_restbase-https_ip4 codfw)
[11:42:16] 4042 (RESOLVED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service eqiad)
[11:42:16] 4041 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[11:42:16] 4039 (RESOLVED) HaproxyUnavailable cache_text global sre ()
[11:42:16] 4038 (RESOLVED) VarnishUnavailable global sre (varnish-text)
[11:42:16] 4040 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad)
[11:42:17] 4037 (RESOLVED) [7x] ProbeDown sre (probes/service)
[11:42:17] 4036 (RESOLVED) db1128 (paged)/MariaDB Replica Lag: s1 (paged)
[11:42:17] 4035 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad)
[11:42:18] 4033 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet esams)
[11:42:36] This spike in errors starts right with the deploy https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?forceLogin&from=1694691298831&orgId=1&to=1694691719695&var-container_name=All&var-dc=thanos&var-prometheus=k8s&var-service=mobileapps&var-site=eqiad
[11:42:41] I'll try the revert anyway
[11:43:30] so you think restarting made it lose some cached endpoint that it's not able to get back?
[11:44:00] jayme: maybe you have some context if anything might have changed wrt envoy and mobileapps endoing?
[11:44:20] I'm not sure really - some kind of connection issue between it and the service proxy? Something in mobileapps?
[11:44:26] This is similar to the last issue :(
[11:44:43] :(
[11:45:05] where is the exact log you're seeing?
[11:45:28] similar errors in the mobileapps log https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2023.09.14?id=x2aBk4oB2F9ZGV9i4mgR
[11:45:44] example from restbase https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-6-2023.37?id=7Ox-k4oBjC2iVgTO9hVc
[11:46:09] localhost:6012 is the mobileapps port
[11:46:10] not sure about mobileapps in particular but there have been some generic envoy config changes in the past
[11:46:35] I might lack context though and I'm in the middle of reimaging conf2 cluster
[11:46:43] transport failure reason: delayed connect error: 111
[11:46:59] does that 111 ring any bell / is it known?
[11:47:54] not to me
[11:47:56] I'll wait for the rollback to finish. last time a rolling restbase restart fixed things unfortunately
[11:47:57] I think 111 is connection refused in envoy speak
[11:48:52] I see that my cloak request was rejected: "cloak for brouberol requested 4 hours ago is rejected". Is there a way for me to understand why it was rejected in the first place?
[11:49:02] missing server hostname/address?
[11:49:51] rollback done, will wait another minute before just doing restarts
[11:49:57] ack
[11:52:57] I'm not seeing recovery on grafana, checking logstash
[11:53:22] I'd say the same, but I'll let you decide hnowlan
[11:53:29] volans: might be, but the only requested data is username casing and cloak type (first request, refresh, etc). Anyway, it's not a big deal and you're dealing w/ an incident atm. I'll circle back later
[11:53:29] yeah, getting ready to restart
[11:53:49] our recommended depool-restbase script uses the restbase-https service and so doesn't work, oops, gotta fix that
[11:54:43] brouberol: that reply was not for you ;)
[11:54:55] hnowlan: :(
[11:55:00] might be related to a service restart, but that host is depooled
[11:55:09] ah, /facepalm
[11:55:22] ah, no, that's just a general increase since
[11:55:55] is there a cookbook for restbase service restarts? I can see one
[11:55:57] *can't
[11:56:35] hnowlan: | |-- sre.misc-clusters.roll-restart-restbase: Cookbook to perform a rolling restart of Restbase ?
[11:56:43] brouberol: you should ask about that in #wikimedia-ops
[11:57:11] hnowlan: unfortunately not mentioned in https://wikitech.wikimedia.org/wiki/Service_restarts#restbase
[11:57:19] yeah
[11:57:20] recovery now
[11:57:22] not sure if it depools either
[11:57:41] the batch classes do, let me quickly check
[11:58:14] yes, it uses the batch classes, so downtime and LVS depool are included
[11:58:26] hnowlan: it restarts the 'restbase' systemd unit
[11:58:47] hnowlan: what did you do? I'm seeing recoveries
[11:58:55] volans: nothing :|
[11:58:59] !incidents
[11:58:59] 4043 (ACKED) ProbeDown sre (10.2.1.17 ip4 restbase-https:7443 probes/service http_restbase-https_ip4 codfw)
[11:59:00] 4044 (ACKED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams)
[11:59:00] 4042 (RESOLVED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service eqiad)
[11:59:00] 4041 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[11:59:00] 4039 (RESOLVED) HaproxyUnavailable cache_text global sre ()
[11:59:00] 4038 (RESOLVED) VarnishUnavailable global sre (varnish-text)
[11:59:01] 4040 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad)
[11:59:01] 4037 (RESOLVED) [7x] ProbeDown sre (probes/service)
[11:59:01] 4036 (RESOLVED) db1128 (paged)/MariaDB Replica Lag: s1 (paged)
[11:59:02] 4035 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad)
[11:59:02] 4033 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet esams)
[11:59:19] mobileapps still seeing very few requests https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?forceLogin&from=now-3h&orgId=1&to=now&var-container_name=All&var-dc=thanos&var-prometheus=k8s&var-service=mobileapps&var-site=eqiad
[11:59:33] yeah, something got better, but not fully back
[11:59:43] I'll run the cookbook
[12:00:08] ack
[12:00:10] +1
[12:00:47] strangely, 50% of POSTs are back, GETs still at almost 0
[12:01:29] maybe internal traffic recovered but not external, or something like that?
[12:03:50] what's the public impact AFAWK? should we post on the status page?
[12:04:12] it has already been a bit
[12:04:24] marostegui: thoughts?
[12:04:36] mobileapps APIs will definitely be affected
[12:04:37] https://www.mediawiki.org/api/rest_v1/page/html/Project%20talk%3AMastodon?redirect=false 502ing is related to what y'all are discussing, right?
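A quick way to confirm the user-facing impact reported just above is to request the same REST endpoint and look only at the status code. This is a minimal sketch using plain curl options (no Wikimedia-specific tooling); a 5xx here matches the reported symptom, a 200 indicates recovery:

    $ curl -s -o /dev/null -w '%{http_code}\n' 'https://www.mediawiki.org/api/rest_v1/page/html/Project%20talk%3AMastodon?redirect=false'
    # -s silences progress output, -o /dev/null discards the body, -w prints only the HTTP status code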
[12:04:38] I've been seeing errors referring mostly from panels, but only uncached ones
[12:05:01] oh dear, this cookbook is very slow
[12:05:04] math renders
[12:05:35] hnowlan: did you tune any of the sleeps?
[12:05:53] based on NEL - so a limited impact to editors would be useful, I think, volans
[12:05:57] the grace-sleep default though is 5
[12:05:58] volans: no
[12:06:33] volans: I don't know, I think service owners should tell us whether it is worth posting or not :)
[12:07:19] if we could figure out this error I think we'd solve it faster https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-6-2023.37?id=yZeWk4oBt3kDdlqnJd_9
[12:08:02] restart hasn't solved it
[12:08:34] hnowlan: what's upstream in this case?
[12:08:43] I can repro
[12:08:44] $ curl http://localhost:6012/en.wikipedia.org/v1/page/talk/Salt
[12:08:44] upstream connect error or disconnect/reset before headers. reset reason: connection termination
[12:08:49] the discovery service for mobileapps
[12:09:31] could/should we roll restart it? it's the thing serving errors according to grafana (although there's no explanation for why a restbase rollout would affect it)
[12:09:56] $ curl "https://mobileapps.discovery.wmnet/"
[12:09:56] curl: (7) Failed to connect to mobileapps.discovery.wmnet port 443: Connection refused
[12:09:59] grasping at straws a bit, I don't really understand mobileapps
[12:10:19] volans: port 4102 afair
[12:10:31] I think we should at least update the status page to say APIs are impacted
[12:10:51] ok I get a body with an error I guess I'd need a valid URL
[12:10:57] Cannot GET /
[12:11:08] Articles aren't loading on the wikipedia mobile app
[12:11:13] I suppose it's related
[12:11:29] <_joe_> yes it is.
[12:11:34] marostegui: let's post on the status page
[12:11:36] volans: curl https://mobileapps.discovery.wmnet:4102/en.wikipedia.org/v1/page/talk/Salt
[12:11:37] <_joe_> if someone deployed restbase, rollback
[12:11:40] volans: yeah, agree
[12:11:41] _joe_: already done
[12:11:44] no effect
[12:12:58] LOL from wikitech: Mobile Content Service is slated for service decommissioning in July 2023.
[12:12:58] volans: I just posted
[12:13:03] marostegui: thx! <3
[12:13:28] volans: the service restart for it lists scb too heh
[12:13:45] :facepalm:
[12:13:58] <_joe_> volans: it's not "mobileapps"
[12:14:18] <_joe_> hnowlan: yeah try to roll-restart mobileapps, although I doubt it's the culprit
[12:14:22] ack
[12:14:30] <_joe_> question: do we have the same issues in eqiad and codfw?
[12:14:48] <_joe_> has anyone looked at logstash?
[12:14:51] codfw shows some normal looking traffic to mobileapps at least
[12:15:00] <_joe_> uhm
[12:15:46] * mbsantos enters the restbase war room
[12:16:15] reported errors are now going down fast since :10
[12:16:24] <_joe_> hnowlan: uh what port is restbase listening to?
[12:16:49] _joe_: 7231 and 7233 I think
[12:16:58] <_joe_> 7443 is envoy
[12:17:18] <_joe_> ok hnowlan tell me what I'm doing wrong
[12:17:26] <_joe_> $ curl https://restbase.svc.eqiad.wmnet:7443/en.wikipedia.org/api/v1/page/html/Australia
[12:17:28] was tricked by a truncated graph, they are not going down
[12:17:29] <_joe_> curl: (7) Failed to connect to restbase.svc.eqiad.wmnet port 7443: Connection refused
[12:17:52] <_joe_> oh sigh 7433
[12:18:01] <_joe_> no, I was right
[12:18:12] <_joe_> so right now restbase seems not to be reachable via envoy
[12:19:07] mobileapps roll restart in eqiad didn't do anything
[12:19:12] <_joe_> but from a single server, I can do it just fine
[12:19:17] interestingly codfw is looking healthier, errors down https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?forceLogin&from=now-1h&orgId=1&to=now&var-container_name=All&var-dc=thanos&var-prometheus=k8s&var-service=mobileapps&var-site=codfw
[12:19:42] <_joe_> hnowlan: something is wrong with restbase in eqiad I would say
[12:19:56] _joe_: yep, although not the service itself
[12:20:32] <_joe_> hnowlan: I see a lot of errors but they ended some time ago
[12:20:51] stupid question but we don't use this any more correct? https://config-master.wikimedia.org/pybal/eqiad/restbase-https
[12:21:19] <_joe_> hnowlan: sigh
[12:21:22] <_joe_> wat
[12:21:30] <_joe_> we do use it
[12:21:32] <_joe_> that's the issue
[12:21:33] wtf
[12:21:35] <_joe_> fixing
[12:21:44] how did that happen?
[12:21:46] ??!?!?!
[12:22:01] wtaf
[12:22:23] same in codfw
[12:22:32] also don't we have a max depooled hosts limit?
[12:23:03] <_joe_> volans: in pybal yes, but clearly too many were already down
[12:23:25] <_joe_> ok, please find in SAL who did any change to conftool, else I'll look at the etcd audit log
[12:23:41] <_joe_> I bet things work again don't they
[12:23:41] sure but some traffic should pass anyway
[12:23:57] _joe_: I think we need to look at the logs because the restart cookbooks also touched them
[12:24:07] the reason there was traffic in codfw was because this host was pooled https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=restbase2023&var-datasource=thanos&var-cluster=restbase
[12:24:10] to exclude that it was caused by it
[12:24:21] graph recovering
[12:24:26] Apart from the confd restart from this morning I'm not seeing actions
[12:24:35] <_joe_> confd would be unrelated
[12:24:44] I figured
[12:24:57] maybe the deploy hit some bug?
[12:25:03] Should I change the status page back?
[12:25:04] triggered
[12:25:08] but there was also a pybal restart today?
[12:25:23] marostegui: give it a minute I'd say
[12:25:36] marostegui: you can advance to when it says detected
[12:25:41] "but monitoring"
[12:25:50] right
[12:26:02] <_joe_> 13:28:56 ?
[12:26:06] good idea
[12:26:16] <_joe_> ah no sorry
[12:26:22] in the future? :D
[12:26:38] <_joe_> 11:27:47
[12:26:46] <_joe_> and around that time
[12:26:57] <_joe_> I'd say it's a deployment which did depool the servers but not repool them
[12:27:01] <_joe_> if I had to bet
[12:27:04] 11:25 hnowlan@deploy1002: Started deploy [restbase/deploy@e8a6ae4]: Disable wikifeeds announcements healthcheck
[12:27:10] <_joe_> yep
[12:27:18] that's way before the cookbook run
[12:27:30] <_joe_> 11:25:31 is restbase1016.eqiad.wmnet
[12:27:35] <_joe_> which is the canary IIRC
[12:27:37] yep
[12:27:39] page was at 11:37:07
[12:27:40] <_joe_> ok
[12:27:47] <_joe_> hnowlan: so scap3 doing damage
[12:27:58] <_joe_> hnowlan: wanna move restbase to k8s? :P
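Since the root cause being converged on here is hosts left depooled in conftool after the deploy, here is a minimal sketch of how that state can be inspected and restored with confctl. The selector values are illustrative (they assume the usual dc/cluster/service tags for this pool) and should be verified before running anything:

    $ sudo confctl select 'dc=eqiad,cluster=restbase,service=restbase-https' get
    # prints one object per backend with its pooled/weight state
    $ sudo confctl select 'dc=eqiad,cluster=restbase,service=restbase-https' set/pooled=yes
    # repools every matching backend; narrow the selector (e.g. name=restbase1016.eqiad.wmnet) to act on a single host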
[12:28:02] did it error out, or show some logs that it was not repooling things?
[12:28:12] I believe I shall just stop deploying restbase
[12:28:20] _joe_: yes :(
[12:28:44] is there a chance the hosts were all disabled before the deploy and pybal was refusing to remove them?
[12:28:49] <_joe_> no
[12:28:57] <_joe_> it's clear it happened with the deployment
[12:29:01] aha
[12:29:13] (wat)
[12:29:31] <_joe_> so we need to go check what this deployment does
[12:29:34] <_joe_> and why it failed
[12:29:58] yeah trigger vs root cause
[12:30:01] <_joe_> but for now I'd take my post-lunch break
[12:30:55] I think traffic is going back to normal levels, there was a spike for backlog after the fix
[12:31:17] <_joe_> hnowlan: can you save your deployment backscroll?
[12:31:27] <_joe_> we can dive into it later
[12:31:36] I'll see if I still have it
[12:31:54] yeah I got it
[12:32:35] let me double check NEL for closing the status page
[12:32:51] apologies for the noise anyway
[12:33:00] heh ;_;
[12:33:45] marostegui: volans no more errors on NEL (5XX errors) either
[12:33:58] updating the status page
[12:34:05] +1
[12:34:22] ty
[12:35:25] nothing notable in the scrollback unfortunately https://phabricator.wikimedia.org/P52507
[12:36:00] Yeah I was looking at the scap log on deploy1002 and there's nothing useful either
[12:39:57] I wonder if there's been any major scap changes and/or whether this could happen with other services
[12:41:02] hnowlan: Sep. 7th: Installation of scap version "4.59.0" completed for 594 hosts
[12:41:51] sorry, I missed some after that
[12:42:03] 4.61.0 on Sep. 12th is the latest
[12:42:16] if I'm picking the correct "scap"
[12:44:57] at first sight the last commits seem unrelated
[12:45:05] but I just glimpsed them
[12:47:08] yeah not seeing a lot
[12:49:02] so what I would really like to know - did my last attempt to deploy restbase a week or two ago that also caused issues see the same behaviour?
[12:49:46] because I manually depooled, restarted and then repooled nodes - which I imagine would fix the "enabled: false" behaviour
[12:51:19] bbiab, will look further then
[12:52:02] looking at pybal log (which I'm not an expert in so could be wrong), they just failed one after the other and got disabled
[12:52:17] I'm seeing no logs pertaining to the pool limit
[13:03:21] Hey! Is there someone with +2 rights on ircservserv-config to review and merge 2 patches: https://gerrit.wikimedia.org/r/q/owner:guillaume.lederrey%2540wikimedia.org+project:wikimedia/irc/ircservserv-config+status:open
[13:04:52] * TheresNoTime has the buttons, but unsure if they *should*..
[13:05:01] s/should/should +2 the change
[13:06:40] ah, okay, looks ok — will +2
[13:32:49] TheresNoTime: Thanks! I think that you also need to reload the config
[14:00:00] * urandom reads the restbase backscroll
[14:06:19] I'd planned to continue bootstrapping Cassandra on 1030-c, but now I'm properly terrified 🤔
[14:08:01] urandom: I'm 99% certain the above issues don't have much to do with cassandra
[14:08:31] I know, but restbase is a fragile flower that melts down if you look at wrong
[14:08:39] at it, wrong
[14:10:45] _joe_: how did you verify that the conftool disabling happened during that deploy?
[14:10:54] I'd like to check that against the last deploy snafu
[14:11:35] <_joe_> hnowlan: simply looked at the logs from etcdmirror
[14:11:47] <_joe_> but you can deploy to a single node, check
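On checking pool state after a deploy: the pybal view that revealed the problem (12:20 above) can be fetched directly from config-master. A small sketch, assuming the endpoint still lists one backend per line with its enabled flag, as discussed during this incident:

    $ curl -s https://config-master.wikimedia.org/pybal/eqiad/restbase-https
    # a listing where most or all backends show enabled=False is the failure mode seen here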
[14:12:52] hnowlan: you were deploying mobileapps?
[14:14:49] odd that all of the 500s seem to related to citations
[14:14:57] s/to //g
[14:15:57] odd in general terms, for restbase it's another day at the office
[14:21:50] urandom: I was deploying restbase
[14:25:13] hnowlan: ok, the bulk of the errors look like: https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-6-2023.37?id=Uhmnk4oBjC2iVgTOE3Nw
[14:25:29] I don't know how those "internal requests" work though
[14:26:07] Or what an "Unknown error" is for that matter
[14:26:45] ultimately most errors will be a side effect of scap depooling all of the restbase hosts
[14:39:41] <_joe_> yeah that's the issue
[14:39:50] <_joe_> so let me check the scap config for restbase
[14:41:06] <_joe_> ok so it tries the following
[14:41:25] <_joe_> the depool check does run in stage promote and it runs depool-restbase
[14:43:01] <_joe_> and then we run pool-restbase in the restart_service stage in theory
[14:43:23] <_joe_> I'm looking at checks.yaml in restbase/deploy
[14:44:40] <_joe_> hnowlan: so something possibly failed before the restart_service stage?
[15:00:56] _joe_: looking at /srv/deployment/restbase/deploy/scap/log/scap-sync-2023-09-04-0001-1-ge8a6ae47.log - the depool gets run, but the repool never does (weird date format on the file too, but that's today's deploy's file I assure you)
[15:01:22] also if you look at scap-sync-2023-08-01-0001-1-g26bc1a5b.log (my previous deploy that also exploded) the repool isn't run either, so my restart hack unintentionally fixed that issue
[15:01:33] ugh
[15:03:40] anyone deploying new kubernetes hosts? I'm running a makevm cookbook and getting a diff for kubernetes1028.mgmt.eqiad.wmnet
[15:05:24] <_joe_> hnowlan: I would say the problem is that restart_service never gets executed
[15:05:36] <_joe_> I would ask releng for help
[15:06:13] inflatador: j.clark, see SAL
[15:06:25] _joe_: yeah, writing a ticket atm
[15:06:38] volans got it. convo continuing in dcops channel
[15:25:36] filed https://phabricator.wikimedia.org/T346354 with some explanations for the issues we saw earlier if anyone's interested
[15:26:02] I can't see anything we changed on the restbase side of things so there's a chance that other services that follow similar patterns are at risk
[16:01:10] <_joe_> hnowlan: notably wdqs I fear
[16:04:44] heads up on the above ryankemper
[16:08:29] * inflatador just subscribed
[16:09:31] we did a WDQS scap deploy yesterday and didn't notice this behavior
[16:12:03] oh interesting! I'll have a look at the logs and compare
[16:13:06] oh...forgot we have post-deploy steps that do depool/repool. Pretty sure our scap deploys don't touch LVS at all
[16:17:08] ah, you don't use the same pooling logic that we do in our checks either
[16:54:14] jhathaway: volans told me you might know which CA is currently used by puppetdb
[16:55:29] I have a script that connects to puppetdb through a proxy, I was using the CA at modules/profile/files/puppet/ca.production.pem but that is not working anymore with puppetdb1003
[16:58:07] sorry, problem solved: I was connecting to puppetdb1003 but volans told me to use puppetdb-api.discovery.wmnet and that worked!
[17:57:34] dhinus: glad to hear!
[22:25:35] Gerrit seems unresponsive last 2-3 minutes. not getting any HTTP response
[22:25:40] alert just fired I see
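For reference on the scap3 mechanism discussed between 14:41 and 15:05 above: a rough sketch of the kind of depool/repool pair described for checks.yaml in restbase/deploy. The stage names and commands are taken from the conversation; the surrounding layout (the checks:/type: structure) is from memory of the scap3 check format and may not match the real file:

    checks:
      depool:
        type: command
        stage: promote
        command: depool-restbase
      repool:
        type: command
        stage: restart_service
        command: pool-restbase
    # if the restart_service stage never runs (the failure suspected above and tracked in T346354),
    # the repool check never fires and every host is left depooled after the deploy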