[08:45:18] heads-up for today's codfw row C upgrade, there are a few outstanding hosts in https://phabricator.wikimedia.org/T334049 (WMCS cc balloons, data-engineering cc btullis, sessionstore2002 cc urandom, o11y cc godog, DBAs cc kwakuofori, service ops cc akosiaris)
[09:00:38] XioNoX: Thanks for the heads-up. I will get on it now.
[09:01:37] no pb! let me know if I can be of any help. The switch prep work went fine, so the last step is the reboot :)
[09:18:04] I'm getting errors from any task page in phabricator. Is this affecting others too?
[09:18:19] see -operations
[09:18:27] Ah yes, thanks.
[09:19:11] apparently we have created a dependency, phabricator depends on gitlab... this probably needs revisiting
[09:21:07] +1
[09:45:17] volans: do we have cookbooks that commit code to the puppet repo?
[09:46:06] no, on purpose so far
[09:46:13] AFAIK we don't even have a system user for that
[09:47:25] I'm asking because I'd like to automate switching from varnish handling port 80 to haproxy handling port 80
[09:48:27] do you have an example patch? maybe we can come up with a different approach
[09:48:30] so that requires a commit to make varnish stop listening on port 80, run puppet, restart varnish, and then another commit to let haproxy handle port 80
[09:48:40] volans: indeed, one sec
[09:48:53] volans: https://gerrit.wikimedia.org/r/q/topic:T322774
[09:53:07] I'll bother you in private for more details :)
[10:20:23] XioNoX: was there a specific DBA one which has not been taken care of? Manuel prepped for this last week as far as I know
[11:00:07] XioNoX: codfw depooled
[11:01:06] ccccccktekkjnvtlrtghbbgjlugljikgetvienevnfgv
[11:02:17] let's play again: cat or yubikey?
[11:02:39] :-P
[11:29:11] yes :D
[11:29:59] in this case, I think it may have been petting the yubikey while reaching to pet the cat :D
[11:30:30] sorry :D
[11:32:21] cat tax: https://usercontent.irccloud-cdn.com/file/i7gJ3b2i/20230502_130627.jpg
[11:44:43] XioNoX: thank you, yeah I think we're good, I'll update the task
[12:02:16] kwakuofori: see the data persistence table in https://phabricator.wikimedia.org/T334049 cassandra-dev2002 (I guess we don't care), moss-fe2001, ms-be[2042,2048-2049,2054-2055,2058,2064,2068,2072], ms-fe2011, thanos-be2003
[12:46:25] XioNoX: with codfw depooled, we should be OK on all of those (and sessionstore2002); I've updated the ticket accordingly
[12:46:41] urandom: thanks!
[12:55:10] volans, godog, do you remember how to work around the message "Host alert2001 was not found in Icinga status" in the downtime cookbook?
[12:55:29] XioNoX: I do not :(
[12:55:32] I think it's something like just removing the host from the targets, but I forgot
[12:58:57] ORES machines affected in codfw (2005, 2006) have been depooled
[13:01:20] got it `cumin1001:~$ sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row C upgrade" -t T334049 'P{P:netbox::host%location ~ "C.*codfw"} and not D{alert2001.wikimedia.org}'`
[13:01:20] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049
[13:01:58] I believe that as soon as the downtime is done, we're good to reboot the switches
[13:02:04] is there any reason to wait
[13:02:05] ?
[13:04:12] going once
[13:04:25] going twice
[13:04:44] let's go!
[13:05:01] System going down in 1 minute
[13:08:07] XioNoX: '... and not P{alert2001*}'
[13:08:35] volans: thanks, I RTFM :)
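For reference, the working form of the downtime command with volans' suggested exclusion folded in might look like the sketch below. It is assembled from the two messages above rather than re-tested, and the query grammar is taken as-is from the log.

    # Sketch: downtime every codfw row C host known to Netbox for two hours,
    # excluding alert2001, which triggered the "not found in Icinga status" error.
    sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row C upgrade" -t T334049 \
        'P{P:netbox::host%location ~ "C.*codfw"} and not P{alert2001*}'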
[13:10:39] akosiaris: any idea what caused that?
[13:10:48] codfw switch ?
[13:11:09] it's the docker-registry in codfw that is complaining, I am assuming it's the switch maint
[13:11:14] akosiaris: I mean was it expected, missed downtime?
[13:11:27] or just no big deal?
[13:11:30] XioNoX: probably missed downtime
[13:11:38] it's VMs
[13:11:46] so probably not downtimed?
[13:12:06] ok
[13:12:34] They must be on the ganeti hosts that are in row C
[13:12:58] akosiaris: stupid question - the docker registry is not pooled in eqiad right now, and it seems that we got a page for codfw, is it still available?
[13:14:07] running deployment-chart's CI locally doesn't work for me right now
[13:14:16] docker: Error response from daemon: received unexpected HTTP status: 503 Service Unavailable.
[13:14:44] fyi, upgrade is going as expected so far
[13:14:58] elukey: it is not available right now
[13:15:03] and it's on purpose NOT pooled in eqiad
[13:15:24] because there is a non-negligible risk of corruption if we pool it
[13:15:28] yeah but my question is - shouldn't we have failed over?
[13:16:01] ah ok, didn't know that.. does it mean that if we have to fail over at some point it may be a risk?
[13:16:10] the registry? for a pretty disruptive switch maint ?
[13:16:10] (asking to understand what to do in case it gets on fire)
[13:16:15] probably not worth it
[13:16:27] the thing that breaks is deployments AFAIK
[13:16:32] 3/7 up
[13:16:58] not sure why CI isn't running locally for you, once you've got the image locally you shouldn't need to contact the registry ?
[13:17:37] that is not the problem, I just wanted to know what to do in case the codfw registry hosts go on fire :)
[13:17:53] I see all 8 machines back pinging for ML
[13:18:33] there's 2 of them per DC, and both are pooled (per DC). If all of codfw goes haywire, that's a different discussion
[13:18:48] but frankly, the registry is not going to be your problem if that happens
[13:18:51] switches are back up and healthy
[13:18:58] Lotsa alert noise as services are half-back/recovering
[13:19:05] akosiaris: ok I gather that I'll read it on wikitech :)
[13:19:12] (ProbeDown, KubernetesCalicoDown)
[13:19:29] registry is happy again
[13:20:01] Waiting to re-pool the ORES machines until the recovery noise has cleared
[13:21:27] effie: I'd say wait for like 10-15m and then pool codfw ?
[13:22:46] I will do so in about an hour, I have a meeting
[13:22:48] +1
[13:23:15] going to wait a bit and repool codfw for DNS
[13:25:29] ML (ORES) machines repooled
[13:29:41] effie: wait for Amir1's OK, to make sure the DBs have caught up before repooling
[13:30:01] cool thank you!
[13:30:02] on it
[13:30:09] s1 is back
[13:30:18] I restarted replication from the eqiad master
[13:30:29] they should be ok quickly, but sometimes they stop with an error, so better to check
[13:31:09] overview of the clusters looks ok to me https://orchestrator.wikimedia.org/web/clusters
[13:31:14] Amir1: I won't repool sooner than an hour from now, no rush
[13:31:22] or at least 45'
[13:31:24] the red and yellow ones are eqiad, which I'm fixing
[13:32:54] what I'm fixing is eqiad, it's not really even codfw and shouldn't block the repool. So don't worry
[13:33:20] yeah, I was only referring to codfw replicas
[13:35:11] yup, orch says the coast is clear, the logs are clean, life is good
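Amir1 did the verification through Orchestrator; purely as an illustration, a manual spot check on a single codfw replica could look like the sketch below. It assumes shell access on the replica and socket authentication for the MariaDB client, and exact field names vary a little between MariaDB and MySQL versions.

    # Hypothetical manual check, run on one codfw replica: both replication threads
    # should be running, lag close to zero, and no lingering SQL error.
    sudo mysql -e 'SHOW SLAVE STATUS\G' \
        | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master|Last_SQL_Error'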
[13:36:07] I know the proxies were passive, but should I reloaded for alert health?
[13:36:13] *reload them
[13:36:39] sure. I keep forgetting the dbproxies
[13:36:51] ok, doing dbproxy2003 dbproxy2004
[13:37:04] Thanks :***
[13:37:11] in theory nothing wrong should happen, but it doesn't hurt to check twice!
[13:43:40] ok, I see no weird alerts on icinga for codfw dbs
[13:44:30] parsoid probes have been down since 13:38 in codfw
[13:45:15] godog: thanks a lot for T333204 !
[13:45:15] T333204: AlertManager: Permanently silence -dev and -test hosts - https://phabricator.wikimedia.org/T333204
[13:45:53] XioNoX: sure np! was easy enough :)
[14:00:53] does anyone know where I'd change dsh_targets for scap? Right now it just says 'wdqs' for mine, I assume it is somewhere in Puppet?
[14:02:07] inflatador: https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/deploy/+/refs/heads/master/scap/scap.cfg
[14:06:03] gehel: I'm still not clear on where the list of hosts that is considered 'wdqs' comes from
[14:06:09] inflatador: also, the wdqs group gets automatically populated on deployment hosts via confd. So you can use confctl (https://wikitech.wikimedia.org/wiki/Conftool) to set a host to inactive to have it removed
[14:06:39] Excellent, thanks akosiaris
[14:06:59] note that pooled/depooled means they just don't receive traffic. inactive means they are removed from scap too (and in fact from anything that gets populated via confd)
[14:08:36] Can someone double-check https://gerrit.wikimedia.org/r/c/operations/puppet/+/914339 and the associated changes in private ?
[14:08:48] https://phabricator.wikimedia.org/T313227
[14:11:18] inflatador: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/scap/dsh.yaml
[14:12:08] Ah OK, that shows it's gathered from conftool per Alex's comment
[14:12:21] claime: why are we adding wikifunctions.org to private certs again ?
[14:12:45] ah scratch that, I remember
[14:12:57] ATS ftw
[14:17:08] claime: -1ed, parsoid certs missing
[14:17:20] ack
[14:19:12] there is also a rendering.svc..wmnet cert in there
[14:19:19] that apparently we no longer use ?
[14:19:29] Anyway, that would be an unrelated cleanup step
[14:21:03] <_joe_> any zuul experts here?
[14:22:09] <_joe_> ah it's a known issue
[14:22:26] akosiaris: added
[14:29:33] thx
[14:49:21] Amir1: am I good to pool codfw ?
[14:49:32] effie: yes from my side
[14:49:54] cheers dear, thanx
[15:28:11] I'll be afk for a bit.
[21:12:24] Is it known that the search seems to be broken at https://doc.wikimedia.org/spicerack/master/search.html ?
[21:19:24] brett: no, I didn't know, thanks for noticing. That seems to be https://github.com/readthedocs/sphinx_rtd_theme/issues/1452, I've just sent a patch at https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/914409
[21:20:04] and it works locally, so I'm gonna merge it so that it also fixes doc.w.o
[21:37:59] brett: that's now fixed (force a refresh/uncache), I'll fix the other projects that use the same doc settings tomorrow
[23:35:43] volans: Thanks so much :)
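Coming back to the scap targets question from 14:06, akosiaris' suggestion of marking a host inactive via confctl might look roughly like the sketch below. The host name is a placeholder and the selector is an assumption based on the Conftool page linked above, so treat it as illustrative rather than a verified command.

    # Hypothetical example: inspect a host's conftool state, then mark it inactive
    # so it drops out of confd-generated lists such as the scap 'wdqs' dsh targets.
    # wdqs2001.codfw.wmnet is a placeholder host name.
    sudo confctl select 'name=wdqs2001.codfw.wmnet' get
    sudo confctl select 'name=wdqs2001.codfw.wmnet' set/pooled=inactive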