[08:32:28] hello on-callers
[08:33:00] I am going to make a puppet private commit on puppetserver1001 to test if everything propagates correctly to the other nodes (as opposed to using puppetmaster1001)
[08:33:18] for everybody - please refrain from doing puppet private commits for now
[08:36:53] you can puppet-merge freely
[08:37:00] of course the commit didn't work, more details in https://phabricator.wikimedia.org/T368023#9972525
[08:37:17] I'll git reset --hard HEAD~1
[08:41:23] in the README it is written to do git reset --hard HEAD^
[08:44:16] ok, HEAD^ and HEAD~1 both mean "first parent of HEAD", so they do the same thing here; they only differ with a numeric suffix on a merge commit (e.g. HEAD^2 is the second parent)
[08:44:19] doing it
[08:53:24] ok made a little change to README on puppetmaster1001 and committed, all good
[08:53:45] you can resume puppet private usage, I'll work on the puppetserver's post-commit that failed
[08:53:55] (I think I know what went wrong but I need to check)
[09:04:19] IIUC the fix should be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053623
[09:04:31] if anybody wants to chime in feel free, more than welcome :)
[09:18:14] GitLab needs a short maintenance break in around an hour
[09:47:57] we will be moving over Brazil to magru shortly, in ~10 mins at 10:00 UTC
[09:56:23] sukhe: 🍿
[09:56:35] :drumroll:
[09:56:43] 🥁?
[09:57:12] vgutierrez: that looked like ramen to me for a second, won't lie
[09:59:03] 🍜 != 🥁
[09:59:09] nice
[09:59:11] * vgutierrez feeling like c.danis
[09:59:32] here goes
[10:02:03] ok :)
[10:04:58] https://grafana.wikimedia.org/goto/Tulvc-_SR?orgId=1
[10:05:07] for those who want to follow
[10:05:16] 07 AM in .br.. slow ramp-up :)
[10:05:26] * sukhe goes and annotates some dashboards
[10:20:25] people are going to wake up with a super fast wikipedia :) congrats
[10:21:04] indeed!
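A minimal scratch-repo sketch of the HEAD^ vs HEAD~1 exchange above — this is a throwaway demo repo, not the actual puppet private repo, but it shows why the README's `git reset --hard HEAD^` and the proposed `HEAD~1` are interchangeable: both name the first parent of HEAD.

```shell
# Demo in a temporary repo: HEAD^ and HEAD~1 resolve to the same commit.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=t -c user.email=t@example.org commit -q --allow-empty -m "first"
git -c user.name=t -c user.email=t@example.org commit -q --allow-empty -m "second (to undo)"
# Identical for any commit, merge or not: ^ and ~1 both mean "first parent".
test "$(git rev-parse HEAD^)" = "$(git rev-parse HEAD~1)" && echo "equivalent"
# They only diverge with a numeric suffix, e.g. HEAD^2 = second parent of a merge.
git reset -q --hard HEAD~1
git log --format=%s   # -> first
```

So for undoing a single non-merge commit, as done here, either spelling is safe.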
[10:25:35] 👏
[10:29:18] GitLab maintenance finished
[11:08:28] fyi, the DNS cookbook is broken right now, I'm working on it
[11:08:46] ok thanks
[11:12:13] heads-up, we are about to disable parsoid pregeneration in restbase via changeprop
[11:16:07] sukhe, elukey, volans: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053640
[11:17:10] The dns.netbox cookbook tries to run that command on the not-yet-production Netbox servers: `runuser -u netbox -- git -C "/srv/netbox-exports/dns.git" fetch netbox1002.eqiad.wmnet master:master` -> fatal: not a git repository (or any of the parent directories): .git
[11:17:39] that being netbox1002?
[11:17:44] the other possible fix is to init the repo, but not sure yet why it's not done by puppet
[11:18:03] netbox1/2002 are the current prod, and 1/2003 are the new ones being set up
[11:21:51] thx!
[11:26:20] let me check
[11:26:43] it uses NETBOX_HOSTS_QUERY = 'A:netbox'
[11:27:03] and then gets the other netbox host as
[11:27:03] passive_netbox_hosts = remote.query(str(netbox_hosts.hosts - netbox_host.hosts))
[11:27:31] that's to sync the auto-generated dns repo between the two netbox hosts
[11:27:38] hence the issue
[11:27:58] I don't recall if that's a setup that scales easily to more hosts or not, to be checked
[11:29:55] confirmed that the fix worked
[11:30:30] I don't see how the repo gets initialized, and don't remember how we did it in the past
[11:30:37] or if we ever moved it
[11:32:08] I don't either, could it be it was manual? I'm sure the homer one is puppetized. Ideally the whole thing should be moved to use spicerack's reposync module, which would move the repo to the cumin hosts
[11:32:20] although it would require some additional steps
[11:32:25] and refactoring
[11:32:35] is there a task?
[11:33:17] don't think so
[11:34:11] could you dump your ideas/suggestions in a task when you have a minute?
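A hedged sketch of the "init the repo" option mentioned at 11:17:44: give the dns.netbox cookbook a bare repository to fetch into. The `/srv/netbox-exports/dns.git` path and the `runuser -u netbox` wrapper are as quoted above; here `REPO` defaults to a temp directory so the sketch is runnable anywhere, and the merged Gerrit change 1053640 may well have fixed the problem differently (by excluding the new hosts rather than initializing them).

```shell
# Create the bare repo the cookbook's fetch expects to find.
# REPO stands in for /srv/netbox-exports/dns.git on netbox1003/2003.
REPO="${REPO:-$(mktemp -d)/dns.git}"
git init -q --bare "$REPO"
git -C "$REPO" rev-parse --is-bare-repository   # -> true
# With the repo in place, the previously failing command has something to act on:
#   runuser -u netbox -- git -C "$REPO" fetch netbox1002.eqiad.wmnet master:master
```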
maybe I can work on it after the netbox 4 upgrade
[11:36:49] I'll try to recollect the ideas around that, ok :)
[11:37:30] thx! no rush
[12:47:49] oncallers: I'm temp-disabling benthos on centrallog2002 and having benthos on centrallog1002 take all the load for webrequest-live-sampled; context is https://phabricator.wikimedia.org/T369737
[13:28:56] benthos is active again on 2002, though as a result there's kafka lag for ~half the partitions and it is being consumed
[13:29:11] err, reducing rather than consumed
[13:30:58] hmm is gitlab struggling somehow? pull/push seems kinda slow
[13:33:34] latest push here: real 0m40.403s
[13:33:57] and it doesn't look like a networking issue: Writing objects: 100% (23/23), 8.42 KiB | 8.42 MiB/s, done.
[13:51:26] (back to normal now)
[15:30:56] on call: nothing to report
[16:08:49] does anyone happen to know authoritatively (ha) what gdnsd will respond with for an active-active discovery service if all DCs are depooled? I can't seem to figure out from the docs whether it's NXDOMAIN or SERVFAIL :)
[16:08:49] (active-passive is "easy" since we wrap it in a metafo resource that falls back to failoid)
[16:10:42] swfrench-wmf: it's a good question and I don't have an answer. it should be SERVFAIL IMO but I can't confirm that either
[16:10:50] we can introduce a dummy service and find out I guess :)
[16:14:12] thanks, sukhe! that might actually be the best way to find out, hehe
[16:22:21] swfrench-wmf: happy to +1 that if you want to give it a shot -- worth documenting it as well
[16:54:21] I'd like to roll out a config change, any objections? It enables an experimental special page on testwiki: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1053723
[17:03:04] going once....
[17:05:30] going twice...
[17:08:52] ok, on it.
[17:15:46] scap is failing with "Check 'check_testservers_baremetal' failed"
[17:16:06] Is that because the baremetal servers have been removed? I recall reading something like that.
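A hedged sketch of the "introduce a dummy service and find out" experiment discussed at 16:10: query the authoritative server for the test record and read the RCODE out of dig's header. The record name `dummy.discovery.wmnet` and the nameserver are illustrative placeholders — the real answer (NXDOMAIN vs SERVFAIL when all DCs are depooled) has to come from actually running the experiment.

```shell
# Extract the RCODE ("status: ...") from a dig +comments header line.
rcode() {
  awk -F'status: ' '/status:/ {sub(/,.*/, "", $2); print $2}'
}
# Live check (needs access to the authoritative servers, so shown commented out):
#   dig +noall +comments @ns0.wikimedia.org dummy.discovery.wmnet A | rcode
# Demo of the parsing on a canned dig header line:
printf ';; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 6\n' | rcode   # -> SERVFAIL
```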
[17:16:15] Is it safe to continue with the deployment?
[17:16:52] I'm getting things like this:
[17:16:56] https://www.irccloud.com/pastebin/B6m56SRE/
[17:18:15] duesen: no, you should abort
[17:18:38] the baremetal mwdebug hosts are still running, and the tests are all passing -- that probably indicates a problem with your change that doesn't exist in prod
[17:18:47] that's odd, I still get that 301 when doing a curl
[17:19:03] curl https://secure.wikimedia.org/otrs/ seems to be normal
[17:19:10] it might be a false alarm, but at the least you should investigate further, not ignore it and deploy
[17:19:12] 301, not 404
[17:19:14] ok, aborted
[17:19:21] i'll post the full output here
[17:19:47] https://www.irccloud.com/pastebin/3TycSG8r/
[17:20:48] rzl: I don't really know what to investigate, or how. The patch doesn't really have much potential for screwing up infra; all it does is set a global variable for MW: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1053723
[17:21:04] I'm looking :) just a minute
[17:21:19] thanks!
[17:21:24] let me know if there's anything i can do
[17:21:41] interesting that you're only getting errors on three out of the four debug hosts
[17:21:55] (and by sheer bad luck, the fourth is the one I tested)
[17:22:07] it's all about secure.wikimedia.org. I can confirm that with httpbb from deploy1002 to mwdebug2002
[17:22:21] so, the tests are still failing *after* your rollback, which confirms your change isn't the culprit
[17:22:50] you could go ahead if you're in a hurry, but if you have time I'd like to fix the tests first so they can still catch any actual problems
[17:22:57] if that turns out to take very long we'll skip that
[17:24:21] the tests are passing on mwdebug on k8s, so it really is only those three bare-metal hosts
[17:26:22] It's not urgent. It would be cool if I could deploy this before I end my day in a couple of hours. I can also do it tomorrow, or next week.
[17:26:50] Deb asked on Slack for cool stuff to share with the community, and this would be cool :)
[17:27:24] sure :) I'll definitely get you unblocked before then, one way or the other
[17:27:33] it's also on mw2299
[17:30:10] there is now a merged change in the config repo that isn't deployed. Is that ok, or should i merge a revert?
[17:34:45] rzl: could it be from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052128
[17:34:49] that was merged today
[17:34:57] and secure.wm.org uses the RUN_PORT variable
[17:35:09] unlike the other vhosts
[17:35:27] ha, I just found the same patch and was about to link it
[17:35:31] yeah I think that's likely
[17:35:38] the variable is not expanded?
[17:35:42] in the actual apache config
[17:35:47] I'm still not sure what's different between hosts though
[17:37:13] duesen: either way, go ahead and deploy your change while we figure this out, thanks for your patience
[17:37:18] maybe he just picked secure.wm as a test before changing it on all..
[17:37:31] no, I mean, actual hosts -- why does it work on mwdebug1001 but not 1002
[17:38:09] duesen: when you get to the httpbb failures, it's okay to continue -- just please double-check that there are no *new* failures
[17:39:09] ok, I'll give it another go
[17:39:38] rzl: on 1001 apache has been running 7 hours but on 1002 over a month
[17:39:49] let's restart it on 1002 and see if it changes?
[17:39:52] ahh, good catch
[17:40:01] yeah, I'll wait until duesen is done and then try that
[17:40:04] ack
[17:40:53] (I think the puppet-triggered config reload should have been sufficient but I vaguely recall that's hit-or-miss for some kinds of apache changes)
[17:42:24] same. there are some edge cases I think
[17:42:56] probably because it changes the VirtualHost line itself
[17:43:57] yeah I'd buy that
[17:46:39] got the same errors again. continuing
[17:47:19] 👍
[18:03:33] all went well, thank you!
[18:04:43] thanks again for the patience!
sorry for the speed bump, we'll get it fixed
[18:05:05] I just don't want to get in the habit of ignoring that warning; when it's *not* a false alarm it's an outage preventer :)
[21:45:44] mutante: in case you're curious, the apache restart didn't fix the tests on 1002 :) the plot thickens, still digging
[21:51:00] rzl: try now!
[21:51:37] what did you change?
[21:51:55] restarted apache with systemctl
[21:52:51] cool, that worked :) I'll do the same on the other hosts
[21:52:58] :)
[21:53:00] next time maybe let's coordinate a little more, I don't like that we were making changes on the same host at the same time
[21:53:04] but thank you, that really helped
[21:53:36] ok, yes
[21:55:11] rzl@cumin1002:~$ httpbb /srv/deployment/httpbb-tests/appserver/* --host mwdebug100\[1-2\].eqiad.wmnet,mwdebug200\[1-2\].codfw.wmnet
[21:55:11] Sending to 4 hosts...
[21:55:11] PASS: 132 requests sent to each of 4 hosts. All assertions passed.
[21:55:12] \o/
[21:55:56] I had seen it still had the "1 months something" in "systemctl status". that made me do that
[21:56:01] ahh cool