[09:03:50] Hello SREs! My pcc report page ends up in a 404 in puppet-compiler.wmflabs.org: https://puppet-compiler.wmflabs.org/output/869166/38863/ Am I doing something wrong? Or is the service down? Please
[09:04:53] aqu: o/
[09:07:19] trying to run pcc again on my side
[09:10:07] mmm yes something looks odd, I get the 404 as well
[09:11:56] * Emperor wonders why sirenbot hasn't got to /topic
[09:12:19] Emperor: because it's not here :)
[09:12:34] it quit earlier today due to a timeout
[09:13:30] https://wikitech.wikimedia.org/wiki/Vopsbot doesn't seem to say where/how it's running or how one should kick it
[09:13:46] aqu: I'd wait for jbond's opinion on this (unless somebody else knows)
[09:16:08] _joe_: vopsbot is not here and I see in the logs:
[09:16:18] lvl=eror msg="could not find the topic for this channel stored. Is the bot in the channel?" id=41cb6407027c89af host=irc.libera.chat:6697 nick=sirenbot error="sql: database is closed"
[09:16:26] volans: I was wondering about systemctl restart...
[09:16:29] Emperor: FYI it's on the alert hosts
[09:16:49] volans: thanks, I had found my way to alert1001 via cumin A:vopsbot :)
[09:19:03] Yeah just restart it
[09:19:38] It needs to be re-given op afterwards iirc
[09:20:52] Thanks elukey !
[09:22:11] looks like chanserv knows to op it
[09:22:21] !oncall-now
[09:22:21] Oncall now for team SRE, rotation business_hours:
[09:22:21] E.mperor, j.ayme
[09:23:15] <_joe_> volans: the bot crashed last week AIUI when I was out
[09:23:24] <_joe_> not sure what happened with the restart though
[09:23:34] !refresh-topic
[09:23:41] <_joe_> that does not work
[09:23:51] <_joe_> it should update it within 5 minutes of restarting
[09:24:06] OK, cool, I'll update the wikitech page to reflect that
[09:46:18] is something broken with PCC? it seems to be generating links that don't work, for example here https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38865/console
[09:48:01] There doesn't seem to be anything newer than 16-Dec-2022 19:58 in https://puppet-compiler.wmflabs.org/output/
[09:53:13] I don't have access to the workers
[09:53:49] dhinus: you around?
[09:53:58] yep
[09:54:11] I see from the wiki that you have admin on that project
[09:54:18] looking
[09:54:27] Thanks <3
[10:01:00] I'm not really familiar with the pcc setup, but I see there are 3 worker VMs and they seem to be up & running
[10:01:33] Trying to find where puppet-compiler.wmflabs.org is pointing
[10:01:57] https://openstack-browser.toolforge.org/project/puppet-diffs says it's pointing at the db1001 VM
[10:02:54] the data is there
[10:03:06] in /srv/jenkins/puppet-compiler/output/868726/38865 on pcc-worker1002 for the last pasted link, for example
[10:04:04] I see the 404s in the nginx log on db1001
[10:04:30] something broken in the nginx config perhaps? looking
[10:05:35] /mnt/nfs/labstore-secondary-project/output/868726/38865/index.html" is not found (2: No such file or directory)
[10:06:59] the last one seems to be Dec 16 19:58 868736
[10:07:36] nevermind
[10:07:48] hm, so are the results served from NFS?
[10:08:44] taavi: yes
[10:10:15] some folders in /mnt/nfs/labstore-secondary-project/output are newer than Dec 16... but apparently some are missing?
[10:10:49] newer in date but the IDs are older, that's what puzzled me
[10:10:57] and the subfolders inside are older
[10:11:08] this is from Dec 18 and it works, but the patch is indeed older https://puppet-compiler.wmflabs.org/output/852210/38269/
[10:11:14] so maybe it's failing to create new dirs?
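
(Roughly, the checks described above boil down to the following on db1001. The output directory and NFS mount point are the ones quoted in this log; the nginx error-log path and the exact command lines are assumptions, a sketch rather than a transcript of what was actually run.)

    # Sketch of the debugging steps discussed above (error-log path assumed).
    tail -n 20 /var/log/nginx/error.log                               # the "... is not found" entries name the missing path
    ls -ld /mnt/nfs/labstore-secondary-project/output/868726/38865    # if the run directory is missing, nginx answers 404
    df -i /mnt/nfs/labstore-secondary-project                         # inode usage on the NFS share (around 1% here)
    mkdir /mnt/nfs/labstore-secondary-project/output/write-test && \
      rmdir /mnt/nfs/labstore-secondary-project/output/write-test     # the share still accepts new dirs and files
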
[10:11:18] so I guessed that the auto-clean might have updated the dirs, deleting old ones inside
[10:11:36] ah that's also possible
[10:12:08] I can create a dir and inodes are just at 1%
[10:12:27] can also create a file
[10:13:39] * jbond here, looking
[10:24:57] now the links seem to be working (even the ones that were not working before)
[10:25:07] jbond: did you fix something?
[10:26:39] dhinus: yes, it should all be working now, still need to dig into why it broke
[10:26:59] cheers
[10:27:43] taavi: fyi pcc should be working again, including the old links
[10:30:09] Thanks jbond !
[10:34:21] cheers :)
[12:03:42] FYI, I'm briefly disabling Puppet on hosts with nginx to roll out a patch
[12:04:27] Ack, thanks.
[12:11:31] all clear, Puppet has been re-enabled
[14:51:07] jbond: thanks a lot for the 'Hosts: auto' xd, it's awesome
[14:59:36] :) no problem
[15:43:52] dcaro: I've accidentally spotted this in our ATS instances: WARNING: SNI (labtesttoolsadmin.wikimedia.org) not in certificate. Action=Terminate server=cloudweb2002-dev.wikimedia.org(208.80.153.41)
[15:44:30] dcaro: ATS refuses to connect to cloudweb2002-dev.wm.o to serve labtesttoolsadmin.wikimedia.org because labtesttoolsadmin.wikimedia.org isn't on the SNI list of the TLS cert in use
[15:45:52] Interesting, I think we might want it to be. Is that a new behavior?
[15:45:55] 610665beb73231a96423b4684403dd5fe3af1413 seems to be related /cc andrewbogott
[15:46:14] dcaro: wdym? that's currently triggering 503 errors in the CDN :)
[15:47:29] I was wondering if it was a recent change that broke it, or if it has been failing for a while
[15:47:36] vgutierrez: I think he means, has that warning been firing since 2020?
[15:47:54] Anyway, let's just add that to the SNI list.
[15:48:13] yup.. it's not new at all
[15:49:12] I think it might have been a typo xd (labtesttooladmin vs labtesttool_s_admin)
[15:49:39] well.. an underscore isn't a valid character in a hostname
[15:50:42] Could someone close this puppet change? (I don't have the rights to do so; the person is no longer at WMF and the task is closed.) https://gerrit.wikimedia.org/r/c/operations/puppet/+/608434/
[15:50:48] I was just highlighting the 's' ;)
[15:51:10] vgutierrez: andrewbogott https://gerrit.wikimedia.org/r/c/operations/puppet/+/869235
[15:51:34] oh :)
[15:51:41] Krinkle: you mean abandon?
[15:51:44] (or merge)
[15:51:56] abandon indeed
[15:52:20] done
[15:53:03] dcaro: that is a very straightforward fix :)
[15:53:30] vgutierrez can decide whether or not we merge during the code freeze
[15:53:53] :?
[15:54:09] that's gonna trigger a TLS cert issuance impacting cloud instances
[15:54:18] dunno how I'm suited to evaluate that :)
[15:55:00] well, it only barely brushes up against the CDN. I'm definitely fine with it affecting cloudweb hosts.
[15:55:21] what is the current damage? what happens if we wait?
[15:56:26] labtesttoolsadmin.wikimedia.org is broken and triggers 503s
[15:56:58] it being broken is not a big issue (it's been broken for a while), what about the 503s?
[15:57:16] as in, what impact do they have?
[15:57:34] nothing.. just useless requests from the CDN to cloudweb-2001 and back
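
(A generic way to confirm the mismatch described above, not necessarily how ATS or the linked patch checks it, is to fetch the certificate the backend presents for that server name and inspect its Subject Alternative Names.)

    # Ask cloudweb2002-dev for the cert it serves for the failing SNI and list the SANs.
    echo | openssl s_client -connect cloudweb2002-dev.wikimedia.org:443 \
        -servername labtesttoolsadmin.wikimedia.org 2>/dev/null \
      | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
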
[15:58:40] xd, so if you have any concerns about the merge, I'd wait :)
[16:01:44] yep, let's wait unless knowing those 503s are happening is keeping someone up at night
[16:05:09] dcaro: no concerns at all
[16:14:02] <_joe_> jhathaway: nullmailer was my first choice
[16:14:24] <_joe_> but it's a full smtpd and needs a daemon running, which means we need a separate container for it
[16:14:31] <_joe_> I *really* wanted to avoid that
[16:38:10] _joe_: yeah, totally understandable
[16:38:36] I would rather invest time in an LVS mail service
[17:11:39] urandom/cdanis: nothing to report from the EU shift (cc Emperor)
[17:12:32] FYI I'm grabbing cdanis's oncall shift due to a sick day; the override doesn't start until :30 due to VictorOps limitations, but I'll be keeping an eye out in the meantime (cc urandom)
[17:13:03] rzl: whew! Thanks!