[08:39:41] serviceops, envoy: Puppet doesn't self-recover with a zero-byte /etc/envoy/envoy.yaml - https://phabricator.wikimedia.org/T346129 (fgiunchedi)
[08:44:35] serviceops, envoy: Puppet doesn't self-recover with a zero-byte /etc/envoy/envoy.yaml - https://phabricator.wikimedia.org/T346129 (fgiunchedi) First puppet run does indeed create `/etc/envoy/envoy.yaml` if it isn't present, trying to fix its permissions ` Notice: /Stage[main]/Envoyproxy/File[/etc/envoy/env...
[08:53:56] serviceops, envoy: Puppet doesn't self-recover with a zero-byte /etc/envoy/envoy.yaml - https://phabricator.wikimedia.org/T346129 (fgiunchedi) This is preventing zero-touch reimage of hosts running envoy AFAICS. Two solutions I can think of: 1. teach puppet to fix permissions on `envoy.yaml` only if it...
[08:54:33] I'd like your take on ^, I'd like to fix that today because manually fixing hosts post-provisioning is not fun
[08:56:55] serviceops, SRE, Patch-For-Review: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (JMeybohm) a:JMeybohm
[08:58:05] actually the solutions won't fully fix the problem, updating task
[08:58:06] godog: oh, puppet and 0 byte files
[08:58:14] The dynamic duo
[08:59:12] lolz claime
[08:59:23] yeah it is a common one now :(
[08:59:56] or rather, well-known the vicious circle of file + notify -> exec'd notify fails and never called again
[09:00:01] unless file changes
[09:00:42] serviceops, envoy: Puppet doesn't self-recover with a zero-byte /etc/envoy/envoy.yaml - https://phabricator.wikimedia.org/T346129 (fgiunchedi) There's also a related problem, which is more a puppet one, that if the build-envoy-config exec fails (like it happens on the first puppet run) it is never retrie...
[09:00:46] ok ^ explains the problem better
[09:02:57] wait wouldn't that be fixed by just making the file resource ordered after the build-envoy-config call?
[09:03:14] (disregarding the exec on change issue)
[09:03:30] (just talking about the 0-byte file)
[09:03:53] brb E_NOTENOUGHCOFFEE
[09:03:59] I don't think so, my apologies I've muddied the waters there with zero-byte file
[09:04:48] the issue is the build-envoy-config never being called again on failure, unless some of its component files change
[09:05:28] or said otherwise, even if the zero-byte file wasn't there puppet wouldn't be able to recover by itself
[09:06:03] updating the task
[09:06:34] serviceops, envoy: Puppet doesn't self-recover on build-envoy-config failure - https://phabricator.wikimedia.org/T346129 (fgiunchedi)
[09:07:16] godog: ok let me see if I understand the complete problem
[09:07:43] claime: yeah https://phabricator.wikimedia.org/T346129#9162701 should help
[09:07:58] permissions on /var/log/envoy break verify-envoy-config on first puppet run
[09:08:12] then it's never called again because the config does not change
[09:08:17] that's correct
[09:09:48] So two things need to be done, 1) fix the permissions so that it doesn't fail on first run 2) implement a diffcheck that isn't puppet's in build-envoy-config
[09:10:31] indeed, with 2) it'll "fix" 1) too
[09:10:45] Well not really
[09:10:46] I believe (and rightfully so IMO) we've given up on first puppet run working
[09:10:53] tsk
[09:10:55] :p
[09:11:33] lol we even reboot the host after the first puppet run and then run another
[09:13:41] I'll go ahead with my reimages for now since it is only four hosts total, but yeah definitely an issue
[09:15:11] I can give the diffcheck a try though if there's consensus that's what we want claime
[09:22:41] I'll be honest I'm not a fan of not using the actual puppet mechanics available to build config from fragments etc.
but I assume there's a good reason it was done using a config build script in the first place
[09:22:48] serviceops, All-and-every-Wikisource, Thumbor, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), Patch-For-Review: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (akosiaris) Hi @Xover, I can sense the frustration pretty cle...
[09:23:55] I don't know the history, but yeah I assume so too
[09:25:19] I also assume we have bigger fish to fry than revamping completely the envoy config build process, so I am not opposed to doing it the way you did for blackbox-exporter, I'd just like at least another opinion on it before you or I commit work to it. Makes sense?
[09:25:46] (I can formulate that in the task too for documentation purposes)
[09:25:54] 100% makes sense claime, thank you
[09:29:17] bbiab
[09:53:32] serviceops, All-and-every-Wikisource, Thumbor, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), Patch-For-Review: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (Xover) >>! In T337649#9162778, @akosiaris wrote: > I can sens...
[10:00:51] serviceops, SRE, ops-codfw: mw2444 down - https://phabricator.wikimedia.org/T345884 (Clement_Goubert) I'm putting mw2444 back into `pooled=no` (instead of `pooled=inactive`) so it gets scap updates and stops warning, however I'll wait until we're sure it's stable before actually putting it back in pr...
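Editor's note on the 09:09-09:15 "diffcheck" idea: a minimal sketch, assuming the build script keeps its own record of the last successfully built inputs instead of relying on Puppet's notify/refresh. The log does not show the real build-envoy-config script or the blackbox-exporter approach, so every name here (`fragments_digest`, `needs_rebuild`, `rebuild_if_needed`, the stamp file, the `build` callback) is hypothetical. The stamp is written only after the build succeeds, so a failed first run is retried on the next invocation even when no fragment has changed, and a missing or zero-byte output always forces a rebuild.

```python
import hashlib
from pathlib import Path
from typing import Callable


def fragments_digest(fragment_dir: Path) -> str:
    """Hash the relative names and contents of all fragment files, in sorted order."""
    h = hashlib.sha256()
    for path in sorted(p for p in fragment_dir.rglob("*") if p.is_file()):
        h.update(str(path.relative_to(fragment_dir)).encode())
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def needs_rebuild(fragment_dir: Path, output: Path, stamp: Path) -> bool:
    """True if the output is missing or empty, or the stamp does not match the inputs."""
    if not output.is_file() or output.stat().st_size == 0:
        return True
    if not stamp.is_file():
        return True
    return stamp.read_text().strip() != fragments_digest(fragment_dir)


def rebuild_if_needed(
    fragment_dir: Path,
    output: Path,
    stamp: Path,
    build: Callable[[Path, Path], None],
) -> bool:
    """Run build(fragment_dir, output) when needed; return True if a build ran.

    build is a hypothetical callback that must raise on failure. The stamp is
    only written after it returns, so a failed build is retried next time.
    """
    if not needs_rebuild(fragment_dir, output, stamp):
        return False
    digest = fragments_digest(fragment_dir)
    build(fragment_dir, output)
    stamp.write_text(digest + "\n")
    return True
```

With something along these lines, the script could be invoked on every Puppet run rather than only on refresh, since it is a no-op when the inputs and the last good build agree.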
[12:34:19] serviceops, CX-cxserver, RESTBase Sunsetting, Language-Team (Language-2023-July-September): Make cxserver call parsoid endpoints on MediaWiki, instead of going through RESTbase - https://phabricator.wikimedia.org/T344982 (MSantos)
[12:53:34] serviceops, All-and-every-Wikisource, Thumbor, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), Patch-For-Review: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (akosiaris) >>! In T337649#9162878, @Xover wrote: >>>! In T337...
[14:05:28] serviceops, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q1): Hardcode the SLO time windows in Grafana dashboards generated via Grizzly - https://phabricator.wikimedia.org/T346144 (lmata) adding to quarter for tracking
[14:13:13] serviceops, RESTBase Sunsetting, Code-Health-Objective, Data Products (Sprint 01), Patch-For-Review: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 (VirginiaPoundstone)
[14:16:29] serviceops, RESTBase Sunsetting, Code-Health-Objective, Data Products (Sprint 01), Patch-For-Review: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 (VirginiaPoundstone) p:Low→High
[14:37:27] serviceops, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jclark-ctr)
[14:53:28] serviceops, Data-Platform-SRE, Release-Engineering-Team, GitLab (CI & Job Runners): Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (BTullis) a:BTullis→None Removing myself as the assignee, since it appears that I will need assistan...
[14:54:20] serviceops, Data-Platform-SRE, Release-Engineering-Team, GitLab (CI & Job Runners): Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (dancy) >>! In T346244#9164100, @BTullis wrote: > It appears that I do not have the required rights to creat...
[14:56:24] serviceops, Data-Platform-SRE, Release-Engineering-Team, GitLab (CI & Job Runners): Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (BTullis) >>! In T346244#9164113, @dancy wrote: >>>! In T346244#9164100, @BTullis wrote: >> It appears that...
[14:58:23] serviceops, Data-Platform-SRE, Release-Engineering-Team, GitLab (CI & Job Runners), Patch-For-Review: Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (CodeReviewBot) dancy opened https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-...
[14:58:33] serviceops, Data-Platform-SRE, Release-Engineering-Team, GitLab (CI & Job Runners), Patch-For-Review: Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (CodeReviewBot) dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-...
[15:01:41] serviceops, Data-Platform-SRE, Release-Engineering-Team, GitLab (CI & Job Runners), Patch-For-Review: Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (dancy) Open→Resolved a:dancy You're all set.
[15:50:11] trying to debug an envoy/restbase/wikifeeds issue atm if anyone is about - we migrated wikifeeds to stop using restbase, but now restbase checks are failing. Envoy is returning "upstream connect error or disconnect/reset before headers.
reset reason: connection termination"
[15:50:31] (envoy as in the wikifeeds port on the local service proxy envoy)
[15:50:38] but the service itself is healthy afaict
[15:57:01] hnowlan: yeah, downstream works through curl but not curling localhost:6017
[15:57:04] very strange
[15:57:23] downstream as in https://wikifeeds.discovery.wmnet:4101
[15:57:29] I wonder if it's circuit breaking or some kind of connection reuse that was previously preventing this issue before
[15:58:16] because we haven't touched restbase itself
[15:58:27] we're just not routing to wikifeeds via it any more
[15:58:58] we don't have metrics in grafana for circuit breaking I think although they're not the easiest metrics to sort
[15:59:19] I don't think we have that functionality enabled
[16:00:23] is this change in HTTP status patterns expected? https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?orgId=1&from=now-6h&to=now&viewPanel=13
[16:00:32] https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=restbase&var-origin_instance=All&var-destination=wikifeeds&from=now-3h&to=now&viewPanel=20
[16:00:48] suddenly 3xx increased from 0 to 13
[16:00:54] is this happening because we previously had very active persistent connections that we're now only using for monitoring?
[16:01:29] akosiaris: during the migration the random endpoint was changed by the team to return 301s directly, so that's expected
[16:01:36] ah, ok
[16:02:06] I was trying to add a new feature to eventstreams and I realized that it is still running on stretch
[16:02:09] * elukey cries in a corner
[16:06:00] how do checks get from the restbase schema yaml to icinga checks?
[16:07:37] ah, service-checker
[16:08:27] It's also transient
[16:08:39] I just checked locally a restbase host that was alerting
[16:10:07] I g2g sorry :/
[16:10:47] later!
[16:11:00] serviceops, SRE, ops-codfw: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (Jhancock.wm) Open→Resolved @Vgutierrez faulty disk has been replaced and I see two disks on the server now. returning the bad disk to dell under 783662118185
[16:22:22] it hasn't happened in 20 minutes, and there's been a drop off in error rate, but it's still very high https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=restbase&var-origin_instance=All&var-destination=wikifeeds&from=now-3h&to=now&viewPanel=16
[16:23:52] at this point I think the two options are roll back or remove the check (as it's not checking anything user-facing at this point)
[16:24:58] the right option is to figure out what's going on by finding better metrics but it's getting late
[16:29:28] hnowlan: so, that percentage is in absolute numbers, 0.1 rps?
[16:29:38] and just 503s?
[16:29:42] do I get this right?
[16:30:19] with 200s being 0.3rps?
[16:30:33] yeah that's about right
[16:30:50] smells like just health-checks indeed
[16:30:56] so, I'd say no need to rollback
[16:31:07] that's not a marked increase relative to the original rate of 503s now that I look at it also
[16:31:49] yup
[16:44:00] serviceops, Fundraising-Backlog, SecTeam-Processed: FRUP: Add Applepay verification code to donate wiki - https://phabricator.wikimedia.org/T346055 (greg) p:Low→Triage (resetting prio, looks like a mistake given security is done with this right now)
[16:48:46] serviceops, Fundraising-Backlog, SecTeam-Processed: FRUP: Add Applepay verification code to donate wiki - https://phabricator.wikimedia.org/T346055 (sbassett) >>! In T346055#9164475, @greg wrote: > (resetting prio, looks like a mistake given security is done with this right now) Whoops, yes, sorry!
[17:34:52] serviceops, Data-Platform-SRE, Release-Engineering-Team, GitLab (CI & Job Runners): Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (dancy) >>! In T346244#9164100, @BTullis wrote: > I've been following the instruction here: https://wikitech...