[00:43:18] serviceops, CirrusSearch, MediaWiki-Configuration, MediaWiki-Engineering, Discovery-Search (Current work): Provide a method for internal services to run api requests for private wikis - https://phabricator.wikimedia.org/T345185 (Tgr) >>! In T345185#9314798, @aaron wrote: > I'm reading up on o...
[09:02:13] serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (MoritzMuehlenhoff)
[09:02:24] serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[09:03:46] serviceops, SRE: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (MoritzMuehlenhoff)
[09:13:55] serviceops, SRE: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (MoritzMuehlenhoff)
[09:18:45] serviceops, MW-on-K8s: Incorrect php redirect in some virtualhosts in mw on k8s - https://phabricator.wikimedia.org/T350770 (Joe)
[09:18:55] serviceops, MW-on-K8s: Incorrect php redirect in some virtualhosts in mw on k8s - https://phabricator.wikimedia.org/T350770 (Joe) p:Triage→High
[10:57:22] serviceops, iPoid-Service, Patch-For-Review, Service-deployment-requests, Trust and Safety Product Sprint: New Service Request 'iPoid' - https://phabricator.wikimedia.org/T325147 (jijiki)
[12:44:29] serviceops, CX-cxserver, RESTBase Sunsetting, Language-Team (Language-2023-October-December), Patch-For-Review: Make cxserver call parsoid endpoints on MediaWiki, instead of going through RESTbase - https://phabricator.wikimedia.org/T344982 (daniel)
[13:51:35] serviceops, Data-Engineering, Event-Platform, Patch-For-Review: Increase k8s namespace limits for eventgate-analytics - https://phabricator.wikimedia.org/T350707 (JMeybohm) a:JMeybohm
[14:01:01] jayme: can you offer any tips on how next to troubleshoot this?
[14:01:02] https://phabricator.wikimedia.org/T350713
[14:01:53] serviceops, Prod-Kubernetes, Kubernetes: Move namespace management out of helmfile into a chart - https://phabricator.wikimedia.org/T350783 (JMeybohm)
[14:06:32] ottomata: on the list, didn't have the chance to take a look really
[14:06:59] boils down to: increase loglevel in envoy
[14:07:29] ottomata: https://wikitech.wikimedia.org/wiki/Envoy#Runtime_configuration
[14:08:01] plus get some data from the telemetry dashboards on how often this happens
[14:08:04] and when
[14:13:10] k thanks. telemetry dashboards ... looking...
[14:13:54] they are called "envoy telemetry" and "envoy telemetry (k8s)" I think
[14:14:05] found em, thank you
[14:14:20] this is awesome
[14:14:29] this is why service mesh is cool
[14:15:00] so
[14:15:01] https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=eventgate&var-destination=schema&from=1699366498479&to=1699452898479&viewPanel=17
[14:15:14] and this makes sense, schemas are eventually cached if the req succeeds
[14:15:26] so on deployment, there are lots more requests, and hence lots more failures
[14:16:38] ottomata: btw. could you please add the concurrency/worker things you did to the nodejs stuff into the best practice section of https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limits ?
[14:17:14] yes for sure
[14:25:19] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[14:29:53] serviceops, MW-on-K8s: Incorrect php redirect in some virtualhosts in mw on k8s - https://phabricator.wikimedia.org/T350770 (Joe) Open→Resolved
[14:29:54] jayme: https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limits#NodeJS_service-runner
[14:30:08] also elukey ^ fyi lemme know if that looks correct / make changes at will
[14:33:53] <_joe_> ottomata: I think luca is out
[14:34:23] he is - but looks good to me as is, thanks
[14:39:29] serviceops, iPoid-Service, Patch-For-Review, Service-deployment-requests, Trust and Safety Product Sprint: New Service Request 'iPoid' - https://phabricator.wikimedia.org/T325147 (Marostegui)
[14:42:18] FWIW, the envoy error is happening in eventstreams too: https://grafana.wikimedia.org/goto/fvuc2_VSz?orgId=1
[14:42:20] hm
[14:46:55] serviceops, Data-Engineering, Event-Platform: Increase k8s namespace limits for eventgate-analytics - https://phabricator.wikimedia.org/T350707 (JMeybohm) Open→Resolved
[14:48:14] hm, suspicious: every 5ish minutes? https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=eventstreams&from=1699451286311&to=1699454886311&viewPanel=17&var-destination=All
[15:18:30] is it possible that envoy or the upstream is somehow throttling the requests?
[15:23:00] the every 5 minutes is just eventstreams-internal stream_config_ttl; it re-requests everything every 5 minutes
[15:23:24] but that is sort of similar to what eventgate-analytics-external does every time it boots.
[15:24:19] i just tried to curl the envoy schema endpoint in a loop, no 503s...
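The curl loop mentioned in the last message could be approximated with a sketch like the one below. It is only an illustration under stated assumptions: the function names `fetch_status` and `tally_statuses` are hypothetical, the URL in the usage note is a placeholder, and nothing here is the command that was actually run. It requests the same URL repeatedly and tallies the status codes, so intermittent 503s show up as a count rather than scrolling past.

```python
from collections import Counter
from typing import Callable
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError on 4xx/5xx; the code is what we count.
        # Connection-level failures (URLError) are deliberately not caught.
        return err.code


def tally_statuses(
    url: str,
    attempts: int,
    fetch: Callable[[str], int] = fetch_status,
) -> Counter:
    """Request `url` `attempts` times and count each status code seen."""
    counts: Counter = Counter()
    for _ in range(attempts):
        counts[fetch(url)] += 1
    return counts
```

Usage, with a placeholder address standing in for the local envoy listener: `tally_statuses("http://localhost:PORT/path/to/schema", 500)`. The `fetch` parameter exists so the loop can be exercised without a network.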
[15:43:52] serviceops, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata) I also see these [[ https://grafana-rw.wikime...
[16:07:36] serviceops, Content-Transform-Team-WIP, Maintenance-Worktype, Wikimedia-Incident: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324 (jijiki) >>! In T344324#9239207, @Jgiannelos wrote: > Major upgrades of tegola + dependencies + base d...
[16:20:21] serviceops, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata) Some googling indicates that this could possi...
[16:21:10] jayme: done a little more investigation but i'm stumped for the moment: https://phabricator.wikimedia.org/T350713#9316743
[16:22:00] i don't have a good handle on connection routing between envoy and the upstream
[16:22:14] there is k8s networking stuff i assume, is it possible something in there is terminating the connection?
[16:27:37] serviceops: Package latest version of prometheus-memcached-exporter - https://phabricator.wikimedia.org/T350807 (jijiki)
[19:05:57] serviceops: Replace Nutcracker for Redis (Thumbor, API Gateway, Changeprop) - https://phabricator.wikimedia.org/T333019 (Krinkle)
[20:29:30] serviceops, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata) a:Ottomata
[21:20:48] serviceops, CirrusSearch, MediaWiki-Configuration, MediaWiki-Engineering, Discovery-Search (Current work): Provide a method for internal services to run api requests for private wikis - https://phabricator.wikimedia.org/T345185 (EBernhardson) >> If you want to use the job runner cluster, IMO...
[23:38:00] Heya! Would it be possible to default new gitlab repos to fast-forward merge method?