[04:32:37] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Joe)
[05:00:13] serviceops, API Platform, Growth-Structured-Tasks, Image-Suggestions, and 7 others: GrowthExperiments\NewcomerTasks\AddImage\ServiceImageRecommendationProvider::get Unable to decode JSON response for page {title} upstream connect error or disconnect/reset b... - https://phabricator.wikimedia.org/T313973
[09:06:09] <_joe_> jayme: as you can see, my prediction proved right
[09:06:27] _joe_: let me say I'm not surprised :)
[09:06:59] but annoyed, still
[09:11:52] _joe_: do you have some minutes to talk about this?
[09:12:08] <_joe_> jayme: in a few
[09:12:19] ack, ping me when you're ready
[09:16:35] serviceops: Ensure that all appserver-related roles can be cleanly applied on bootstrap - https://phabricator.wikimedia.org/T318671 (Joe)
[09:16:46] serviceops: Ensure that all appserver-related roles can be cleanly applied on bootstrap - https://phabricator.wikimedia.org/T318671 (Joe) p:Triage→Medium
[09:47:34] serviceops: Ensure that all appserver-related roles can be cleanly applied on bootstrap - https://phabricator.wikimedia.org/T318671 (Clement_Goubert) Open→In progress
[09:48:03] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Clement_Goubert)
[09:48:11] serviceops, Parsoid, Patch-For-Review, Performance-Team (Radar), Performance-Team-publish: Parsoid migration to php 7.4 - https://phabricator.wikimedia.org/T312638 (Clement_Goubert) In progress→Resolved
[10:11:26] Should we be looking at using mcrouter rather than nutcracker in the thumbor migration?
[10:16:09] serviceops, GitLab, Release-Engineering-Team, serviceops-collab: Disable email notifications from GitLab replicas - https://phabricator.wikimedia.org/T318682 (Jelto)
[10:25:02] hnowlan: From my limited perspective, probably, but idk how that translates into effort
[10:26:35] not a huge amount I'd guess, we use nutcracker in a pretty simple way and there's plenty of prior art in deployment-charts
[10:26:54] but maybe for the initial lift'n'shift it makes sense to keep nutcracker
[10:29:05] _joe_ and effie may have a more context-aware opinion
[10:30:04] <_joe_> hnowlan: so, on one hand, we want to dismiss nutcracker
[10:30:13] <_joe_> hnowlan: it's used for memcached right?
[10:30:46] <_joe_> but yeah, I would keep the already large number of things that changed to the minimum for now
[10:30:54] <_joe_> and we can switch to mcrouter later
[10:31:25] _joe_: yeah, memcached. sgtm as an approach
[10:41:25] Do we already have a serviceops dedicated project on WMCS? I'm currently doing my pontoon tests on sre-sandbox, but if we want to keep it, maybe a dedicated project would be best.
[10:41:57] <_joe_> claime: good idea
[10:42:08] ok, will make a phab request
[10:42:25] I'll keep on with my tests on sandbox in the meantime, since it's supposed to be rebuildable easily
[10:42:27] :p
[10:46:48] hnowlan: I can help with mcrouter there later on
[10:47:16] effie: great, thanks!
[10:48:51] ping me when it is time, it shouldn't be much work, unless we need to make some puppet stuff a little more abstract
[10:49:01] which can't be that part
[10:49:02] <_joe_> not puppet
[10:49:16] aaah
[10:49:20] lol yes
[10:49:30] <_joe_> it's kubernetes templates from hell
[10:49:32] one less thing to worry about, oh then it is quite easier
[10:49:40] <_joe_> no it is not :D
[10:49:53] <_joe_> as we'd like to keep things similar
[10:50:08] <_joe_> anyways, we can talk about it when it's time
[10:50:20] yeah, either way, yes I'd love to help
[10:52:41] serviceops: Ensure that all appserver-related roles can be cleanly applied on bootstrap - https://phabricator.wikimedia.org/T318671 (Clement_Goubert)
[10:53:11] serviceops: Ensure that all appserver-related roles can be cleanly applied on bootstrap - https://phabricator.wikimedia.org/T318671 (Clement_Goubert) Starting tests in sre-sandbox while a specific WMCS project gets created for this.
[11:44:47] serviceops: Ensure that all appserver-related roles can be cleanly applied on bootstrap - https://phabricator.wikimedia.org/T318671 (fgiunchedi) Thank you for kick starting this! Agreed on the bumpy road (having set up `configcluster` role myself for o11y), please reach out when in doubt and/or hitting roadb...
[11:46:33] serviceops: Ensure that all appserver-related roles can be cleanly applied on bootstrap - https://phabricator.wikimedia.org/T318671 (Clement_Goubert) I'm starting "easy" with `memcached` for now. Once that's working OOTB I'll move on to `configcluster`. Thanks for the offer to help, much appreciated.
[11:48:07] serviceops, GitLab, Release-Engineering-Team, serviceops-collab, Patch-For-Review: Disable email notifications from GitLab replicas - https://phabricator.wikimedia.org/T318682 (Jelto) p:Triage→Medium a:Jelto
[11:50:13] serviceops, SRE, Thumbor, Thumbor Migration, Platform Team Workboards (Platform Engineering Reliability): Replace nutcracker with mcrouter - https://phabricator.wikimedia.org/T318695 (hnowlan)
[12:03:15] serviceops, Release-Engineering-Team, docker-pkg: docker-pkg / docker downloads all versions of parent image upon building - https://phabricator.wikimedia.org/T310458 (hashar) For #serviceops We would need docker-pkg to get a `3.0.3` tag pointing to 66b22ed50 //Release 3.0.3// . That has not been p...
[12:03:53] somehow docker-pkg had a 3.0.3 commit but the tag is missing and the deploy repository hasn't been updated for 3.0.3
[12:22:34] serviceops: Ensure wikimedia::memcached role bootstraps cleanly - https://phabricator.wikimedia.org/T318697 (Clement_Goubert)
[12:23:05] serviceops: Ensure that all appserver-related roles can be cleanly applied on bootstrap - https://phabricator.wikimedia.org/T318671 (Clement_Goubert)
[12:23:07] serviceops, Patch-For-Review: Ensure wikimedia::memcached role bootstraps cleanly - https://phabricator.wikimedia.org/T318697 (Clement_Goubert) Open→In progress
[12:24:39] serviceops, Patch-For-Review: Ensure wikimedia::memcached role bootstraps cleanly - https://phabricator.wikimedia.org/T318697 (Clement_Goubert) p:Triage→Medium
[12:25:48] serviceops: Ensure configcluster bootstraps cleanly - https://phabricator.wikimedia.org/T318699 (Clement_Goubert)
[12:26:22] serviceops: Ensure configcluster bootstraps cleanly - https://phabricator.wikimedia.org/T318699 (Clement_Goubert) Open→In progress p:Triage→Medium
[12:26:24] serviceops: Ensure that all appserver-related roles can be cleanly applied on bootstrap - https://phabricator.wikimedia.org/T318671 (Clement_Goubert)
[12:52:39] It's day two of my service ops journey! claime has graciously offered to guide me a bit. If anyone else wants to point me to some docs/phab tickets/repos or anything LMK. My hours are generally 1400-2300 UTC
[13:13:54] inflatador: I'm about to create two tickets regarding our prometheus scrape config for k8s which might be good tasks to work on (and completable during your stay with us)
[13:17:04] serviceops, Observability-Metrics, Kubernetes: Limit the envoy metrics scraped in k8s - https://phabricator.wikimedia.org/T318705 (JMeybohm) p:Triage→Medium
[13:17:29] serviceops, Observability-Metrics, Kubernetes: Limit the envoy metrics scraped from k8s - https://phabricator.wikimedia.org/T318705 (JMeybohm)
[13:17:55] jayme ACK, sounds good!
[13:32:46] serviceops, Observability-Metrics, Kubernetes: Don't scrape every containerPort for metrics - https://phabricator.wikimedia.org/T318707 (JMeybohm) p:Triage→Medium
[13:33:33] serviceops, Observability-Metrics, Kubernetes: Don't scrape every containerPort for metrics - https://phabricator.wikimedia.org/T318707 (JMeybohm)
[13:35:11] jayme what repo has the prometheus config for k8s?
[13:35:40] inflatador: that's all puppet - in modules/profile/manifests/prometheus/k8s.pp
[13:36:02] ACK, will take a look
[13:36:38] feel free to ask questions, add me as reviewer etc. - go.dog would be a good choice as reviewer as well
[13:36:58] 🚌 :=
[13:42:24] serviceops, SRE: Update conf1* servers - https://phabricator.wikimedia.org/T310062 (JMeybohm) a:akosiaris I think this is done, right?
[13:44:10] inflatador: you may assign the tasks to you when you start working on them. I just did not want to force-assign you :)
[13:46:31] we also have this beauty for someone willing to get exposed to helm/helmfile/k8s services: https://phabricator.wikimedia.org/T310721 if someone wants to take it on, I'm happy to provide guidance :)
[14:09:10] serviceops, Observability-Metrics, Kubernetes: Don't scrape every containerPort for metrics - https://phabricator.wikimedia.org/T318707 (bking) a:bking
[14:09:23] serviceops, Observability-Metrics, Kubernetes: Limit the envoy metrics scraped from k8s - https://phabricator.wikimedia.org/T318705 (bking) a:bking
[14:11:03] jayme np, I grabbed the 2 prometheus tasks
[14:11:14] 👌
[14:12:13] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Joe)
[14:17:52] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Jdforrester-WMF)
[14:56:51] jayme not sure if you're still around, but guessing the envoy config should go here? https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/prometheus/k8s.pp#L180
[14:57:22] inflatador: actually from line 222 somewhere
[14:57:51] the envoy tls-terminator/service proxy has its own prometheus job, called k8s-pods-tls
[15:01:28] inflatador: oh and if you have to look up docs, please make sure to look at prometheus 2.24 :)
[15:01:30] https://github.com/prometheus/prometheus/blob/v2.24.1/docs/configuration/configuration.md
[15:05:32] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[15:06:01] serviceops, Observability-Alerting, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate kubernetes alerts away from icinga - https://phabricator.wikimedia.org/T311251 (JMeybohm) Open→Resolved Changed the alerts from using p99 to using p95, resolving this again.
[15:10:00] jayme cool, can you link me the current dashboard? just curious what the metrics look like now
[15:11:02] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Joe)
[15:20:54] inflatador: the dashboards won't cover the incredible amount of stuff... I fear. But there is one called k8s envoy telemetry (or something along those lines)
[15:21:44] for the full pile of things you can "curl -s -XGET localhost:9631/stats/prometheus" on appservers or k8s pod ips
[15:24:21] "curl -s -XGET 10.64.75.230:9361/stats/prometheus" for example
[15:24:25] serviceops, Patch-For-Review: Ensure wikimedia::memcached role bootstraps cleanly - https://phabricator.wikimedia.org/T318697 (jijiki) While it seems like restarting the daemon after a change, is something we missed, it is actually by design. The reason is simply that we would like to control when each d...
[15:24:33] which is the blubberoid pod currently running in staging
[15:27:05] serviceops, Patch-For-Review: Ensure wikimedia::memcached role bootstraps cleanly - https://phabricator.wikimedia.org/T318697 (Clement_Goubert) That's what I figured, although it ends up with a non-configured `memcached` on bootstrap. I can either dig to find a way to restart it only on the first puppet...
[15:36:53] serviceops, Patch-For-Review: Ensure wikimedia::memcached role bootstraps cleanly - https://phabricator.wikimedia.org/T318697 (Joe) We can probably check in an exec that the socket exists and restart memcached only if it doesn't, or something along those lines.
[15:37:57] <_joe_> claime: we can design an exec to do ^^
[15:38:38] There's probably some sort of systemd dependency that can be used no ?
[15:39:57] <_joe_> it's thorny
[15:40:09] thar be dragons?
[15:40:11] <_joe_> how do you tell it to just do it after package install?
[15:40:16] <_joe_> uhm there is another way
[15:40:30] Oh thorny in that sense, yeah fair
[15:40:30] <_joe_> we can mask the service *before* installing the package
[15:40:39] * jayme is out for the day o/
[15:40:42] Drop the override then install ?
[15:41:07] <_joe_> then puppet will ensure the service runs
[15:41:16] <_joe_> we used a similar pattern for nginx
[15:41:18] <_joe_> IIRC
[15:41:25] (drop in the parachute sense, not the database sense)
[15:41:48] claime: profile/manifests/tlsproxy/instance.pp
[15:42:18] <_joe_> we do exactly what you want basically
[15:42:24] I'll check it as soon as I've tried something for etcd.
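Note on the envoy stats output mentioned at 15:21:44 (for T318705): before deciding which metrics to stop scraping, it can help to see which metric names account for most of the series. Below is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format. The function name `count_samples_by_family` and the sample data are made up for illustration and are not part of any WMF tooling. Histogram `_bucket`, `_sum` and `_count` series are counted under their own names rather than grouped.

```python
from collections import Counter


def count_samples_by_family(exposition_text: str) -> Counter:
    """Count samples per metric name in Prometheus text exposition output."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name_end = len(line)
        for sep in ("{", " "):
            pos = line.find(sep)
            if pos != -1:
                name_end = min(name_end, pos)
        counts[line[:name_end]] += 1
    return counts


# Hypothetical sample input; real envoy output has far more series.
SAMPLE = """\
# TYPE envoy_cluster_upstream_rq counter
envoy_cluster_upstream_rq{envoy_cluster_name="a",envoy_response_code="200"} 12
envoy_cluster_upstream_rq{envoy_cluster_name="a",envoy_response_code="503"} 1
# TYPE envoy_server_uptime gauge
envoy_server_uptime 3600
"""

if __name__ == "__main__":
    # Print the metric names with the most samples first.
    for name, n in count_samples_by_family(SAMPLE).most_common():
        print(f"{n:6d} {name}")
```

To try it on real data, the output of one of the curl commands above could be saved to a file and its contents passed to the function in place of `SAMPLE`.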
[15:42:38] Them kids and their fancy distributed services
[15:52:59] serviceops: Ensure that all appserver-related roles can be cleanly applied on bootstrap - https://phabricator.wikimedia.org/T318671 (nskaggs)
[16:01:22] quick workout, back in ~40
[16:22:58] serviceops, Patch-For-Review: Ensure wikimedia::memcached role bootstraps cleanly - https://phabricator.wikimedia.org/T318697 (Clement_Goubert) As discussed, I went with the approach already used in `modules/profile/manifests/tlsproxy/instance.pp` for `nginx`. Will keep testing it in pontoon, and change...
[16:23:42] * claime afk
[16:31:23] Hello, this is a heads-up that I'm hoping to add a couple of images to the production-images repo over the next week or two. Namely rebuilt spark and the spark-operator images. https://phabricator.wikimedia.org/T318730
[16:35:00] I believe that this is a good use case for using production-images, as opposed to blubber, but I'm happy to take advice, guidance, criticism etc. Feel free to jump in on the ticket if you have any views, or if you'd like to be added for code review.
[16:43:49] back
[17:31:50] lunch, back in ~45
[18:24:22] back
[20:18:23] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (Cmjohnson) @Joe which partman recipe do you need for these?
[21:44:20] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mc-wf1001.eqiad.wmnet with OS bullseye
[21:47:39] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mc-wf1002.eqiad.wmnet with OS bullseye
[22:13:11] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mc-wf1001.eqiad.wmnet with OS bullseye completed: - mc-wf1001 (**PASS**...
[22:17:02] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mc-wf1002.eqiad.wmnet with OS bullseye completed: - mc-wf1002 (**PASS**...
[23:18:23] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (Cmjohnson)
[23:19:07] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (Cmjohnson) Open→Resolved @joe all yours, figured it to be the same partman recipe as memcache