[08:59:25] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10Maps (Maps-data), 10WMDE-TechWish-Sprint-2022-10-26, 10WMDE-TechWish-Sprint-2022-11-09: Resync stale maps postgres replicas - https://phabricator.wikimedia.org/T321885 (10awight) Amazing, thank you! I can confirm that the postgres replica sync lag graphs have dr...
[09:01:47] 10serviceops, 10Maps (Maps-data), 10Patch-For-Review, 10WMDE-TechWish-Sprint-2022-10-26: Look into the replica sync fails - https://phabricator.wikimedia.org/T321722 (10awight) Kicking this out of the Tech Wishes projects, since the stale data on codfw is now resolved, and it looks like a more stable, long...
[09:09:22] Morning :)
[09:24:23] hi folks :)
[09:24:54] before https://gerrit.wikimedia.org/r/c/operations/puppet/+/858995 I'd like to do some more tests (just to be sure) on staging-eqiad - I am going to restart some pods in there if you are ok
[09:25:50] good morning and go ahead :)
[09:34:39] thanks! Tested with eventgate and coredns, so far it looks fine
[09:34:56] jayme: proceeding with https://gerrit.wikimedia.org/r/c/operations/puppet/+/858995/ then
[09:35:20] ack
[10:19:04] docker-registry.discovery.wmnet/pause:3.6-1 published!
[10:19:34] \o/
[10:19:39] (before that I verified /etc/default/kubelet on all nodes via cumin, the k8s_116 tag has been pushed correctly)
[10:20:26] not sure if we want to test it on staging nodes
[10:20:34] or later on in k8s 1.23 tests
[10:21:14] I think it's fine if we test it when the first cluster is 1.23
[10:21:19] super
[10:21:25] I will use it in my pontoon cluster as well as of now
[10:21:33] closing the task then
[10:21:44] 👌
[10:22:19] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10elukey)
[10:23:41] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10elukey)
[11:31:47] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Clement_Goubert) >>! In T320403#8407057, @Daimona wrote: >>>! In T320403#8405891, @Clement_Goubert...
[11:57:09] 10serviceops, 10Release Pipeline, 10Wikimedia-Portals, 10Release-Engineering-Team (Seen): Migrate www.wikipedia.org (and other www portals) to be its own service - https://phabricator.wikimedia.org/T238747 (10Joe) I have a fundamental question here: how do we plan to set up url routing to this secondary se...
[12:04:11] hnowlan nemo-yiannis shall we take a quick look at https://phabricator.wikimedia.org/T321722#8405630 ?
[12:13:07] I am around if I can be of any help
[12:14:02] effie: yep sure!
[12:14:12] My understanding from what we saw on Friday with Effie is that for some reason 1005 hasn't received the latest WAL record and is stuck in the initial one. Maybe there was some connectivity issue when sync started
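For context on the checks that follow, a minimal sketch of how the two ends of streaming replication can be compared, assuming PostgreSQL 10-12 (which matches the pg_last_wal_receive_lsn()/received_lsn names used later in this log). The first query would run on the primary, the second on the suspect replica (maps1005 here):

  -- On the primary: one row per connected standby, with the WAL position sent
  -- and the positions each standby has written, flushed and replayed.
  SELECT application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
    FROM pg_stat_replication;

  -- On the replica: the WAL receiver process. If received_lsn never advances
  -- past receive_start_lsn, the receiver is stuck at the position it started
  -- streaming from. (received_lsn is called written_lsn from PostgreSQL 13 on.)
  SELECT status, receive_start_lsn, received_lsn, last_msg_receipt_time, slot_name
    FROM pg_stat_wal_receiver;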
[12:14:37] the fact that a restart fixed it at least means the slots are somewhat working
[12:14:39] or the WAL message was dropped because 1005 was restarting
[12:16:49] maps1005 is still lagging from what I see in Grafana
[12:17:06] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Clement_Goubert) >>! In T320403#8405586, @Clement_Goubert wrote: > @Daimona I can deploy it on pro...
[12:17:10] hnowlan: we restarted on Friday, but to no avail https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=15&from=now-7d&to=now
[12:17:14] ohh
[12:17:23] that is the ohhh part, yes
[12:18:06] https://phabricator.wikimedia.org/P40229 these are new errors, not sure where they're coming from
[12:18:18] these are my queries
[12:18:35] the functions were not available in our pg version
[12:18:57] ah
[12:20:25] we can always do another sync, do some random replica restarts, and see if the problem comes back
[12:21:27] ideally we don't need to do another sync, it's not very far behind
[12:21:43] it is just 1+ MB
[12:22:03] can someone who is more pg savvy than me add something on the master
[12:22:10] Maybe if I create a temp table or something
[12:22:11] so we can check if that is replicated on 1005?
[12:22:11] ?
[12:22:15] sure sure
[12:22:33] ok I can do that
[12:22:55] replication is happening in the whole DB right? not specific tables
[12:23:09] in theory, yes
[12:24:46] I tested replication when we first set it up with a test table, and it was successful
[12:25:04] but maps1005 is definitely behind, the received_lsn is the same as the receive_start_lsn
[12:25:58] (whereas received_lsn has changed on all other nodes)
[12:27:31] hnowlan: replication was prolly working alright on the 16th when we set it up on eqiad
[12:31:14] wait, did something happen? replication lag is fixed on 1005
[12:32:17] akosiaris ran SELECT pg_switch_wal();
[12:33:32] nemo-yiannis had a theory on Friday as to what is wrong
[12:34:38] My theory is pretty much what I mentioned before: because of the restart, 1005 was stuck in a previous LSN, and because imposm is disabled there was no activity to pick up a new WAL location
[12:37:53] I will update the task as to what unstuck it from that state, and let's come back to this when something else happens
[12:45:03] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Stop spamming SAL with helmfile on scap deployments - https://phabricator.wikimedia.org/T323296 (10Clement_Goubert) 05In progress→03Resolved
[12:45:15] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert)
[13:00:24] 10serviceops, 10Maps (Maps-data), 10Patch-For-Review, 10WMDE-TechWish-Sprint-2022-10-26: Look into the replica sync fails - https://phabricator.wikimedia.org/T321722 (10jijiki) >>! In T321722#8405630, @jijiki wrote: > Even though we enabled replication-slots, with @Jgiannelos we noticed that maps1005 is be...
[13:01:12] 10serviceops, 10Maps (Maps-data), 10Patch-For-Review, 10WMDE-TechWish-Sprint-2022-10-26: Look into the replica sync fails - https://phabricator.wikimedia.org/T321722 (10jijiki) @awight mind if we close this task and revisit if we are having replication issues?
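Putting together the test-table idea and the pg_switch_wal() call discussed above, a rough sketch of that verification, assuming PostgreSQL 10+ (the table name repl_canary is made up for illustration):

  -- On the master: generate a little WAL, then force a segment switch so the
  -- standbys have something new to stream.
  CREATE TABLE repl_canary (id int, noted_at timestamptz DEFAULT now());
  INSERT INTO repl_canary (id) VALUES (1);
  SELECT pg_switch_wal();        -- pg_switch_xlog() on 9.x
  SELECT pg_current_wal_lsn();   -- the position the replicas should now reach

  -- On the replica (maps1005): the table should appear and the replayed
  -- position should catch up to the value noted above.
  SELECT count(*) FROM repl_canary;
  SELECT pg_last_wal_receive_lsn() AS receive, pg_last_wal_replay_lsn() AS replay;

  -- Clean up on the master once satisfied.
  DROP TABLE repl_canary;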
[13:47:38] so, yeah, as I said to effi.e, state on maps1005 was totally dumbfounding to me. Running:
[13:47:39] template1=> SELECT pg_last_wal_receive_lsn() AS receive, pg_last_wal_replay_lsn() AS replay;
[13:47:39]    receive    |    replay
[13:47:39] --------------+--------------
[13:47:39]  E92/57000000 | E92/57111748
[13:48:14] shows something completely counterintuitive, that is that more WAL bytes have been replayed than have been received
[13:48:22] which is ... not possible?
[13:48:47] on maps1009 btw, the pg_stat_replication table was consistently showing E92/57111748 for the respective fields
[13:49:06] so it appears that somehow pg_last_wal_receive_lsn() was stuck on a previous counter value
[13:49:37] anyway, forcing a new WAL segment on the master with SELECT pg_switch_wal(); had the replicas pick it up and update all their internal structures
[13:50:01] something to keep in mind in case it happens again
[13:58:38] 10serviceops, 10Maps (Maps-data), 10Patch-For-Review, 10WMDE-TechWish-Sprint-2022-10-26: Look into the replica sync fails - https://phabricator.wikimedia.org/T321722 (10awight) 05Open→03Resolved a:03awight >>! In T321722#8409135, @jijiki wrote: > @awight mind if we close this task and revisit if we a...
[14:22:24] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Daimona) >>! In T320403#8408889, @Clement_Goubert wrote: >>>! In T320403#8407057, @Daimona wrote:...
[14:44:27] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Clement_Goubert) >>! In T320403#8409372, @Daimona wrote: > That's a very good question; I'm not 10...
[15:49:03] akosiaris: nice, thanks
[15:59:32] 10serviceops: decommission wtp10[25-48] - https://phabricator.wikimedia.org/T307220 (10Clement_Goubert)
[16:45:12] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Daimona) >>! In T320403#8409477, @Clement_Goubert wrote: >>>! In T320403#8409372, @Daimona wrote:...
[17:01:09] lil thumbor fix for metrics https://gerrit.wikimedia.org/r/859106
[17:42:27] 10serviceops, 10SRE: Add `supervised` option to redis configuration - https://phabricator.wikimedia.org/T212102 (10jijiki)
[17:45:14] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, and 2 others: Encoding issues when handling unicode characters in filenames - https://phabricator.wikimedia.org/T323114 (10hnowlan) This appears to be fixed. The issue relates to us calling Tornado's `set_header` with a string that contains non-ascii cha...
[17:51:51] 10serviceops, 10SRE, 10observability, 10Patch-For-Review, and 2 others: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10jijiki) 05Open→03Resolved We haven't had any issues caused due to high memcached traffic for quite a long time. Our measures (gutter pool, o...
[18:10:10] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, and 2 others: Encoding issues when handling unicode characters in filenames - https://phabricator.wikimedia.org/T323114 (10Vlad.shapik) >>! In T323114#8410298, @hnowlan wrote: > This appears to be fixed. The issue relates to us calling Tornado's `set_hea...
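As a footnote to the LSN comparison pasted at 13:47 above, replication lag can also be expressed directly in bytes rather than eyeballed from raw LSNs; a sketch, assuming PostgreSQL 10+ (the function is pg_xlog_location_diff() on older versions):

  -- On the primary, one row per standby: how many bytes of WAL it still has to replay.
  SELECT application_name,
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
    FROM pg_stat_replication;

  -- On a replica: received minus replayed. On maps1005 this would have come out
  -- negative while pg_last_wal_receive_lsn() was stuck at the old counter value.
  SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn());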