[01:15:00] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Aklapper)
[01:15:06] serviceops, Datacenter-Switchover: switchdc services cookbook should allow pooling services in both DCs (active/active) - https://phabricator.wikimedia.org/T290919 (Aklapper)
[01:15:12] serviceops, RESTBase, Datacenter-Switchover: Figure out plan for restbase-async w/r database switchover - https://phabricator.wikimedia.org/T285711 (Aklapper)
[01:15:22] serviceops, Data-Persistence, SRE, cloud-services-team, and 3 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (Aklapper)
[01:15:30] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Patch-For-Review: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (Aklapper)
[07:20:11] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[07:31:04] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[07:49:12] serviceops, Icinga, SRE, SRE Observability: High average POST latency for mw requests on api_appserver in codfw on alert1001 - https://phabricator.wikimedia.org/T326544 (Joe) Open→Resolved
[08:43:58] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi)
[08:49:42] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (MoritzMuehlenhoff)
[09:08:13] serviceops, Observability-Alerting: Port mediawiki prometheus-based alerts from icinga to alertmanager - https://phabricator.wikimedia.org/T312764 (fgiunchedi) Open→Resolved a:fgiunchedi This has been completed by @Joe in https://gerrit.wikimedia.org/r/c/operations/puppet/+/885288 (thank you!)
[09:37:45] serviceops, ChangeProp, Content-Transform-Team-WIP, Page Content Service, and 4 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (Jaifroid) Thank you very much for this test patch. Just to say that if it works for Wikivoyage, it...
[09:43:32] any objections to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/886828/ deploy ?
[09:57:23] akosiaris: I suppose it won't fix the pages that have old mobile-sections but only in the future or with a null-edit?
[09:57:53] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (akosiaris) >>! In T327925#8587186, @Marostegui wrote: >>>! In T327925#8587104, @Joe wrote: >> I would suggest that instead of handling individual...
[10:08:46] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover eqiad pooling calendar - https://phabricator.wikimedia.org/T328903 (Clement_Goubert)
[10:09:22] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (Clement_Goubert)
[10:10:14] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (Clement_Goubert) p:Triage→Medium
[10:11:01] serviceops, RESTBase, Datacenter-Switchover: Figure out plan for restbase-async w/r database switchover - https://phabricator.wikimedia.org/T285711 (Clement_Goubert) p:Triage→Medium
[10:11:33] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui) Cool! I am going to repool the hosts then :)
[10:13:46] serviceops, Datacenter-Switchover: switchdc services cookbook should allow pooling services in both DCs (active/active) - https://phabricator.wikimedia.org/T290919 (Clement_Goubert) p:Triage→Medium
[10:13:50] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert) p:Triage→High
[10:37:09] claime: yes, we'll need to re-trigger jobs, but I'll let that be done by interested parties. A null or minor edit should suffice. In any case, let's first actually check that it works
[10:38:11] I don't know enough about how changeprop works to give an informed opinion
[10:42:18] I think nobody does
[10:43:59] lol
[10:55:27] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (Clement_Goubert)
[10:56:28] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): Migrate sre.switchdc.mediawiki to spicerack class API - https://phabricator.wikimedia.org/T328908 (Clement_Goubert)
[10:57:11] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): Migrate sre.switchdc.mediawiki to spicerack class API - https://phabricator.wikimedia.org/T328908 (Clement_Goubert) p:Triage→Low
[11:12:37] serviceops, ChangeProp, Content-Transform-Team-WIP, Page Content Service, and 4 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (akosiaris) Change deploy in all 3 environments (staging, eqiad, codfw). And problem, indeed fixe...
[11:14:27] serviceops, ChangeProp, Content-Transform-Team-WIP, Page Content Service, and 4 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (akosiaris) So, the next big question is whether we remove that allowlist entirely and support mobi...
[11:15:57] worked like a charm
[11:23:51] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): Expose hosts from MysqlLegacyRemoteHosts in spicerack - https://phabricator.wikimedia.org/T328911 (Clement_Goubert)
[11:26:02] serviceops, SRE, SRE-tools, Spicerack, Datacenter-Switchover: Expose hosts from MysqlLegacyRemoteHosts in spicerack - https://phabricator.wikimedia.org/T328911 (Clement_Goubert) p:Triage→Low
[12:38:25] serviceops, Data-Persistence, SRE, Datacenter-Switchover: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (Clement_Goubert) Downtime part dry-runs correctly. I will reopen if I hit issues in the live-test.
[12:38:34] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[12:38:42] serviceops, Data-Persistence, SRE, Datacenter-Switchover: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (Clement_Goubert) Open→Resolved
[12:47:38] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (dcaro)
[12:51:46] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui) I am repooling all the databases since we are going to fully depool codfw for reads.
[14:04:40] ottomata if you're doing any of the flink/k8s stuff today hit me up
[14:09:51] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ssingh)
[14:26:56] serviceops, observability, Patch-For-Review: Create a visual representation of where each service is active from, any given time - https://phabricator.wikimedia.org/T327663 (Volans) Just to add to the available options, listing the services, their A/A A/P status and in which DCs they are pooled is al...
[14:39:44] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (MatthewVernon) If we're "just" depooling codfw it's worth noting we will still need to depool the affected ms-fe* nodes (since mw always tries to...
[14:40:07] serviceops, ChangeProp, Content-Transform-Team-WIP, Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (ssastry) >>! In T226931#8589104, @akosiaris wrote: > So, the next big question is whether we remov...
[14:48:45] serviceops, ChangeProp, Content-Transform-Team-WIP, Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (akosiaris) >>! In T226931#8589594, @ssastry wrote: >>>! In T226931#8589104, @akosiaris wrote: >> S...
[14:58:37] serviceops, observability, Patch-For-Review: Create a visual representation of where each service is active from, any given time - https://phabricator.wikimedia.org/T327663 (Clement_Goubert) Removed references to disc_desired_state from wikitech LVS and SwitchDC docs
[15:19:04] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Jhancock.wm)
[15:19:55] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Jhancock.wm)
[15:22:07] serviceops, ChangeProp, Content-Transform-Team-WIP, Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (Jaifroid) @akosiaris Thank you very much indeed for this fix! It'll make a big difference to us ov...
[15:48:02] serviceops, Kubernetes: Show less diff context by default on helm apply - https://phabricator.wikimedia.org/T326205 (Clement_Goubert) Open→Resolved a:Clement_Goubert
[15:49:02] Hey _joe_ regarding https://phabricator.wikimedia.org/T271184: Is <1% of events failing to be processed by changeprop something totally unexpected? It looks like only a few pages failed to be purged from the null_edit
[15:50:00] <_joe_> nemo-yiannis: I have no idea, but if changeprop in general lost ~ 0.1% of messages, we would have noticed via the jobqueue
[15:50:16] <_joe_> we need a smoking gun but sadly I doubt we have logs proving what happened
[15:52:46] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[15:53:24] <_joe_> nemo-yiannis: how does restbase determine a request is for purging?
[15:54:54] It's done with an internal request to restbase with `cache-control: no-cache` header
[15:55:59] By internal I mean directly to restbase cluster, not through varnish
[16:04:26] inflatador: o/ i'm going to be distracted this week by annual planning stuff. I don't have a lot of flink k8s stuff to work on at the moment. I want to work on https://phabricator.wikimedia.org/T328925, maybe you could help there? although that might be more enrichment job specific.
[16:04:57] serviceops, Maps: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 (JMcLeod_WMF)
[16:15:57] serviceops, Kubernetes, User-jijiki: Deploy kube-state-metrics - https://phabricator.wikimedia.org/T264625 (JMeybohm) @akosiaris discovered recently (more or less by accident) that we're overcommitting CPU by quite a bit on wikikube clusters. With kube-state-metrics we should be able to make this mor...
[16:24:12] serviceops, ChangeProp, Content-Transform-Team-WIP, Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (MSantos) a:MSantos→akosiaris Changing assignee to reflect reality.
[16:45:52] serviceops, Infrastructure-Foundations, SRE, SRE-tools, and 2 others: Expose hosts from MysqlLegacyRemoteHosts in spicerack - https://phabricator.wikimedia.org/T328911 (Clement_Goubert)
[16:49:57] mbsantos: so, what's your opinion regarding https://phabricator.wikimedia.org/T226931 ? I can easily add wiktionary to the regex, but would that suffice? Or are we going to open a can of worms ?
[16:58:59] serviceops, Service-deployment-requests: New Service Request 'security-api' - https://phabricator.wikimedia.org/T325147 (akosiaris) @STran Hi, should we move this forward? I think as @Joe says, we'll need to sync up a bit to see how to best move forward with deployment on our platform.
[17:21:59] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (jbond)
[17:41:21] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Trizek-WMF) Task follow-up: * Tech news announcement: https://meta.wikimedia.org/w/index.php?title=Tech/News/2023/0...
[17:41:46] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Trizek-WMF) Open→In progress
[17:41:50] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Trizek-WMF)
[17:42:20] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Clement_Goubert) Apart from multi-DC, the other possibly notable thing is that a Gitlab switchover will also be per...
[17:43:27] .26
[17:54:31] serviceops, Data-Persistence, SRE, cloud-services-team, and 3 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (Krinkle)
[18:13:37] serviceops, Thumbor: Incorrect thumbnail being returned by drmrs, eqiad and esams - https://phabricator.wikimedia.org/T328875 (TheDJ)
[18:22:31] serviceops, Thumbor: Incorrect thumbnail being returned by drmrs, eqiad and esams - https://phabricator.wikimedia.org/T328875 (Joe) The difference between the two groups of datacenters is the swift backend serving them. From what I understand, a bad thumbnail is stored in the eqiad swift datastore.
[18:53:57] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[18:59:06] serviceops, Service-deployment-requests: New Service Request 'security-api' - https://phabricator.wikimedia.org/T325147 (STran) Yes I'm so sorry! This slipped my mind. I wrote up some internal documentation for the team that might be useful in this case: https://docs.google.com/document/d/1CqnWfwhjiEoQMK...
[19:14:28] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Trizek-WMF) It is worth communicating anything that disturbs one's habits. :) Better safe than sorry!
[19:39:37] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (RZamora-WMF)
[19:39:59] ottomata ACK, will get eyes on that
[19:40:14] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (RZamora-WMF)
[22:11:16] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[22:21:36] serviceops, Performance-Team: Rewrite mw-warmup.js in Python - https://phabricator.wikimedia.org/T288867 (RLazarus) Declined→In progress a:RLazarus I'm working on this. @Volans @Joe We talked about building this into Spicerack, but there's one complication: as-is, it runs on the maintenance...
[22:25:53] serviceops, Performance-Team: Rewrite mw-warmup.js in Python - https://phabricator.wikimedia.org/T288867 (Volans) >>! In T288867#8591323, @RLazarus wrote: > I'm inclined to start out by doing the first thing -- rewrite `warmup.js` into `warmup.py` to start with, and keep it on the maintenance host, in pa...
[22:51:55] serviceops, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e0e96453-af13-467f-a75e-ebd1c4122a32) set by bking@cumin2002 for 1 day, 0:00:00 o...
[23:18:27] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)