[01:08:48] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[01:43:23] serviceops, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster
[02:58:42] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster executed with errors: - mw2420 (**FAIL**) - Remove...
[02:58:48] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul) @MoritzMuehlenhoff I am trying to get Buster on those PE R450 it looks like we are missing some drivers. (PERC H745 Controller,) Thanks...
[04:55:51] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Jhancock.wm)
[07:16:39] serviceops, Data-Persistence, SRE, cloud-services-team, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (Marostegui) Option #5 sounds good. We'd need to do a switchover though for that master whenever we reach the row A eqiad switch...
[08:20:31] serviceops, SRE: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (ayounsi) Resolved→Open Reopening this task as the issue is still happening. Thanks to o11y the dashboard has been refreshed and have more informations (TCP flags, source/dest hostnames)....
[08:28:53] serviceops, DBA, Data-Engineering, Data-Persistence, and 9 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Marostegui)
[08:29:09] serviceops, DBA, Data-Engineering, Data-Persistence, and 9 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Marostegui) We'll depool eqiad I would assume? cc @Joe @akosiaris We'd still need to switchover m1 master (we do have m1 databases but I guess we are not s...
[08:36:32] serviceops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (MoritzMuehlenhoff)
[08:38:47] serviceops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (MoritzMuehlenhoff)
[08:41:31] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Marostegui)
[08:55:43] serviceops, ChangeProp, Content-Transform-Team-WIP, Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (akosiaris) >>! In T226931#8594865, @Brycehughes wrote: > @akosiaris How painful would a full cache...
[09:04:00] serviceops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[09:04:40] serviceops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[09:23:57] serviceops, RESTBase, Datacenter-Switchover: Figure out plan for restbase-async w/r database switchover - https://phabricator.wikimedia.org/T285711 (akosiaris) @Joe, should we code restbase-async in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/...
[09:42:01] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (akosiaris) eqiad will still be depooled for this one. The current timeline for repooling eqiad in on March 8th, 1 day after the proposed timeline on this t...
[09:45:44] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Marostegui)
[10:02:02] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Marostegui)
[10:26:58] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Marostegui)
[10:40:20] serviceops, RESTBase, Datacenter-Switchover: Figure out plan for restbase-async w/r database switchover - https://phabricator.wikimedia.org/T285711 (Clement_Goubert) If we do, we need to keep in mind that we're going to keep restbase-async pooled only in codfw for as long as possible/1 week during {T...
[12:10:52] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (MoritzMuehlenhoff) >>! In T326362#8596391, @Papaul wrote: > @MoritzMuehlenhoff I am trying to get Buster on those PE R450 it looks like we are missing some drivers. (PERC H745 Cont...
[13:16:09] _joe_: search breaks rollback of the sre.discovery.datacenter because it's declared as a service without discovery in service.yaml, but there is a discovery record called search for search-https.
[13:16:17] (I've been playing with it in dry-run)
[13:16:48] <_joe_> I'm not sure I understand
[13:17:00] <_joe_> can you paste what you saw in a task?
[13:17:02] * gehel is listening (and not sure I understand either)
[13:17:27] ok gimme a sec
[13:27:21] serviceops: search service breaks rollback of sre.discovery.datacenter cookbook - https://phabricator.wikimedia.org/T329175 (Clement_Goubert)
[13:27:27] _joe_: ^
[13:28:37] gehel: It's got nothing to do with the service itself btw, just its declaration
[13:29:28] I think basically we're assuming in the cookbook that the discovery record name and the service name will be the same, and that's not the case for search
[13:30:01] The cookbook is under development so it's not surprising we run into issues like this
[13:37:50] Going to grab a bite, bbl
[13:40:37] <_joe_> claime: we don't
[13:40:43] <_joe_> (assume that)
[13:41:27] my bad then
[13:42:36] <_joe_> claime: do you have the full output?
[13:42:51] <_joe_> also, why was it rolling back?
[13:43:10] <_joe_> ahhhh *damn*
[13:43:25] <_joe_> did you abort mid-run, right?
[13:43:33] <_joe_> yeah it's obvious what the problem is
[13:43:57] <_joe_> ok at least it's an easy fix after lunch :)
[13:48:54] <_joe_> (self.initial_state only contains data for things we've acted on, DUH)
[13:49:09] <_joe_> so, TLDR I'm an idiot
[14:42:33] _joe_: Yeah, I ctrl-C'd it
[14:42:38] And I was testing the rollback
[14:42:47] (dry-run, obviously)
[14:47:34] serviceops, Thumbor: Incorrect thumbnail being returned by drmrs, eqiad and esams - https://phabricator.wikimedia.org/T328875 (akosiaris) Open→Resolved a:akosiaris >>! In T328875#8590606, @Joe wrote: > The difference between the two groups of datacenters is the swift backend serving them. Fro...
[14:51:13] akosiaris: any chance you have a copy of the incorrect thumbnail from the above ticket? Curious as to whether I can debug why we generated it
[14:53:08] sigh, no, I didn't think of keeping it around
[14:53:52] that being said.. I think it was generated in 2017
[14:53:58] < last-modified: Thu, 14 Sep 2017 15:13:14 GMT
[14:54:20] so, does it even make sense to try and figure why something was generated like that 5+ years ago?
[14:54:45] what's pretty weird btw, was that I couldn't see it in the output of swift list
[14:55:03] that cost me like 15-20 mins before wondering whether swift stat would work
[14:55:26] and also my trust in swift right now has plummeted
[14:56:39] I'll paste the curl -v output in phab in case it helps
[14:58:47] ahh okay
[14:58:56] no worries in that case, missed the date
[14:59:10] hnowlan: https://phabricator.wikimedia.org/P43827
[14:59:28] the diff is pretty big
[14:59:33] ah thanks!
[14:59:44] differences between DCs like that spur mild panic as regards potential drift between thumbor/thumbor-k8s
[14:59:47] as in I can see xkey: File header, I can see server: Thumbor/6.3.2 header
[15:00:08] the old thumb is ... very weird?
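Editor's note on the 13:48:54 remark: a minimal sketch of the general failure pattern it describes, where state is saved only for items already acted on and a rollback after an aborted run then looks up items that were never touched. This is not the real sre.discovery.datacenter cookbook code; the `Switchover` class, its methods, the "pooled"/"depooled" strings and the "appservers" service name are all hypothetical, and whether the actual fix looked like this is an assumption.

```python
class Switchover:
    """Hypothetical stand-in, NOT the real sre.discovery.datacenter cookbook."""

    def __init__(self, services):
        self.services = list(services)
        self.current = {name: "pooled" for name in self.services}
        # Filled in lazily: only services already acted on get an entry.
        self.initial_state = {}

    def depool(self, name):
        # Remember the state we found before changing it.
        self.initial_state.setdefault(name, self.current[name])
        self.current[name] = "depooled"

    def rollback_naive(self):
        # Assumes every service has saved state; fails after an aborted run.
        for name in self.services:
            self.current[name] = self.initial_state[name]  # KeyError if untouched

    def rollback(self):
        # Restore only what was actually changed.
        for name, state in self.initial_state.items():
            self.current[name] = state


# Simulate a run interrupted (Ctrl-C) after the first service only.
run = Switchover(["appservers", "search", "restbase"])
run.depool("appservers")
```

Calling `run.rollback_naive()` at this point raises `KeyError` for the first untouched service, while `run.rollback()` restores only the service that was changed.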
[15:02:13] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster
[15:04:06] hmm. weird Server header on the old one?
[15:07:28] yeah it says ATS, but ... would you expect the original server header to be ATS ?
[15:08:13] ofc the biggest telltale sign is the last-modified thing. 2017.. lol
[15:09:24] btw I mwscript purgeList.php too
[15:09:42] and even action=purge in the image's page
[15:10:01] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul) @MoritzMuehlenhoff thanks for the tip. It did work however i am getting another error . Looks like we need to update the installer. Please let me know when it is done thamk...
[15:10:01] but that 200px thumbnail stayed there until I did a swift delete
[15:13:15] hey claime are you around? some reports in the o11y channel from pheudx that Page Previews graphite metrics broke around the time that https://gerrit.wikimedia.org/r/c/operations/puppet/+/883151 was deployed: https://grafana-rw.wikimedia.org/d/000000340/page-previews?orgId=1&refresh=1m&from=now-7d&to=now
[15:13:34] cdanis: I am
[15:14:00] So that means something is still relying on that huh
[15:14:25] apparently yes
[15:14:29] I'll join you on o11y
[15:14:32] I'm not sure why it's failing in this way
[15:14:38] and not just... using recdns
[15:15:52] That, or if it's on mediawiki, why is it not using the hardcoded ip
[15:21:24] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Jhancock.wm)
[15:22:16] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Jhancock.wm) a:Jhancock.wm→Papaul
[15:27:55] <_joe_> did we just break those or more stuff?
[15:29:55] Ok i got it
[15:29:59] 2023/02/08 15:29:45 [STATSD] Error writing to socket: write udp [2620:0:861:103:10:64:32:23]:45594->[2620:0:861:102:10:64:16:81]:8125: write: connec>
[15:30:15] It's resolving ipv6 and I bet statsd is not listening on ipv6
[15:32:56] huh, there's nothing listening on 8125 on graphite1005
[15:36:30] serviceops, ChangeProp, Content-Transform-Team-WIP, Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (Brycehughes) @akosiaris yeah been in the industry for 15 years. Get both jokes. And understand the...
[15:41:26] serviceops, ChangeProp, Content-Transform-Team-WIP, Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (Brycehughes) @akosiaris I'd also argue that naming things is this the hardest of all problems. But...
[15:50:57] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster executed with errors: - mw2420 (**FAIL**) - Remove...
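Editor's note on the 15:29:59 socket error: the chat only hypothesises that the client resolved an IPv6 address while statsd was reachable over IPv4 (and at 15:32:56 it turns out nothing was listening on 8125 at all). As an illustration of that hypothesis only, here is a minimal sketch of a statsd client that pins resolution to IPv4. The function names and the metric name are hypothetical; this is not the code of the service that logged the error, and nothing in the log says this was the remedy applied.

```python
import socket


def resolve_ipv4(host: str, port: int):
    """Return the first IPv4 (address, port) pair for host, ignoring AAAA records."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_DGRAM
    )
    # getaddrinfo raises socket.gaierror when nothing matches,
    # so the list is non-empty here.
    return infos[0][4]


def send_counter(host: str, port: int, name: str, value: int = 1) -> None:
    """Send one statsd counter ("name:value|c") over UDP, IPv4 only."""
    payload = f"{name}:{value}|c".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, resolve_ipv4(host, port))
```

Because UDP is connectionless, a sender like this would not necessarily notice that nothing is listening on the target port, which is consistent with the metrics silently disappearing from the dashboard.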
[16:00:22] serviceops, Data-Persistence, Discovery-Search, SRE, and 2 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (Clement_Goubert)
[16:01:17] serviceops, Data-Persistence, Discovery-Search, SRE, and 2 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (Clement_Goubert) p:Triage→High
[16:03:21] serviceops, Thumbor: Incorrect thumbnail being returned by drmrs, eqiad and esams - https://phabricator.wikimedia.org/T328875 (Jhawkinson) >>! In T328875#8597942, @akosiaris wrote: > I am gonna resolve this, feel free to reopen though. On what basis? As others explained initially, this isn't about one i...
[16:11:38] serviceops, Thumbor: Incorrect thumbnail being returned by drmrs, eqiad and esams - https://phabricator.wikimedia.org/T328875 (Jonesey95) I performed action=purge, and it did not fix the problem. I think this ticket should be reopened for root cause analysis. This invalid thumbnail was apparently five...
[16:17:27] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (MoritzMuehlenhoff) >>! In T326362#8598021, @Papaul wrote: > @MoritzMuehlenhoff thanks for the tip. It did work however i am getting another error . Looks like we need to update the...
[16:27:48] serviceops, Thumbor: Incorrect thumbnail being returned by drmrs, eqiad and esams - https://phabricator.wikimedia.org/T328875 (TheDJ) >>! In T328875#8598319, @Jonesey95 wrote: > I performed action=purge while trying to troubleshoot this problem, and it did not fix the problem. I also purged, I always d...
[16:39:17] <_joe_> claime: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/887806 fixes your issue I would say
[16:55:06] hrm. who handles the care and feeding of etherpad?
[16:55:55] either it's having problems on its own or i broke it with an attempted import.
[17:09:06] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (RZamora-WMF)
[17:21:48] brennen: maybe ak.osiaris or m.utante based on the edit history of https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org and rants I vaguely remember reading about the delightfulness of that software stack. :)
[17:23:43] Did someone use emojis again ? :'D
[17:47:45] serviceops, DBA, Data-Engineering, Data-Persistence, and 9 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (BTullis)
[17:54:59] serviceops, ChangeProp, Content-Transform-Team-WIP, Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (akosiaris) >>! In T226931#8598125, @Brycehughes wrote: > @akosiaris I'd also argue that naming thi...
[17:55:28] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Trizek-WMF) @Clement_Goubert @LSobanski @thcipriani I'd like to ping translators before the end of this week. Befo...
[18:00:28] serviceops, DBA, Data-Engineering, Data-Persistence, and 9 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (BTullis)
[18:13:12] serviceops, Thumbor: Incorrect thumbnail being returned by drmrs, eqiad and esams - https://phabricator.wikimedia.org/T328875 (akosiaris) >>! In T328875#8598310, @Jhawkinson wrote: >>>! In T328875#8597942, @akosiaris wrote: >> I am gonna resolve this, feel free to reopen though. > > On what basis? As ot...
[18:21:05] serviceops, Thumbor: Incorrect thumbnail being returned by drmrs, eqiad and esams - https://phabricator.wikimedia.org/T328875 (Jonesey95) https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_202#Image_thumbnail_bug was less than a month ago. This sort of question crops up on VPT every...
[18:42:33] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster
[18:48:46] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul) @MoritzMuehlenhoff thank you.
[18:57:18] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster executed with errors: - mw2420 (**FAIL**) - Remove...
[19:12:41] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster
[19:35:22] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul) @MoritzMuehlenhoff this did turn out to be a raid controller/disk issue and not a Debian installer issue. Sorry for the noise.
[19:53:22] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (RobH)
[19:59:33] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (RobH)
[20:04:59] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2420.codfw.wmnet with OS buster completed: - mw2420 (**PASS**) - Removed from Pupp...
[20:11:58] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (RobH)
[20:21:57] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (RobH)
[21:18:14] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2421.codfw.wmnet with OS buster
[21:27:14] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2422.codfw.wmnet with OS buster
[21:42:27] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[21:58:45] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2421.codfw.wmnet with OS buster completed: - mw2421 (**PASS**) - Removed from Pupp...
[22:12:31] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2423.codfw.wmnet with OS buster
[22:25:07] serviceops, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2422.codfw.wmnet with OS buster completed: - mw2422 (**PASS**) - Removed from Pupp...