[05:27:25] marostegui: good morning, I think we should rethink the RC table altogether T307328
[05:27:25] T307328: Scalability issues of recentchanges table - https://phabricator.wikimedia.org/T307328
[05:27:46] Amir1: sure, absolutely!
[09:35:53] sigh. ms-be1059 won't power on
[09:36:54] the iLO log shows the power on command being issued
[09:39:20] it thinks both power supplies are working
[09:40:12] let's try resetting the iLO again
[09:48:16] sigh, not the iLO is unresponsive on ssh or https :(
[09:48:19] s/not/now
[09:52:51] pingable, but also unresponsive to ipmitool.
[09:53:16] So I think I have no other option but to ask the DC folk to do a cold power drain and see if it comes back more usefully?
[09:58:42] if ilo isn't back yet then yes IME power drain is the next step Emperor
[10:01:08] Emperor: anything here that can help you? https://wikitech.wikimedia.org/wiki/Management_Interfaces
[10:01:16] * volans was never here
[10:01:58] ghost of volans: I ran through the reset management card options, no joy
[10:02:20] (and the host itself is off and I couldn't power it up from the iLO)
[10:02:50] hw is never a joy, if all else fails dcops is your best chance at this point, sorry
[10:02:58] ...the eqiad cluster upgrade is going _really_ badly.
[10:03:16] yeah, I've opened T307667 to ask them nicely to turn it off and on again
[10:03:17] T307667: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667
[10:04:04] [only 3 backends done, and now it's blocked again until ms-be1059 is up once more]
[10:04:32] sigh, any joy re: the disk ordering ?
[10:06:42] godog: broadly, no. Sometimes it comes right enough after <=3 reboots, sometimes not. Or it's wrong in the installer, and a hdd gets reformatted as if it were an SSD.
[10:07:24] in codfw I've got to slightly-newer kit, and it _seems_ a bit more inclined to come up right with fewer reboots.
[10:09:36] Emperor: ack, thanks for the update
[10:11:07] I can't recall bumping up against "wrong in the installer" but yeah that's a tricky one
[10:12:15] I've seen the ssds anywhere between a and d
[10:14:18] for even more context, over the course of normal operations the wrong ordering isn't a practical problem; it is, though, when replacing disks and puppet comes around to re-label
[10:15:28] with at least some shufflings puppet won't run after the reimage, either.
[10:15:58] (some it seems not to worry about, I've not bottomed out the difference; the SSD ordering seems particularly important to get right, hdds not so much)
[10:18:27] yeah, because after reimage puppet will want to mkfs the SSDs but not the HDDs, in normal circumstances at least
[10:18:28] and there's an element of constantly firefighting the state of both ms clusters so that it's possible to make any progress on reimaging
[10:22:03] indeed there's that too
[10:22:05] * godog lunch
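A minimal sketch of the sort of out-of-band IPMI checks being attempted above, assuming a reachable iLO/BMC; the management hostname, IPMI user and password file are placeholders rather than the real ms-be1059 details, and on a controller as unresponsive as described here they will simply time out:

    # Hypothetical out-of-band checks via IPMI; hostname, user and password file
    # are placeholders for illustration only.
    MGMT=ms-be1059.mgmt.example.wmnet
    IPMI="ipmitool -I lanplus -H ${MGMT} -U root -f /root/.ipmi_pass"

    $IPMI chassis power status    # is the chassis on or off, as far as the BMC knows?
    $IPMI sel list | tail -n 20   # recent System Event Log entries (PSU / power faults)
    $IPMI chassis power on        # ask the BMC to power the host on
    $IPMI mc reset cold           # cold-reset the management controller itself

If none of those get an answer, the remaining option is the one taken here: a task (T307667) for DC-ops to physically drain power and try again.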
[10:39:19] Emperor: do you remember who you talked about maps with? I want to report an issue, but cannot even remember who maintains that
[10:41:34] err, it was in this channel.
[10:41:39] let me see if scroll helps
[10:42:32] nemo-yiannis I think
[10:42:42] hey
[10:43:00] nemo-yiannis: hi - jynus would like to talk maps, I think :)
[10:43:08] nemo-yiannis: while debugging an outage, I ran into a high rate of maps errors
[10:43:48] let me just show you the dashboard link and you (or your team) can hopefully proceed the best way
[10:44:14] https://logstash.wikimedia.org/goto/ed03187f6844130a81a5552ec58c3fb7
[10:45:01] it was significant enough to show on the global request error rates, that is why I researched it further
[10:45:10] it started at 3am UTC today
[10:46:13] it is like 6/7ths of total production errors, so it seemed significant enough to report
[10:49:07] was this a swift outage ?
[10:49:50] so there was an outage, but that is unrelated
[10:49:55] ok
[10:49:58] the issue started before and continues
[10:50:23] the only reason I am reporting here is because I was asking Emperor, as he may have worked with you
[10:50:44] and I didn't know who to contact
[10:51:00] I can create a task with everything I know, but the summary is that link :-)
[10:51:26] 6 out of 7 of the 5XX production errors we are generating are from maps
[10:51:33] at the moment
[10:52:44] Can you file a ticket with the details so I can share it with my team ?
[10:52:50] sure
[10:53:11] sorry, I just didn't know who to report it to
[10:53:17] nemo-yiannis: your team is?
[10:53:28] content transform team
[10:53:35] thanks, filing it now
[10:53:35] we also have a maps project on phabricator
[10:53:42] will add both tags
[10:54:12] there is certainly something fishy ongoing, seeing other dashboards: https://logstash.wikimedia.org/goto/ed03187f6844130a81a5552ec58c3fb7
[10:54:22] ignore that link
[10:54:37] I wanted to share this: https://grafana.wikimedia.org/d/000000503/varnish-http-errors?orgId=1&viewPanel=30&from=1651715947816&to=1651744177878
[10:54:59] arg, I meant: https://grafana.wikimedia.org/d/W67hDHEGz/maps-performances-jgiannelos?orgId=1&viewPanel=1&from=1651143235395&to=1651748035395
[11:05:50] I think I have an idea of what could be wrong, I will give some updates on the ticket
[11:05:54] checking now
[11:06:17] https://phabricator.wikimedia.org/T307671
[11:08:00] let's move the conversation to the ticket - it is off-topic on this channel, sorry
[11:13:44] I am having trouble understanding https://mariadb.com/docs/reference/mdb/system-variables/gtid_domain_id/, help please? :-) was that setting added in 10.2, or is that just how far the history goes on the docs these days?
[11:15:15] taavi: No, it was done in 10.1
[11:15:52] taavi: https://phabricator.wikimedia.org/T149418
[11:17:21] great, thank you!
[11:23:42] https://phabricator.wikimedia.org/T301993#7906421
[11:24:23] taavi: Yeah, it is mostly used for multi-source
[11:24:30] But it doesn't hurt to have it set
[11:43:21] Amir1 taavi found it: https://phabricator.wikimedia.org/T307501#7906462
[11:46:16] Yeah makes sense.
[12:11:58] jynus: I have created a task for dbproxy hosts, not sure if you want a task for dbprov* and backup* or if you prefer to use https://phabricator.wikimedia.org/T307668
[12:20:39] taavi: I just posted on your task, but saying it here too: don't go for 10.6 yet, I am still testing it
[12:20:51] 10.4 should be safe to go to from 10.1 (if you don't have any tokudb - which I believe you don't)
[12:21:08] marostegui: I am good with the original task, thank you
[12:21:28] cool
[12:22:39] marostegui: ok, thanks! no tokudb as far as I can see
[12:22:49] then 10.1 -> 10.4 is what we did in prod
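As an aside, a rough sketch of the two MariaDB checks being discussed in the scrollback above: what gtid_domain_id is currently set to, and whether any TokuDB tables exist before planning a 10.1 -> 10.4 jump. The socket path and the example domain id value are assumptions for illustration, not values taken from this conversation:

    # Illustrative MariaDB checks; socket path and domain id are placeholders.
    SOCK=/run/mysqld/mysqld.sock

    # Current gtid_domain_id (and server_id, for comparison) on this instance
    mysql -S "$SOCK" -e "SELECT @@global.gtid_domain_id, @@global.server_id"

    # It is dynamic, so it can be set at runtime as well as under [mysqld] in
    # my.cnf; 42 here is only an example value, not a recommendation.
    mysql -S "$SOCK" -e "SET GLOBAL gtid_domain_id = 42"

    # Any TokuDB tables that would complicate a 10.1 -> 10.4 upgrade?
    mysql -S "$SOCK" -e "SELECT table_schema, table_name FROM information_schema.tables WHERE engine = 'TokuDB'"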
[13:30:05] grub-install failed on ms-be2051 (which probably means disk mis-ordering means it's re-partitioned one of the hdds)
[13:30:44] sending it round again...
[13:30:59] [IWBNI the installer was a bit cleverer about "use the SSDs", FAI can be made to do this]
[13:46:18] no joy again, and it picked another hdd to write over :(
[13:46:51] godog: did you have any joy with (re-)imaging the HP swift servers before?
[13:51:16] Emperor: I can't recall for sure any glaring problems tbh, but it's been a couple of years I'd say
[13:51:49] Emperor: I take it from scrollback that disk ordering is hitting again ?
[13:51:59] having a third go at reimaging the first HP system ms-be2051
[13:52:41] AFAICT, what's happening is that the wrong drive appears as sdb, gets its partition table scribbled all over, grub-install fails, joy is unconfined in the lower bound
[13:53:27] (and all the scribbled-upon drives will need manual repair and then a wait for swift to backfill)
[13:55:09] sigh, yeah I can't recall that particular failure mode at d-i time, sorry
[13:57:04] third time round the install finished OK, so maybe it got the right drives this time...
[14:32:06] right, install is OK, now swift can backfill sd{c,f}1
[15:54:49] both, ms-be1059 properly hosed (it won't turn on at all, DC team tell me), going to need a bunch of work
[15:54:59] bother, even.
[16:29:30] Emperor: trying to help - I have 1 host that I was going to use to expand backups, but I haven't yet
[16:30:05] would it help if I "borrowed it to you" for a couple of months so as to unblock you?
[16:30:32] when I say 1 host, I mean a (I think) 170TB, swift-sized host
[16:31:55] I mean, if this host is totally nadgered, I could pull it from the rings and wait for swift to rebalance, but that's pretty disruptive (and time-consuming for the rebalance).
[16:32:42] don't know much about that, so maybe this could help?
[16:34:10] I don't think so - if it's just a capacity thing, I could bring one of the new be nodes into service.
[16:34:16] ok
[16:34:26] just trying to help in any way I know
[16:36:26] Thanks, appreciated :)
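Finally, a small sketch of the kind of post-install sanity check implied by the disk-ordering saga above: confirming which device names the kernel has given the SSDs before puppet comes round to re-label or mkfs anything. The expectation of exactly two SSDs per backend is an assumption made here for illustration, not a statement about how the real ms-be hosts are laid out:

    # List each whole disk with its rotational flag (1 = spinning HDD, 0 = SSD),
    # size and model, so a mis-ordered SSD stands out at a glance.
    lsblk -d -o NAME,ROTA,SIZE,MODEL

    # Hypothetical check: warn unless exactly two non-rotational devices are present.
    ssds=$(lsblk -dn -o NAME,ROTA | awk '$2 == 0 {print $1}')
    echo "Non-rotational devices: ${ssds}"
    if [ "$(echo "${ssds}" | wc -w)" -ne 2 ]; then
        echo "WARNING: expected exactly 2 SSDs, check device ordering" >&2
    fi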