[06:47:16] XioNoX, topranks: for today's switch maintenance, it is 14:00 UTC, were you aware that EU would change to summer time? (asking because it caught me by surprise during a db switchover I had scheduled a few weeks ago for yesterday)
[06:48:41] marostegui: 14:00 UTC. It was set to 14:00 UTC before the timezone change and got carried over because of sprint week. So better stick to what's in the task to reduce confusion
[06:49:23] XioNoX: yeah, what I am saying is: just checking you are aware that 14:00 UTC now is a different local time than when it was originally scheduled :)
[06:49:35] Honestly I hadn't factored in the time change until this week. But +1, no need to change it.
[06:49:49] To be clear, I am not asking for a change, I am just raising awareness :)
[06:49:50] next ones are set to 13:00
[06:49:55] roger
[06:49:58] marostegui: to what time do you want to change it?
[06:50:01] jk :)
[06:50:04] XD
[06:50:27] but thanks for the ping!
[09:21:56] hey folks, we started receiving an increased number of phab tickets for alerts, and we can't explain why this increased rate is happening, context is T333315, does that ring any bells?
[09:21:56] T333315: WMCS: hundred of phabricator tickets were created for some alerts - https://phabricator.wikimedia.org/T333315
[09:28:17] I think this might have something to do with sprint week (last week) efforts to migrate alerts from icinga to AM. cc jbond, godog who were mostly involved
[09:29:31] *nod* will take a look volans arturo
[09:29:37] thanks godog volans
[09:30:32] I'm guessing everything tagged wmcs gets a ticket and now we are tagging more generic alerts with team tags. Happy to help godog but I'll likely just slow you down ;)
[09:31:15] but yes the long and short of it is what volans said, we've migrated "systemd unit failed" from icinga to alertmanager and send out the alerts per-team
[09:31:37] godog: while at it, could you please add me to https://phabricator.wikimedia.org/project/profile/13/ so I can mass-edit all those tasks?
[09:32:06] arturo: sure, {{done}}
[09:32:39] godog: thanks!
[09:33:07] arturo: I'm guessing you (team) would rather not receive tasks for systemdunitfailed alerts in this case? i.e. just show up on alerts.w.o?
[09:33:30] godog: I think so, yes, at least until we get a sense of the rate of the events
[09:35:02] arturo: *nod* I'll ack the systemdunitfailed alerts for team=wmcs for now
[09:35:16] thanks!
[09:36:09] sure np
[09:36:30] hmm do you know if we can powercycle a server without access to the mgmt console?
[09:37:00] vgutierrez: I think you can use ipmitool from the cumin servers
[09:37:12] vgutierrez: redfish
[09:38:04] vgutierrez: define "without access", which part doesn't work? ping, ssh, webUI, redfish API, remote IPMI
[09:38:14] volans: I can't ssh to cp2035.mgmt
[09:38:39] and it's having some IPMI issues as well
[09:38:46] doesn't respond to ping
[09:38:52] or htps
[09:38:57] or https
[09:39:23] vgutierrez: the host is up, have you tried to reset the BMC?
[09:39:40] https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
[09:39:47] nope, no action besides depooling it so far
[09:39:49] thx
[09:42:25] that failed as well
[09:42:34] ipmi_cmd_cold_reset: driver timeout
[09:43:53] vgutierrez: you might need papaul then...
[09:46:10] Silly question, if the host is up and we can SSH in, can't we just `shutdown -r now`?
[09:46:52] that won't trigger a power cycle, just a soft reboot?
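For reference on the power-cycle discussion above, here is a minimal sketch of what a remote power cycle and BMC reset can look like with plain ipmitool run from a cumin host. The credential handling via the IPMI_PASSWORD environment variable and the exact mgmt hostname are assumptions for illustration; the canonical procedure is the wikitech Management_Interfaces page linked above.

```
# Hypothetical sketch: remote IPMI against cp2035's management controller from a cumin host.
# Credentials and hostname are illustrative; follow the wikitech procedure in practice.
export IPMI_PASSWORD='...'   # -E makes ipmitool read the password from this variable

# Check the current power state
ipmitool -I lanplus -H cp2035.mgmt.codfw.wmnet -U root -E chassis power status

# Hard power cycle (unlike a soft `shutdown -r now` run on the host itself)
ipmitool -I lanplus -H cp2035.mgmt.codfw.wmnet -U root -E chassis power cycle

# If the BMC itself is wedged, try a cold reset of the management controller
ipmitool -I lanplus -H cp2035.mgmt.codfw.wmnet -U root -E mc reset cold
```

As the chat notes, a host-side `shutdown -r now` only triggers a soft reboot and does not power-cycle the machine or reset the BMC, which is why the BMC-level cold reset was attempted here (and timed out).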
[09:48:10] godog, jbond: FYI we got also 61 SystemdUnitFailed alerts in -operations in the last 23h (when they started)... seems a bit spammy
[09:49:49] volans: agreed, I think we should downgrade to warning at least for now cc jbond
[09:50:15] godog: fine with me
[09:51:32] ok thank you! that'll stop notifications on -operations; depending on team settings/routing the warnings might get sent to IRC too (different discussion though)
[09:51:38] however for now we still have the icinga one which is critical
[09:53:07] indeed, the difference there is that by default AM will re-notify firing alerts after 2h IIRC
[09:53:22] ahh i see
[09:54:10] looking in foundations I think every 4h, but yes that fills in some gaps I had in my head :)
[09:54:16] vgutierrez: Yeah, I take your point. Thought it might be worth a go, although I agree it won't fully reset the BMC.
[09:54:28] jbond: ah yeah 4h is more likely indeed
[09:54:47] a bunch of things to tweak still but moving in the right direction
[09:56:16] this is the repeat_interval setting in the alertmanager route FWIW https://prometheus.io/docs/alerting/latest/configuration/#route
[10:15:15] <_joe_> FWIW I think the alerts in #-operations are useful - maybe we can configure these to re-fire every 24 hours?
[10:17:50] _joe_: is that for systemdunitfailed or all criticals?
[10:18:18] <_joe_> systemdunitfailed probably
[10:21:34] not sure if it provides any useful info, but I ended up filtering those out on my alertmanager dashboard because they weren't useful to get an overview of the cluster health
[10:21:52] (thinking out loud) special-casing alert names in alertmanager config to tweak repeat_interval is not sustainable, though another label to signal the alert scope might be workable
[10:23:18] the thing is, a dashboard is useful, but maybe there could be some kind of "time-sensitivity" tag or dimension (e.g. backup alerts are important to me, but rarely time-sensitive)
[10:25:45] yeah something along those lines, will need a bit of refinement but definitely doable
[10:28:40] the phab integration seems to always create a new task for new alerts and never close them, which is especially annoying for systemd alerts because they are often flapping as the service goes between running and failed
[10:31:23] yeah a per-unit-failed task is a no-go for sure, though I think when an alert is resolved the bot is supposed to close the task
[10:31:54] at the same time I'm trying and failing to find an example
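For context on the "ack" and re-notification points above, a rough sketch of how such alerts can be silenced from the command line with amtool, and how the routing tree (including repeat_interval, which controls how often a still-firing alert is re-sent, the 2h vs 4h figure discussed here) can be inspected. The alertmanager URL, config path, matcher labels and duration are illustrative assumptions, not the exact commands used.

```
# Hypothetical sketch: URL, config path, labels and duration are assumptions.
# Silence ("ack") the SystemdUnitFailed alerts routed to team=wmcs for a few days:
amtool --alertmanager.url=http://alertmanager.example.org:9093 \
    silence add alertname=SystemdUnitFailed team=wmcs \
    --duration=72h --comment="T333315: hold off until the task creation rate is understood"

# List active silences to confirm:
amtool --alertmanager.url=http://alertmanager.example.org:9093 silence query

# Show the configured routing tree, including repeat_interval per route:
amtool config routes show --config.file=/etc/alertmanager/alertmanager.yml
```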
[10:55:50] remember when I annoyed you about backups failing for a newly set up host? It turns out sometimes it fails for a reason (context: T331896), so it is sometimes useful for me to be that annoying 0:-)!
[10:55:51] T331896: upgrade miscweb VMs to bullseye - https://phabricator.wikimedia.org/T331896
[12:53:40] sukhe: https://gerrit.wikimedia.org/r/c/operations/dns/+/903642
[12:56:41] XioNoX: only dns1003 is left from Traffic's side, I will depool it a bit closer to the event
[12:57:07] sukhe: good to deploy the DNS change now?
[12:57:33] yep. want me to do it?
[12:57:34] Heads up everybody, eqiad row B upgrade is happening in 1h
[12:57:43] sukhe: no preference :)
[12:57:56] go for it then :)
[13:00:43] sukhe: done
[13:00:54] akosiaris: thanks for the depool!
[13:01:06] I was going to ask who will take care of it
[13:01:34] yw
[13:10:34] XioNoX: thanks!
[13:11:32] sobanski: hello! there are a few servers with no indication for "ServiceOps-Collab" in https://phabricator.wikimedia.org/T330165, do they need any kind of special care?
[13:37:21] jelto, mutante, eoghan, arnoldokoth ^
[13:39:39] downtime for otrs and phab is/will be communicated, I'll add NONE to the table. CC: sobanski
[13:39:44] XioNoX: fyi I'll do idp once CI passes https://gerrit.wikimedia.org/r/c/operations/dns/+/903648
[13:39:58] jbond: thanks!
[13:40:06] jelto: thanks too :)
[13:40:09] Thanks jelto
[13:40:34] XioNoX: fyi, we repooled eqiad in discovery, but depooled the two affected thumbor nodes in row B, because thumbor codfw couldn't handle the load on its own
[13:40:55] cf -operations
[13:41:14] noted, thx
[13:42:01] hnowlan: for the core platform team servers, should I do the depool?
[13:42:21] XioNoX: if you could, please!
[13:42:34] on it!
[13:42:58] elukey: for the ores servers, should I do the depool?
[13:44:19] XioNoX: already done!
[13:44:44] awesome
[13:45:34] balloons: all good on the WMCS front?
[13:46:05] I've stopped YARN queues from accepting more jobs into Hadoop, but I'm going to give it a few more minutes before I put HDFS into read-only mode.
[13:48:29] I expect that we will see some alerts relating to the loss of an-coord1001 on dependent services such as Druid, Airflow, DataHub and Turnilo. I have downtimed some, but I will have to be on the lookout for others and apologise in advance for any noise.
[13:49:40] no worries at all!
[13:56:52] jbond: can you take care of "disable puppet in eqiad/esams/drmrs"?
[13:57:00] godog: you taking care of thanos-fe1002 or do you need me to?
[13:57:00] HDFS in read-only mode. DE is as ready as we can be.
[13:57:30] Emperor: I'll take it
[13:57:45] XioNoX: sure
[13:57:50] godog: TY :)
[13:58:01] sure np
[13:58:07] XioNoX: ms-swift ready
[13:58:13] awesome
[13:58:33] jbond: is ldap-replica1003 for you too?
[13:58:49] not sure if it's just a sudo -i depool or something else
[13:59:05] XioNoX: sure i can look at that as well
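Several of the depool/repool steps above (including the "sudo -i depool" mentioned for ldap-replica1003) go through conftool. As a hedged sketch of what that typically looks like, where the hostname, the full selector and the on-host wrappers are illustrative assumptions rather than commands copied from this maintenance:

```
# On the host itself, the simple wrappers (as hinted at with "sudo -i depool"):
sudo -i depool      # remove this host from its load-balanced services
sudo -i pool        # add it back once the maintenance is over

# From a cumin/conftool host, the same thing for a specific server via confctl
# (selector and hostname are illustrative):
sudo confctl select 'name=ldap-replica1003.wikimedia.org' get
sudo confctl select 'name=ldap-replica1003.wikimedia.org' set/pooled=no
# ...after the row B upgrade is confirmed healthy...
sudo confctl select 'name=ldap-replica1003.wikimedia.org' set/pooled=yes
```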
[14:00:23] I think at this point I only haven't heard from WMCS (cc balloons, andrewbogott, arturo)
[14:00:42] once jbond is ready I think we're all set
[14:01:23] XioNoX: we are ready!
[14:01:41] XioNoX: should all be done
[14:02:16] XioNoX: Are you planning to do downtime things? I'm worried that lvs will page people about labweb hosts
[14:02:21] (although in theory I've already removed that page)
[14:03:26] I ran a global downtime "sudo cookbook sre.hosts.downtime --hours 2 -r "eqiad row B upgrade" -t T330165 'P{P:netbox::host%location ~ "B.*eqiad"}'"
[14:03:27] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
[14:03:41] but some things might slip through the cracks
[14:04:21] XioNoX: splendid.
[14:04:42] thanks everybody, now is the last time to speak up before the upgrade
[14:05:49] let's go!
[14:06:04] 🍀
[14:06:09] System going down in 1 minute
[14:06:17] XioNoX: gl!
[14:10:36] did we fail over gerrit?
[14:11:30] guessing we just took the downtime :)
[14:12:52] gerrit should not be affected, just CI (contint)
[14:13:25] yeah agreed, it's not in the list https://phabricator.wikimedia.org/T330165
[14:13:39] is it not timing out for others?
[14:13:44] but it is not responding for me either
[14:13:55] ping gerrit.wikimedia.org
[14:13:55] PING gerrit.wikimedia.org (208.80.154.137): 56 data bytes
[14:13:56] Request timeout for icmp_seq 0
[14:14:13] netbox too, while it's in ganeti row A
[14:14:38] +1 for netbox timeout, wanted to check the row of gerrit there
[14:15:15] yeah, gerrit's down for me
[14:15:33] Also strangely https://yarn.wikimedia.org is 502. Not terribly important right now, but unexpected.
[14:15:38] down for other people too
[14:15:41] note that 3/8 switches are coming up as expected
[14:16:10] the others are slower (still as expected)
[14:18:05] 5/8 up
[14:18:27] gerrit is back
[14:18:33] sweet
[14:18:56] netbox too
[14:19:02] and yarn.
[14:19:10] yep, perfect
[14:19:32] for gerrit, it's in B8, so why didn't I put it on the task?
[14:20:10] XioNoX: happy to do the revert and deploy, let me know when you want
[14:20:32] better gerrit than something more external. Maybe it got migrated since filing or something?
[14:20:47] <_joe_> jynus: I doubt it
[14:21:01] sukhe: doing some checks
[14:21:42] gerrit1001:~$ ls -al /etc/wikimedia/contacts.yaml
[14:21:43] -rw-r--r-- 1 root root 38 Mar 1 19:01 /etc/wikimedia/contacts.yaml
[14:22:03] maybe it didn't have any "owner" when I ran my script to collect hosts (and I didn't notice it)
[14:23:42] alright, the network is healthy
[14:23:48] you can proceed with repools
[14:24:12] <_joe_> akosiaris: are you handling the services layer?
[14:27:10] hnowlan: I repooled maps and restbase
[14:27:15] XioNoX: thanks!
[14:29:02] sukhe: https://gerrit.wikimedia.org/r/c/operations/dns/+/903666
[14:29:59] XioNoX: thanks, will deploy
[14:31:52] _joe_: yes
[14:32:04] <_joe_> ack
[14:45:11] I'm seeing puppet agent hanging on a few hosts. Anyone else seeing the same? puppetboard looks quite red.
[14:45:17] yep
[14:45:38] I pinged jbond in -operations to see if something is up, but here's an example of a failure: https://puppetboard.wikimedia.org/report/dns1001.wikimedia.org/810719d816acdcfa7d86149dfa2c240d195ab40a
[14:45:49] +1 thanks.
[14:47:03] they could just be a result of the network issue
[14:47:19] I think we could run https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed
[14:47:26] and see what's left afterwards
[14:47:32] I'm looking at puppet
[14:47:46] I messed up the disable command so things are hammering the puppet master now
[14:48:27] volans: at this point I think it's best to just wait the 30 mins. I was running `run-puppet-agent --failed-only` but there are so many that it is making things worse now
[14:48:37] ack
[14:48:58] volans: Thanks. It's only three hosts I care about, for applying a specific revert. I'm happy to wait for the all clear.
[14:49:29] btullis: you can probably kick them off manually now and I think they should complete
[14:49:43] (if it's only two or three)
[14:49:54] so
[14:49:58] ==> Do you wish to rollback to the state before the cookbook ran?
[14:49:58] Type "go" to proceed or "abort" to interrupt the execution
[14:50:07] so... if I type go, what happens?
[14:50:11] if I type abort?
[14:50:38] I also would just prefer to skip, but that's a different story
[14:51:15] jbond: thanks, will do. (an-master100[23] and an-launcher1002)
[14:58:36] I think go rolls back, abort just quits out and leaves it as-is iirc
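The "Run Puppet only if last run failed" approach linked above boils down to a cumin invocation along these lines. The host selectors are illustrative assumptions, and as the chat notes, kicking off too many runs at once can make load on the puppet masters worse, hence the batching:

```
# Hedged sketch: host selectors are illustrative.
# Re-run the agent only on hosts whose last puppet run failed, in small batches
# with a pause between them to avoid hammering the puppet masters:
sudo cumin --batch-size 10 --batch-sleep 30 'A:all' 'run-puppet-agent --failed-only'

# For just a couple of hosts (e.g. the ones btullis mentioned), target them directly:
sudo cumin 'an-master100[2-3].eqiad.wmnet,an-launcher1002.eqiad.wmnet' 'run-puppet-agent'
```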
[15:08:08] which team owns irc1001/2001?
[15:11:31] <_joe_> lol
[15:11:36] ^
[15:11:49] <_joe_> XioNoX: I/F
[15:12:01] <_joe_> if you mean which SRE team
[15:12:09] XioNoX: the team is called UO (unclear owner)
[15:12:11] <_joe_> if you mean the service from a dev POV
[15:12:22] <_joe_> that's "lol" as I said
[15:12:28] <_joe_> sorry, the multimedia team
[15:12:30] thx, and lists1001?
[15:12:34] Lost Owner in Limbo
[15:12:37] same
[15:12:56] XioNoX: check https://phabricator.wikimedia.org/T325132
[15:13:13] According to the training checklist, lists is owned by Amir.1 :P
[15:13:30] jynus: that's exactly what I'm fixing
[15:23:43] https://gerrit.wikimedia.org/r/c/operations/puppet/+/903686
[15:28:00] btullis: not sure if related to the maintenance, known or expected, but FYI the webrequest_sampled_128 druid dataset has no data since 12 UTC (maybe one hourly job failed, so it might not be that big of a problem, but I thought I'd mention it)
[15:35:24] volans: Thanks, yes I think it's related to the fact that I disabled ingestion into HDFS. If it doesn't do it automatically, I will look at back-filling it manually.
[15:35:47] ack, thanks! no problem
[15:55:03] volans: Does that look right to you now? I think it back-filled automatically, once I re-enabled the gobblin timers.
[15:55:24] Let me know if not and I'll do ...something about it.
[15:57:09] btullis: mmmh I don't see new data, but it might take a bit, not a problem if it just needs time ;)
[15:57:12] https://w.wiki/6WDC
[16:00:17] Oh yeah, I was looking in the wrong place. Doh! OK, I'll come back to this after meetings.
[16:00:39] np, thx
[16:02:07] eqiad row D upgrade task is out! https://phabricator.wikimedia.org/T333377
[16:12:40] Hi! if you have a few mins this week please provide feedback for sprint week: https://forms.gle/yRpGWHobvXBvA8WS9
[16:32:49] topranks: could the row C upgrade be delayed to Wednesday or Thursday? https://phabricator.wikimedia.org/T331882
[16:39:16] andrewbogott: Thursday is a day off in many European countries (Easter)
[17:21:12] good to know! We'll figure this out in the sync meeting tomorrow.