[06:55:28] I am going to switch phabricator to read-only for a minute in 5 minutes to complete https://phabricator.wikimedia.org/T328404
[07:02:23] This was done, read-only time was 20 seconds
[08:36:19] reminder, in 25 minutes I'll restart reimaging bast2002 and bast1003, will post when they are good to use again
[08:47:33] XioNoX: will you downtime all hosts in row A or should service owners do it?
[08:55:22] marostegui: I'll downtime all the hosts, but some services (not covered with the "all hosts" selector) might need to be individually downtimed to not page/alert
[08:55:38] XioNoX: roger, thanks
[09:21:13] effie, hnowlan: kartotherian is only pooled in codfw, but we're planning to fully depool the datacenter today
[09:21:27] how should we handle it?
[09:21:50] I will switch to eqiad now
[09:21:54] no
[09:21:57] no?
[09:22:01] let me do it with the cookbook later :)
[09:22:08] I just wanted to ensure it could be moved
[09:22:55] yes thank you, we have finished all maint work
[09:44:44] bast2002 can be used again
[10:09:37] topranks: volans and I are testing the cookbook to depool codfw today for the maintenance
[10:09:50] so if you get a page, yell at volans
[10:10:00] :D
[10:10:07] joe: haha will do :)
[10:10:10] I've already told him to yell at you...
[10:13:53] Just yell at the universe.
[10:14:00] It doesn't care :p
[10:20:40] bast1003 can be used again as well
[12:50:01] fyi, switch upgrade communication will be done here
[13:00:10] joe: am I correct to assume that your cookbook is the only action needed for all the ServiceOps servers listed in https://phabricator.wikimedia.org/T327925 ?
[13:01:15] XioNoX: yes, just one doubt about registry2003, we might need to depool it individually
[13:01:30] everything else should just fail but be irrelevant as long as codfw is depooled
[13:01:44] and will it take care of ores and ms-fe? The instruction says to run `sudo depool` so I guess it's in etcd too?
[13:02:08] it's a different kind of depooling
[13:02:14] for ores, yes it will be ok
[13:02:24] ms-fe, we're still writing to swift
[13:02:36] so I would assume that is still needed
[13:02:43] although pybal should detect it
[13:02:50] and so, not *strictly* needed
[13:03:23] XioNoX: I won't be around during the maintenance
[13:03:32] joe: lucky you!
[13:03:41] nah I'm in meetings :P
[13:03:49] but I was thinking I could depool codfw now
[13:03:57] or do it in ~ 1 hour
[13:04:02] a bit less than that
[13:04:22] joe: doing it now works for me
[13:04:31] I'll start downtiming stuff
[13:04:39] ok
[13:04:44] basically the fewer moving parts closer to the maintenance, the better
[13:04:45] volans, topranks ^^
[13:05:45] +1 and thanks for the heads up
[13:06:53] XioNoX: ms-fe> MW will continue to try and write to both DCs, so the ms-fe* nodes will need depooling. I am leaving this as late as will not annoy you, because we have had some issues relating to ms-fe capacity in the not vastly distant past
[13:07:51] Emperor: noted, thanks! I was mostly wondering if joe's cookbook was depooling it as well or not
[13:08:03] XioNoX: fyi i have disabled puppet in the sites that go to codfw
[13:08:11] not from mediawiki writes
[13:09:00] jbond: noted, is there a risk that disabling puppet causes issues with any depooling?
[13:09:07] * volans back, reading backsroll
[13:09:10] *backscroll
[13:09:38] volans: waiting for your ok, I have my cannon ready to fire
[13:10:08] hmm, won't cause an issue with the depool but may cause an issue with the cookbooks? volans: should i re-enable until all cookbooks have been run?
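The exchange above (13:01–13:03) touches on two different depool layers: the per-host `sudo depool`, which flips a server's pooled flag in etcd for PyBal, and the datacenter-level DNS discovery depool that the cookbook drives — which is also why a disabled puppet doesn't get in the cookbook's way. A minimal sketch of what each layer looks like from the CLI; the confctl selectors and the `depool` form of the cookbook invocation are illustrative assumptions, not quotes from a runbook:

    # Per-host depool, e.g. an ms-fe node: run on the host itself, then inspect
    # that host's state in etcd from a conftool-capable host.
    sudo depool
    confctl select 'name=ms-fe2009.codfw.wmnet' get

    # Datacenter-level depool: flip the discovery (DNS) record for a service away
    # from codfw, or let the cookbook walk every active/active service.
    confctl --object-type discovery select 'dnsdisc=swift,name=codfw' get
    cookbook sre.discovery.datacenter-route --reason T327925 depool codfw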
[13:10:24] jbond: it won't with this cookbook
[13:10:28] sukhe: I see that you're around, puppet has been disabled in codfw, is it ok to stop bird now?
[13:10:29] but for others, indeed
[13:10:43] sukhe: on doh2001 I mean
[13:10:53] joe: ok for me to depool discovery services now unless anyone has capacity issues for unrelated reasons
[13:11:06] not that we're aware!
[13:11:18] jbond: what j.oe said, the discovery depool doesn't need puppet
[13:12:07] ack, well i think i'll re-enable it for now just so it doesn't get in the way of anyone else's cookbooks. XioNoX: we can run the disable just before you start the work, it only takes about a min
[13:12:20] note that mr1-codfw is also connected to row A so mgmt will be unreachable during the upgrade too, we shouldn't have alerting spam as we have parent/child relationships
[13:19:06] vgutierrez: I see that you did the ns1 redirect on cr2-codfw, I'm going to push it to cr1-codfw as well
[13:33:40] XioNoX, volans: https://grafana.wikimedia.org/d/000000519/kubernetes-overview?orgId=1&viewPanel=9&from=now-1h&to=now
[13:33:52] that looks depooled to me
[13:34:02] ahahaha
[13:34:06] joe: thx, it took 21min for the cookbook to run?
[13:34:10] I'm not sure... it's not that clear
[13:34:13] :-P
[13:34:21] I am checking enwiki DBs and they have no reads now :)
[13:34:30] joe: to the moon! wait, no
[13:34:54] XioNoX: yes, but we're doing checks at each depool; we're thinking to eventually add a fast-unsafe option to depool first and then do the checks, for more emergency situations
[13:34:57] to be discussed..
[13:34:59] XioNoX: thanks
[13:35:21] volans: good to know, thx!
[13:37:48] I'm going to relocate before the upgrade, back in 10/15min
[13:39:11] I'm afk as far as production help goes for the rest of the day
[13:54:25] other than puppet, are any depools pending? (cc Emperor)
[13:54:34] XioNoX: there's a deployment window for mediawiki at the same time as the maintenance
[13:54:42] I guess you might want to ask to postpone it?
[13:55:24] XioNoX: going to do them now
[13:57:00] XioNoX: ms-fe2009 depooled, swift is ready now [LMK when done, and I'll repool]
[13:57:06] XioNoX: thanks for your patience
[13:57:14] awesome, thx!
[14:00:55] XioNoX: I'm around as oncall, let me know if you need anything fom me
[14:00:57] *from
[14:04:02] thanks!
[14:05:33] around too
[14:06:24] alright, restarting now
[14:06:39] (after double checking I'm sshed to the proper switch stack)
[14:06:47] good luck :-)
[14:06:58] 🍿
[14:07:06] 🤞
[14:07:07] "System going down in 1 minute"
[14:07:12] * claime braces
[14:07:17] 🍀
[14:07:46] XioNoX: one sec, need to disable puppet
[14:07:51] * jbond doing now
[14:08:33] jbond: I did it
[14:08:52] yeah, puppet is disabled in codfw/esams
[14:08:53] ahh ok
[14:09:52] gl folks!
[14:15:12] Emperor: now that codfw is depooled.. could you perform a rolling restart there? https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cluster=upload&var-origin=swift.discovery.wmnet&from=now-7d&to=now&viewPanel=12
[14:15:24] Emperor: swift wasn't doing great in terms of 5xx there
[14:16:59] the switches' boot process is finishing
[14:17:06] XioNoX: nice <3
[14:18:14] hi, is there any way to prevent a Debian package from starting a systemd service? I would like to start it manually after some extra configuration is done ;)
[14:18:29] masking the unit
[14:18:50] then it does not exist until I install the package :-\
[14:18:59] I suspect hashar meant on the apt(-get) commandline
[14:19:10] hashar: are you the packager?
[14:19:18] (at least that's what I would want, I might not even know the systemd unit name before installing the thing)
[14:19:22] or just installing it?
[14:19:42] see systemd::mask
[14:19:51] my use case is `apt install jenkins` which comes with an untweaked systemd file and some lame default properties
[14:20:04] Now that I think of it, there are systems where I'd set up apt in a way that it never auto-starts a systemd unit
[14:20:04] so I need to add a systemd override to tweak the service however fits my need
[14:20:05] there is a large maintenance ongoing, can this wait to make sure this channel is available for coordination?
[14:20:05] all switches seem back online on the virtual chassis output
[14:20:10] and after that start the service
[14:20:21] ack
[14:20:28] XioNoX: awesome
[14:20:41] oh you are doing an op there? :) sorry, I will ask again later ;)
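The question above (14:18–14:20) — keeping a Debian package from starting its service at install time so it can be configured first — gets two answers in the thread: mask the unit, or stop apt from starting anything. A hedged sketch of both, using `jenkins.service` from the example; masking only drops a symlink to /dev/null, so it works even though the unit file doesn't exist before the install:

    # Option 1: mask the unit before installing, configure, then unmask and start.
    sudo systemctl mask jenkins.service
    sudo apt-get install jenkins              # postinst cannot start the masked unit
    sudo systemctl edit jenkins.service       # drop-in override instead of editing the packaged unit
    sudo systemctl unmask jenkins.service
    sudo systemctl start jenkins.service

    # Option 2: a temporary policy-rc.d returning 101 ("action forbidden") tells
    # invoke-rc.d / deb-systemd-invoke not to start services during installation.
    printf '#!/bin/sh\nexit 101\n' | sudo tee /usr/sbin/policy-rc.d >/dev/null
    sudo chmod +x /usr/sbin/policy-rc.d
    sudo apt-get install jenkins
    sudo rm /usr/sbin/policy-rc.d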
[14:20:43] vgutierrez: :( will do once puppetdb is available
[14:20:48] ssh asw-a-codfw.mgmt.codfw.wmnet works
[14:20:49] Can confirm login is possible on two of our nodes (ores 200[12])
[14:21:08] a bunch of recoveries coming in in icinga too
[14:21:20] lvs2007 is reachable as well :)
[14:21:26] vgutierrez: I _could_ do it by hand, but I think waiting for the cookbook to be runnable OK is better
[14:21:44] Emperor: ack
[14:22:02] yeah it looks fully online and healthy
[14:22:02] new ms-fe hardware has arrived, I'm just waiting for it to be ready
[14:22:05] still checking
[14:22:28] I am going to reload db proxy 2002
[14:22:29] XioNoX: can we hold off repooling codfw until I've managed to roll-restart swift there, please?
[14:22:43] Emperor: no pb :)
[14:22:56] (still getting ECONNTIMEOUT to puppetdb2002)
[14:22:59] so we're at ~15min hard downtime for the upgrade
[14:23:17] i can get onto puppetdb2002
[14:23:23] same here
[14:23:23] From -operations, depooling appservers went under the radar, I just depooled api_appserver/appserver/jobrunner/parsoid
[14:23:33] Tell me when I can repool
[14:23:42] got paged
[14:23:45] for api
[14:24:00] https://portal.victorops.com/ui/wikimedia/incident/3414/details
[14:24:11] ack it, dc is depooled
[14:24:15] jbond: the roll-restart swift cookbook is still failing thus: requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='puppetdb2002.codfw.wmnet', port=443): Read timed out. (read timeout=30)
[14:24:23] and appservers too
[14:24:30] jbond: (running from cumin2002)
[14:24:46] ahh ok, the service port not the whole host, checking
[14:24:47] Of course I forgot to silence first...
[14:24:57] volans: sorry about that ^
[14:25:28] Emperor: can you try now
[14:25:33] ack
[14:25:46] jbond: cookbook started OK now, thanks
[14:25:52] np
[14:26:34] XioNoX: fyi puppet is looking healthy so we can re-enable whenever
[14:26:50] ganeti/codfw seems all fine as well
[14:26:58] jbond: thanks, go for it
[14:27:18] * jbond enabling
[14:27:20] Yeah, all the network-wise checks I've done seem good
[14:27:52] * Emperor still not happy about swift in codfw
[14:28:30] * jbond puppet re-enabled, re-running puppet on failed
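The puppetdb2002 confusion above is the classic "host up, service down" split: ssh to the box worked while the cookbook's HTTPS calls from cumin2002 kept timing out. A quick hand-run triage along those lines — the PuppetDB URL path is an assumption and the endpoint may require the client certificate the cookbook uses:

    ping -c1 puppetdb2002.codfw.wmnet                          # host-level reachability
    timeout 5 bash -c '</dev/tcp/puppetdb2002.codfw.wmnet/443' \
      && echo 'tcp/443 answers' || echo 'tcp/443 times out'    # the port the cookbook needs
    curl -sS --connect-timeout 5 -o /dev/null -w '%{http_code}\n' \
      https://puppetdb2002.codfw.wmnet/pdb/meta/v1/version     # full HTTP round-trip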
[14:28:48] claime: BTW, appservers/parsoid/api depool was needed? considering that sre.discovery.datacenter-route has been used?
[14:28:51] Can I repool mw?
[14:28:57] vgutierrez: They're not discovery
[14:29:26] claime: wait; Emperor, which repool do you want us to hold before your rolling restart?
[14:29:33] And tbf I'd rather make double sure.
[14:29:50] XioNoX: Yeah yeah, just putting it out, tell me when and what, Emperor
[14:30:38] claime: WDYM? ATS uses (appservers|api)-{ro|rw}.discovery.wmnet to reach them
[14:30:54] XioNoX: codfw, sorry, since that will repool codfw swift which I'd like to finish restarting first; not least because ms-fe2009 was unhappy
[14:31:18] vgutierrez: Hmm. I have overreacted to the PyBal alarm then?
[14:31:25] Should have just silenced it?
[14:31:32] claime: which alert?
[14:31:40] PyBal backends health check
[14:31:44] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs2009&service=PyBal+backends+health+check
[14:32:33] claime: yeah.. those hosts lost networking momentarily (as expected)
[14:32:36] As well as "api-https:443 failed when probed by http_api-https_ip4 from codfw" that paged volans
[14:33:25] XioNoX: what's the current status?
[14:33:59] I'll repool what's not discovery for mw then, vgutierrez akosiaris ? (api_appserver/appserver/jobrunner/parsoid)
[14:34:07] claime: yeah, silencing might have been a better approach. Note that you can't go below 70% anyway
[14:34:14] volans: network healthy, puppet re-enabled, and waiting for swift for more services repool
[14:34:27] as in, you can't depool more than 30% of appservers anyway, pybal will not honor your request
[14:34:30] akosiaris: confctl will happily set them to depooled but PyBal won't actually depool, right?
[14:34:35] yes
[14:34:38] claime: and that will trigger more alerts
[14:34:40] Tripped up again
[14:34:48] Sorry :)
[14:34:53] s/:)/:(/
[14:34:55] you can pool them all anyway I think
[14:34:55] nah, it's fine
[14:35:06] [insert it's fine meme here]
[14:35:21] * volans has a question why more than 30% of appservers are in a single row...
[14:35:25] for later
[14:36:20] not sure there are. I was just pointing out pybal's depool_threshold setting
[14:36:33] ahhh ok then :)
[14:37:04] There aren't, I think, it's just that I did a full depool of all appservers via confctl, alex was pointing out it was useless anyways :p
[14:37:07] service page recovered
[14:37:23] !incidents
[14:37:24] 3414 (RESOLVED) [FIRING:3] ProbeDown (ip4 probes/service ops page codfw prometheus sre)
[14:37:24] 3413 (RESOLVED) [FIRING:1] ProbeDown (10.2.1.77 ip4 thanos-web:443 probes/service http_thanos-web_ip4 ops page codfw prometheus sre)
[14:37:41] claime: re the https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fservice&var-module=All&orgId=1 alert, IIRC it's gonna try to perform a request against every backend server
[14:38:05] so as soon as some backend servers lose network connectivity the alert is gonna be triggered
[14:38:17] XioNoX: codfw swift restart done, sorry for the delay. I release my lock on repooling :)
[14:38:59] not correct, no, the blackbox probes are performed against the vip
[14:39:01] Emperor: cool, not an issue at all
[14:39:15] in this case we were getting connection refused on e.g. appservers.svc.codfw.wmnet:443
[14:39:19] godog: oh ok
[14:39:36] so... that was actually triggered by claime's depool
[14:39:42] godog: for reference, on 21st Feb when row B gets done, I had to clean up a bunch of unhappy services on the affected frontend (restart or reset-failed)
[14:39:46] to all: you can resume the services repool
[14:39:48] vgutierrez: Yeah, I messed up.
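godog's point above is that the paging ProbeDown checks go against the service VIP, not against each backend, so the page lined up with the bulk confctl depool rather than with individual hosts briefly losing network. A rough way to reproduce by hand what the probe sees, versus a single backend's view — hostnames are taken from the log, the Host header is an assumption, and PyBal's depool threshold (the ~70% quoted above) is what kept the VIP mostly answering anyway:

    # the probe's view: the LVS service VIP
    curl -skI --connect-timeout 3 -H 'Host: en.wikipedia.org' \
      https://appservers.svc.codfw.wmnet/ | head -1
    # one backend's view, for comparison
    curl -skI --connect-timeout 3 -H 'Host: en.wikipedia.org' \
      https://mw2299.codfw.wmnet/ | head -1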
[14:39:54] Emperor: thank you
[14:40:05] claime: messing up in a depooled DC doesn't earn you a tshirt, sorry ;P
[14:40:20] vgutierrez: I'm taking care of ns1
[14:40:26] Too anxious to be t-shirt hunting right now tbh :D
[14:40:27] XioNoX: <3
[14:40:56] sukhe: can you take care of doh2001 or should I?
[14:40:59] when you want I can re-run the discovery cookbook to repool every a/a in codfw
[14:41:03] vgutierrez: on it
[14:41:10] * Emperor will never not think of Homer Simpson apropos that hostname
[14:41:19] Emperor: lol
[14:41:34] Emperor: :P
[14:41:54] I have a highlight for "doh", you can imagine the context in which it gets triggered
[14:42:05] (more for doh as in Homer than in the hostname)
[14:43:31] is it ok to tell the deployment people they can proceed with it?
[14:44:00] We should maybe finish repooling everything before, no?
[14:44:06] yeah I'd say so
[14:44:13] ok to run the sre.discovery.datacenter-route cookbook again?
[14:44:44] the instructions I have are to pool everything in codfw and then depool restbase-async from eqiad
[14:44:48] is that correct?
[14:45:04] lgtm, akosiaris ?
[14:45:44] LGTM
[14:45:58] ok proceeding then
[14:46:02] oh, can I see your tmux ?
[14:46:11] I haven't seen this cookbook output yet
[14:46:13] sure, my user on cumin2002
[14:46:21] it's slow and verbose :D
[14:46:30] * volans running cookbook sre.discovery.datacenter-route --reason T327925 pool codfw
[14:48:03] we were discussing UI/speed improvements
[14:48:04] we need to make that 3s retry exponential :P
[14:48:21] so it will take more time? :D
[14:48:52] and emit very useful and well known messages like CrashLoopBackOff
[14:48:52] "ways in which cookbooks are like Debian releases..." ;p
[14:48:56] to follow the repool: https://grafana.wikimedia.org/d/000000519/kubernetes-overview?orgId=1&viewPanel=9&from=now-1h&to=now
[14:49:01] wait, it is skipping all mw services, so either I'm missing something, or the appserver/etc services weren't actually depooled
[14:49:18] it's skipping the rw ones
[14:49:23] *-rw
[14:49:28] Ah, all right
[14:49:39] skipping all a/p services
[14:50:35] PyBal is still squeaking about hosts being down: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2407.codfw.wmnet, mw2391.codfw.wmnet, mw2408.codfw.wmnet, mw2389.codfw.wmnet, mw2384.codfw.wmnet are marked down but pooled: thanos-web_443: Servers thanos-fe2001.codfw.wmnet are marked down but pooled: api-https_443: Servers mw2396.codfw.wmnet, mw2295.codfw.wmnet, mw2298.codfw.wmnet,
[14:50:37] mw2402.codfw.wmnet, mw2294.codfw.wmnet, mw2299.codfw.wmnet, mw2405.codfw.wmnet are marked down but pooled
[14:52:45] claime: what's their status in ipvsadm ?
[14:52:57] hnmmm
[14:53:18] did pybal lose the connection with etcd and need to be restarted?
[14:53:24] and didn't see the repool
[14:53:29] cgoubert@lvs2009:~$ sudo ipvsadm -L | grep mw2407.codfw.wmnet
[14:53:31] -> mw2407.codfw.wmnet:https Route 30 55 11
[14:53:46] lvs2009 is just complaining about wcqs and search-omega
[14:53:56] everything else looks good
[14:54:00] ack
[14:54:05] inflatador: ^^
[14:54:25] vgutierrez: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs2009&service=PyBal+backends+health+check ?
[14:55:13] claime: yep... not fresh according to pybal logs
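When PyBal says "marked down but pooled" while ipvsadm still shows the backend, it helps to line up the three views: the desired state in etcd, PyBal's own view, and what is actually programmed into IPVS. A hedged sketch run from the LVS host — the /pools endpoint is assumed to be exposed by PyBal's instrumentation next to the /alerts one used just below:

    confctl select 'name=mw2407.codfw.wmnet' get              # desired state in etcd
    curl -s http://localhost:9090/alerts                      # PyBal's summary
    curl -s http://localhost:9090/pools/appservers-https_443  # assumed per-pool backend listing
    sudo ipvsadm -L | grep mw2407.codfw.wmnet                 # what the kernel is balancing to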
[14:55:36] :eyes
[14:55:43] vgutierrez: a'ight, I forced a recheck and it didn't change, that's why I was trying to understand what was going on
[14:55:50] claime, akosiaris: once this finishes it's ok to run: cookbook sre.discovery.service-route --reason T327925 depool --wipe-cache eqiad restbase-async ?
[14:55:50] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[14:56:27] volans: I have no idea if the order of the arguments is ok, but the premise LGTM
[14:56:33] claime: yeah, something is off with that alert
[14:56:40] using mw2299 as an example
[14:56:43] Feb 07 14:23:08 lvs2009 pybal[868]: [api-https_443] INFO: Server mw2299.codfw.wmnet (disabled/partially up/not pooled) is up
[14:56:51] got detected as up at 14:23:08
[14:56:52] akosiaris: ack, thx, I checked the help messages, so it should be good :D or fail badly
[14:57:03] and got repooled at 14:36:06 (according to pybal)
[14:57:27] checks out with my logs
[14:57:29] 20m lag? that's too much, something's off
[14:59:15] no, wait
[14:59:34] running the check locally still returns that
[14:59:52] /usr/local/lib/nagios/plugins/check_pybal --url http://localhost:9090/alerts
[14:59:53] PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2407.codfw.wmnet, mw2391.codfw.wmnet, mw2408.codfw.wmnet, mw2389.codfw.wmnet, mw2384.codfw.wmnet are marked down but pooled; thanos-web_443: Servers thanos-fe2001.codfw.wmnet are marked down but pooled; api-https_443: Servers mw2396.codfw.wmnet, mw2295.codfw.wmnet, mw2298.codfw.wmnet,
[14:59:53] mw2402.codfw.wmnet, mw2294.codfw.wmnet, mw2299.codfw.wmnet, mw2405.codfw.wmnet are marked down but pooled
[14:59:56] yep, :9090/alerts reports that
[15:00:11] so this isn't icinga failing...
[15:00:20] nope.. pybal isn't super happy
[15:00:25] let me restart it
[15:00:30] vgutierrez: do you have any more specifics on the lvs2009 complaints re: wcqs and search-omega? I don't see any alerts
[15:00:59] inflatador: wait a sec, he's restarting pybal on lvs2009
[15:01:30] Feb 07 15:01:12 lvs2009 pybal[18020]: [wcqs_443 ProxyFetch] WARN: wcqs2001.codfw.wmnet (enabled/up/pooled): Fetch failed (https://localhost/readiness-probe), 1.543 s
[15:01:30] Feb 07 15:01:12 lvs2009 pybal[18020]: [wcqs_443] ERROR: Monitoring instance ProxyFetch reports server wcqs2001.codfw.wmnet (enabled/up/pooled) down: 404 Not Found
[15:01:37] inflatador: that
[15:02:17] ACK, will check it out. Thanks vgutierrez and claime!
[15:02:17] I suspect that pybal didn't like at all getting all the backend servers depooled
[15:02:34] $ curl http://localhost:9090/alerts
[15:02:34] OK - All pools are healthy
[15:02:37] looking good now
[15:02:47] phew. sorry again.
[15:03:22] pybal alerts recovering
[15:03:48] I think the cookbook for datacenter-route would benefit from logging how many services it has to depool/repool and giving some updates on progress through the list
[15:04:34] mmhh could lvs2010 have the same problem? I see thanos-fe2001.codfw.wmnet marked as down but it is pooled
[15:06:33] Not the same symptoms, :9090/alerts is clean on lvs2010
[15:07:11] godog: It went green again
[15:07:22] claime: yes it's a good addition
[15:07:29] godog: fixed :)
[15:07:30] yeah I think because vgutierrez restarted pybal <3
[15:07:32] (by a restart)
[15:07:35] lol
[15:08:11] Ok repool's done
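The wcqs warnings above are PyBal's ProxyFetch monitor getting a 404 from /readiness-probe on wcqs2001; replaying the same request by hand against the backend shows whether it's the endpoint or the whole service that is unhappy. A sketch — hitting the node name directly over TLS with -k is an assumption about how the backend is exposed:

    curl -sk -o /dev/null -w '%{http_code} %{time_total}s\n' \
      https://wcqs2001.codfw.wmnet/readiness-probe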
[15:08:15] cookbook finished, ok to depool restbase-async on eqiad?
[15:08:40] yeah
[15:09:15] (you're basically rehearsing what I'll have to do end of the month, thanks for giving it a shakedown)
[15:09:18] done, Waiting 296.01 seconds for DNS changes to propagate
[15:09:28] lol
[15:09:37] I think we can tell deployers go?
[15:09:59] I think so
[15:10:06] XioNoX: anything else
[15:10:07] ?
[15:11:01] https://gerrit.wikimedia.org/r/c/operations/dns/+/886984 needs to be merged, reviews welcomed ;P
[15:11:35] any pending alert not recovered yet?
[15:11:36] volans: fine by me :)
[15:15:37] httpbb stuff
[15:15:48] underreplicated kafka
[15:16:10] Emperor: BTW, https://grafana.wikimedia.org/goto/if2te80Vz?orgId=1 is due to the extra load?
[15:16:50] claime: can I leave it to you to mark the steps in the tasks for serviceops?
[15:17:35] volans: I don't think I understand what you mean
[15:18:19] I thought we had to update the Status column in the task description in T327925
[15:18:20] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[15:18:28] but maybe I'm wrong :)
[15:18:39] vgutierrez: suspect so; as I say, there is more ms-fe* hardware in the DCs but it's not ready for me to work on yet
[15:19:12] volans: Ah right, will do yeah
[15:19:21] I don't see it used much
[15:35:17] Marostegui: regarding db2181, is it still unreachable for you?
[15:37:17] JennH: yep
[16:24:42] Odd question, but has anyone done a major cleanup of swift buckets lately? I'm just trying to get an estimate on how long it would take to remove ~150k objects. Last time I did it w/concurrency of 5 and it was around ~200 objects/min
[16:24:57] re: https://phabricator.wikimedia.org/T316031#8179140
[19:08:54] we got the "uncommitted DNS changes" alert, but the reason is not the normal reason and instead "An error occurred checking if Netbox has uncommitted DNS changes"
[19:15:02] mutante: the NRPE check run on netbox1002 returns 0 and all good
[19:16:28] also from alert1001 it's successful
[19:16:29] volans: ack, thanks. it just recovered
[19:16:34] I'm forcing a new check
[19:16:44] it was transient I'd say, not sure why
[19:16:48] seems already over, yes, indeed
[19:47:09] I suspect this is a very basic question: I want to get a list of hosts with a given puppet role (for firewall and db grants). Can someone point me to an example, or at least a keyword to grep for?
[19:47:33] I see wmflib::puppetdb_query but mostly only very complex uses
[19:48:14] andrewbogott: look at the wmflib::*::hosts functions
[19:48:32] I will! ty
[19:50:07] andrewbogott: a list of hosts with a particular Puppet role applied? just a list output?
[19:50:21] andrewbogott: something like wmflib::role::hosts('cluster::management'), but if the host itself, where you run the puppet code, must be included in the list too, then you need to add it, because at the first puppet run it will not be part of the puppetdb query
[19:50:31] so in that case something like:
[19:50:31] (wmflib::role::hosts('cluster::management') << $facts['networking']['fqdn']).sort.unique
[19:53:35] yes, just a list, and not including the host itself. so I think this will be pretty easy once I get the right role to query
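For the closing question, inside Puppet the wmflib::role::hosts('cluster::management') form above is the answer; outside Puppet, roughly the same list can be pulled straight from PuppetDB's query API. A hedged sketch — the PuppetDB host is taken from the log, but the port/auth details and the exact class title casing are assumptions:

    # PQL: every certname that has the role class applied (may need client certs)
    curl -sG https://puppetdb2002.codfw.wmnet/pdb/query/v4 \
      --data-urlencode 'query=resources[certname] { type = "Class" and title = "Role::Cluster::Management" }'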