[07:49:45] <federico3> Amir1: can I start reboots in s6 in eqiad and s3 in codfw?
[08:17:26] <marostegui> I am going to reboot the non active proxies
[08:25:16] <Emperor> Morning all, could someone +1 https://gerrit.wikimedia.org/r/c/labs/private/+/1151605 please? Adding an apus account to labs/private
[08:58:41] <Emperor> (this has been done)
[09:12:50] <marostegui> federico3: I think this sort of broke the upgrade cookbook: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1151219 because I used it (so I didn't add -t) and when the cookbook finished I got: https://phabricator.wikimedia.org/P76552
[09:15:04] <federico3> ah I missed one of the task_comment entries, just a sec
[09:15:16] <marostegui> thanks
[09:16:52] <federico3> https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1151620
[09:17:37] <federico3> I can run the upgrade myself with test-cookbook -c 1151620 or you can do it on your side so that we do a real end-to-end test (not dryrun) before merging?
[09:17:54] <marostegui> Ok, I will do in a bit
[09:18:37] <federico3> let me open a cleanup task for better safety
[09:25:19] <marostegui> federico3: confirm: test-cookbook -c 1151620 sre.mysql.upgrade -r "Upgrade" db2187.codfw.wmnet ?
[09:26:30] <federico3> yes but I would always do a dry run immediately before :)
[09:26:41] <federico3> test-cookbook -c 1151620 --dry-run sre.mysql.upgrade -r "Upgrade" db2187.codfw.wmnet
[09:27:28] <marostegui> Yeah, asking to confirm, because I am getting https://phabricator.wikimedia.org/P76554 so I wasn't sure what was this about
[09:32:39] <federico3> uhm, no, that's an issue with the host not being found in the puppet query
[09:35:22] <federico3> e.g. sudo cookbook --dry-run sre.mysql.upgrade -r "dry run test" 'es2035.codfw.wmnet' this starts, while sudo cookbook --dry-run sre.mysql.upgrade -r "dry run test" 'db2187.codfw.wmnet' does not find the host
[09:35:49] <volans> it's a (mariadb::sanitarium_multiinstance
[09:36:56] <marostegui> Ah yes
[09:37:07] <marostegui> I had the same issue yesterday with db2186 and i forgot this is the same XD
[09:37:10] <marostegui> My bad sorry
[09:55:44] <Amir1> federico3: go for it!
[10:03:18] <marostegui> federico3: Running the script to upgrade db2187, let's see how it goes!
[10:09:27] <marostegui> federico3: all good
[10:09:53] <federico3> ok, thanks
[10:10:14] <federico3> CR merged
[11:02:24] <federico3> can I access prometheus.svc.eqiad.wmnet from gitlab CI ?
[13:07:19] <marostegui> I am failing over m1 master
[13:52:48] <federico3> marostegui: https://phabricator.wikimedia.org/T384212#10862972 are you referring to creating one user or two? (one for show replica on all databases and another to write on the zarcillo db?)
[13:54:42] <marostegui> federico3: I am busy at the moment with the x3 split
[13:54:54] <federico3> no worries
[14:01:23] <marostegui> We are going to set s8 (wikidata) as RO for a few minutes to split x3 from it
[14:41:57] <Amir1> zabe: hii, I killed your s8 migration script, since we set the db to read only and it was still writing, would you mind turning it on again when you have time?
[14:42:34] <zabe> ye
[14:42:36] <zabe> s
[14:42:46] <marostegui> We should actually fix that
[14:42:53] <marostegui> It is pretty dangerous
[14:43:12] <marostegui> It got me quite confused for a few minutes
[14:43:49] <marostegui> And we could've had a split brain
[14:44:18] <taavi> will the x3 split reach wiki replicas today or will that happen a bit later?
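(Annotation on the 14:41-14:43 exchange above, where a migration script kept writing to s8 after the section was set read-only.) One possible guard, sketched under assumptions: the script re-checks a read-only flag before every batch and aborts as soon as it is set. The names `run_batches`, `is_read_only`, `write_batch`, and `SectionReadOnly` are hypothetical and not taken from the actual migration script; in a real script the check might query `SELECT @@global.read_only` on the section master, which is also an assumption here.

```python
from typing import Callable, Iterable, List


class SectionReadOnly(RuntimeError):
    """Raised when the target section is read-only and writes must stop."""


def run_batches(
    batches: Iterable[List[str]],
    is_read_only: Callable[[], bool],
    write_batch: Callable[[List[str]], None],
) -> int:
    """Write batches one at a time, re-checking read-only before each one.

    Returns the number of batches written. Raises SectionReadOnly as soon as
    the check reports read-only, so no further writes are attempted.
    """
    written = 0
    for batch in batches:
        # Hypothetical check; it is called before every batch, not just once.
        if is_read_only():
            raise SectionReadOnly(f"stopped after {written} batch(es)")
        write_batch(batch)
        written += 1
    return written
```

Checking before each batch rather than once at startup is the point of the sketch: a script that only checks when it starts would not notice a section being flipped to read-only mid-run.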
[14:45:13] <marostegui> taavi: Actually, that is more complex than we think, we need to reimport all that into the sanitarium host and then into the wikireplicas, so I don't think it is happening today
[14:45:34] <marostegui> What's the size of the data we are talking about?
[14:46:36] <marostegui> Any of those tables need filtering? Or it is all public?
[16:18:26] <jynus> es1035 memory alert is flapping now on -operations
[16:31:53] <federico3> looking
[16:34:16] <federico3> marostegui: shall we prioritize the restarts e.g. tomorrow morning?
[16:35:00] <federico3> i've never done the security updates on es* - any pointer from Amir1?
[16:36:48] <Amir1> would the restarts actually fix the problem or just postpone it?
[16:37:52] <Amir1> federico3: if it's a replica of a RW section (es6-es7), the script should just work (just set the section to es6 or es7) for RO sections, it's a bit more complicated as there is no replication
[16:38:06] <federico3> afaik we don't know but at least we get out from the almost-emergency right now
[16:41:00] <Amir1> es1035 is a master of a RW section, you need to do some work, it's complex
[16:41:21] <Amir1> same goes for es2038
[16:41:51] <Amir1> you can't just use automation for them, you have to first stop writing to that section, then do a switchover in dbctl, then depool it and then you can restart it
[16:41:57] <federico3> I'm going to be afk this evening (in 10 mins) :-/ so far it looks like we might be ok for a little bit more but i'd rather get the restart done soon
[16:42:58] <federico3> would it make sense if you do the most urgent restarts and I follow/document the process?
[17:04:46] <federico3> (if it can help I'll be back in 3-4 hours)
[17:28:15] <marostegui> federico3: note that I posted this earlier on the task: https://phabricator.wikimedia.org/T395294#10863529
[21:00:41] <federico3> Amir1: are you around by any chance?
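(Annotation on Amir1's 16:41:51 message above.) The order he gives for restarting the master of a RW es section (stop writes, switchover in dbctl, depool, restart) can be sketched as a small checklist that refuses out-of-order steps. This is only an illustration of the ordering: the step labels and the `RestartChecklist` class are hypothetical, and the sketch runs no dbctl or cookbook commands.

```python
from typing import List, Optional

# Step order as described in the chat; the identifiers are hypothetical labels.
RW_MASTER_RESTART_STEPS = (
    "stop_writes_to_section",
    "switchover_in_dbctl",
    "depool_host",
    "restart_host",
)


class RestartChecklist:
    """Tracks which manual steps are done for one host and enforces their order."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.done: List[str] = []

    def next_step(self) -> Optional[str]:
        """Return the next pending step, or None when everything is done."""
        if len(self.done) == len(RW_MASTER_RESTART_STEPS):
            return None
        return RW_MASTER_RESTART_STEPS[len(self.done)]

    def complete(self, step: str) -> None:
        """Mark a step as done; reject it unless it is the next one in order."""
        expected = self.next_step()
        if expected is None:
            raise ValueError(f"{self.host}: all steps already completed")
        if step != expected:
            raise ValueError(f"{self.host}: expected {expected!r}, got {step!r}")
        self.done.append(step)


checklist = RestartChecklist("es1035")
checklist.complete("stop_writes_to_section")
print(checklist.next_step())
```

A checklist like this would also give someone following along a natural place to record each step while documenting the process.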
[21:06:25] <jinxer-wm> FIRING: [5x] SystemdUnitFailed: check-private-data.service on db2186:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed