[07:49:45] <federico3>	 Amir1: can I start reboots in s6 in eqiad and s3 in codfw?
[08:17:26] <marostegui>	 I am going to reboot the non active proxies
[08:25:16] <Emperor>	 Morning all, could someone +1 https://gerrit.wikimedia.org/r/c/labs/private/+/1151605 please? Adding an apus account to labs/private
[08:58:41] <Emperor>	 (this has been done)
[09:12:50] <marostegui>	 federico3: I think this sort of broke the upgrade cookbook: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1151219 because I used it (so I didn't add -t) and when the cookbook finished I got: https://phabricator.wikimedia.org/P76552
[09:15:04] <federico3>	 ah I missed one of the task_comment entries, just a sec
[09:15:16] <marostegui>	 thanks
[09:16:52] <federico3>	 https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1151620  
[09:17:37] <federico3>	 I can run the upgrade myself with test-cookbook -c 1151620, or you can do it on your side so that we do a real end-to-end test (not a dry run) before merging?
[09:17:54] <marostegui>	 Ok, I will do in a bit
[09:18:37] <federico3>	 let me open a cleanup task for better safety
[09:25:19] <marostegui>	 federico3: confirm: test-cookbook -c 1151620 sre.mysql.upgrade -r "Upgrade" db2187.codfw.wmnet ?
[09:26:30] <federico3>	 yes but I would always do a dry run immediately before :)
[09:26:41] <federico3>	 test-cookbook -c 1151620 --dry-run sre.mysql.upgrade -r "Upgrade" db2187.codfw.wmnet 
[09:27:28] <marostegui>	 Yeah, asking to confirm, because I am getting https://phabricator.wikimedia.org/P76554 so I wasn't sure what this was about
[09:32:39] <federico3>	 uhm, no, that's an issue with the host not being found in the puppet query
[09:35:22] <federico3>	 e.g. sudo cookbook --dry-run sre.mysql.upgrade -r "dry run test" 'es2035.codfw.wmnet' starts, while sudo cookbook --dry-run sre.mysql.upgrade -r "dry run test" 'db2187.codfw.wmnet' does not find the host
[09:35:49] <volans>	 it's a mariadb::sanitarium_multiinstance
[09:36:56] <marostegui>	 Ah yes
[09:37:07] <marostegui>	 I had the same issue yesterday with db2186 and I forgot this is the same XD
[09:37:10] <marostegui>	 My bad, sorry
[09:55:44] <Amir1>	 federico3: go for it!
[10:03:18] <marostegui>	 federico3: Running the script to upgrade db2187, let's see how it goes!
[10:09:27] <marostegui>	 federico3: all good
[10:09:53] <federico3>	 ok, thanks
[10:10:14] <federico3>	 CR merged
[11:02:24] <federico3>	 can I access prometheus.svc.eqiad.wmnet from GitLab CI?
[13:07:19] <marostegui>	 I am failing over m1 master
[13:52:48] <federico3>	 marostegui: https://phabricator.wikimedia.org/T384212#10862972 are you referring to creating one user or two? (one for show replica on all databases and another to write on the zarcillo db?)
[13:54:42] <marostegui>	 federico3: I am busy at the moment with the x3 split
[13:54:54] <federico3>	 no worries
[14:01:23] <marostegui>	 We are going to set s8 (wikidata) as RO for a few minutes to split x3 from it
[14:41:57] <Amir1>	 zabe: hii, I killed your s8 migration script, since we set the db to read only and it was still writing, would you mind turning it on again when you have time?
[14:42:34] <zabe>	 ye
[14:42:36] <zabe>	 s
[14:42:46] <marostegui>	 We should actually fix that
[14:42:53] <marostegui>	 It is pretty dangerous 
[14:43:12] <marostegui>	 It got me quite confused for a few minutes
[14:43:49] <marostegui>	 And we could've had a split brain
[14:44:18] <taavi>	 will the x3 split reach wiki replicas today or will that happen a bit later?
[14:45:13] <marostegui>	 taavi: Actually, that is more complex than we think, we need to reimport all that into the sanitarium host and then into the wikireplicas, so I don't think it is happening today
[14:45:34] <marostegui>	 What's the size of the data we are talking about?
[14:46:36] <marostegui>	 Do any of those tables need filtering? Or is it all public?
[16:18:26] <jynus>	 es1035 memory alert is flapping now on -operations
[16:31:53] <federico3>	 looking
[16:34:16] <federico3>	 marostegui: shall we prioritize the restarts e.g. tomorrow morning?
[16:35:00] <federico3>	 I've never done the security updates on es* - any pointers from Amir1?
[16:36:48] <Amir1>	 would the restarts actually fix the problem or just postpone it?
[16:37:52] <Amir1>	 federico3: if it's a replica of a RW section (es6-es7), the script should just work (just set the section to es6 or es7). For RO sections, it's a bit more complicated as there is no replication
[16:38:06] <federico3>	 afaik we don't know, but at least we get out of the almost-emergency right now
[16:41:00] <Amir1>	 es1035 is a master of a RW section, you need to do some work, it's complex
[16:41:21] <Amir1>	 same goes for es2038
[16:41:51] <Amir1>	 you can't just use automation for them, you have to first stop writing to that section, then do a switchover in dbctl, then depool it and then you can restart it
[16:41:57] <federico3>	 I'm going to be afk this evening (in 10 mins) :-/ so far it looks like we might be ok for a little bit longer but I'd rather get the restart done soon
[16:42:58] <federico3>	 would it make sense if you do the most urgent restarts and I follow/document the process?
[17:04:46] <federico3>	 (if it can help I'll be back in 3-4 hours)
[17:28:15] <marostegui>	 federico3: note that I posted this earlier on the task: https://phabricator.wikimedia.org/T395294#10863529
[21:00:41] <federico3>	 Amir1: are you around by any chance?
[21:06:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: check-private-data.service on db2186:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed