[07:26:39] Amir1: morning, can I start the schema change in s5 in eqiad?
[08:08:38] bad thing with bacula, it got stuck during the weekend
[08:22:53] due to trying to pull from something that was unhappy?
[09:06:33] federico3: go for it
[09:06:38] ok
[09:09:36] Can I fish for a review to unstick backups? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1185863
[09:10:22] Amir1: do we want to enable notifications on es2049, pool it in and move on to the next host?
[09:12:04] thank you, Amir1
[09:37:21] I forced the refresh of new backup parameters, now backups are unstuck, but it will take some time to catch up
[10:35:47] Amir1: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1185879 if you have a second
[10:36:29] thanks. this is top prio for me now, just double checking mounting, raid, etc.
[10:39:16] that was weird, systemd had got bored of scheduling all the swift timers on ms-fe2009
[10:41:25] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:51:46] [fixed by restarting both services]
[10:56:25] RESOLVED: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:25:39] Amir1: do we have a checklist somewhere?
[12:04:26] Amir1: I'm putting together https://wikitech.wikimedia.org/wiki/MariaDB/Provisioning/es_hosts - I can follow it for the other hosts, then if we like it we can merge some bits back into the other runbooks
[12:08:23] thanks. Sorry I was in a meeting until now
[12:09:21] Actually jynus kindly offered to help take a look at whether the storage setup of es2049 (which is replacing es2026) is correct and e.g. that we are not missing any redundancy
[12:10:07] my biggest worry rn is that the old one has exactly half the storage of the new one:
[12:10:25] old: └─tank-data 254:0 0 10.7T 0 lvm /srv
[12:10:25] new: └─tank-data 254:0 0 21.8T 0 lvm /srv
[12:10:59] I can have a look as soon as the bacula stuff gets healthy, if federico3 wants
[12:11:08] which could (not 100%) mean that redundancy is missing somewhere. It could also mean that we just bought bigger disks
[12:11:24] yeah, no rush. Thank you
[12:15:00] I'm not quite sure what our expectations are for disk sizes: my understanding is that these are read-only hosts, so do we expect the stored data never to grow?
[12:15:51] no worries, Amir1 got me up to date and I will check if there is any change needed on the recipe or something
[12:16:10] just give me a few hours so I can have a look
[12:20:23] as a heads up, Amir1, remember to ping me on the ticket you wanted to show me later
[12:20:43] es2048 has no megacli installed, is that expected?
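The size comparison and the swift timer fix above come down to a couple of standard commands. This is a minimal sketch assuming the usual lsblk/systemctl tooling, not the exact invocations that were run; the second failed unit name is not in the log, so it is found rather than assumed:

    # Compare the LVM layout on the old and new host (source of the lsblk lines above)
    lsblk -o NAME,SIZE,TYPE,MOUNTPOINT /dev/mapper/tank-data

    # Find the failed swift units on ms-fe2009 and restart them so the timers get rescheduled
    sudo systemctl --failed --plain --no-legend | grep swift
    sudo systemctl restart swift_dispersion_stats.service
    sudo systemctl list-timers 'swift*'   # confirm the timers are back on the schedule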
[12:21:12] megacli has kind of been upgraded to a more compatible tool
[12:21:31] rather than vendor-specific, chipset-specific, it could be that
[12:21:39] or could be a bug
[12:22:54] storcli64 is the one some modern ones use
[12:23:35] Amir1: thanks
[12:23:59] hm, the storcli package is also absent
[12:24:40] also no perccli64
[12:25:29] check on facter if it has detected its raid configuration and applied the right module
[12:26:34] if a new model has not been detected it may require some puppet changes, in collaboration with infra foundations
[12:27:08] but that's an if, better confirm there is a config not detected first before bothering other teams
[12:28:33] I just checked and the controller of those hosts is correctly configured
[12:28:50] that means some things may need to be adjusted puppet-wise
[12:29:04] the RAID controller, I mean
[12:31:09] federico3: this is the code where it gets autodetected: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/raid/lib/facter/raid.rb
[12:31:54] maybe this new set of hosts requires additional tuning, those are the things amir asked me to help you with, if you are ok (but you can try to do it first on your own :-D)
[12:33:41] > I'm not quite sure what our expectations are for disk sizes: my understanding is that these are read-only hosts so do we expect the stored data never to grow?
[12:33:41] These are still canonical data so I really want to make sure we have good redundancy. They still don't grow. On why we might have bought bigger disks: it could be that ES config has changed and disk is cheap (comparatively), so for the sake of standardization we bought bigger disks on RO ES hosts too.
[12:33:43] sure! Are we the owners of the raid configuration? e.g. is it possible that we receive a raid set up in striping when we wanted mirroring?
[12:34:06] Indeed it seems it has a new device id: 100010e2
[12:34:44] so my suggestion to you is to try to 1) manually install the latest recommended megaraid tool on one host (I think it is not megacli)
[12:35:38] 2) if it works as intended, create a patch 3) involve someone from infra foundations to review the change
[12:35:56] federico3: ^ does that seem reasonable?
[12:38:01] feel free to pm me in private if you have more questions
[12:38:09] it seems puppet is detecting which tool to install across various files, including megacli, storcli, perccli64, and some tools seem to expect either megacli or md e.g. modules/raid/files/check-raid.py
[12:38:37] yep, but that's not happening because of the new device number
[12:39:13] I think the fix is easy, just asking if you are confident to take it over (obviously with a review later from, e.g., me and someone from infra)
[12:39:32] or if you have more questions
[12:40:26] mmm
[12:40:29] but now that I see icinga
[12:40:40] it is doing the right thing
[12:40:44] so maybe it is not that
[12:41:10] The checks seem to be running ok: "communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK"
[12:43:07] I'm nosing around puppet but there are a lot of locations mentioning the raid tools - anyhow if we are not the owners of this maybe we should ask the owners straight away?
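A rough sketch of how the autodetection discussed above can be checked on the host itself. The fact name `raid` is assumed from the raid.rb path, and the controller index 0 is hypothetical:

    # Which RAID management module did puppet's custom fact pick for this host?
    sudo facter -p raid

    # The fact matches on PCI ids; 1000:10e2 corresponds to the "100010e2" device id above
    lspci -nn | grep -i raid

    # If the vendor CLI is installed, query the controller directly (controller 0 assumed)
    sudo perccli64 /c0 show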
[12:43:07] es2048:~$ sudo /usr/bin/perccli64 show all # output looks fine to me
[12:43:26] well, I was trying to confirm the issue you mention first, but I don't see it
[12:48:14] I am seeing all the info on the command: https://phabricator.wikimedia.org/P82704
[12:48:27] https://www.irccloud.com/pastebin/mEAU2J6k/
[12:48:29] and the check looks normal to me, is there something you are missing?
[12:48:46] yep, did you install it manually?
[12:49:11] I did not - but is this the setup we wanted in the first place?
[12:49:58] yep, I had checked it through the web interface (not on every host, but at least on one): RAID10 with write back and 512 strip size
[12:50:15] just that the disk density doubled
[12:50:33] it may still need some tuning on the install recipe because of this change
[12:50:36] we have raid10, looks healthy
[12:50:47] yep
[12:51:24] I think the surprise comes from the fact that Manuel may have okayed the disk size increase, and that came as a surprise to the other dbas
[12:51:28] ...but we don't know why we received larger drives?
[12:51:45] the quote was approved
[12:52:05] usually that means that the disks were probably the same price or something
[12:52:17] or thinking about the non-ro ones
[12:52:29] so we don't have to handle 2 different models
[12:52:38] that part I cannot say
[12:52:55] I can only transmit what I can see on the ticket:
[12:53:24] yeah, and especially if we need to move hosts between sections
[12:53:32] context: https://phabricator.wikimedia.org/T398511
[12:53:49] my guess is we got the command to reduce the number of servers
[12:53:53] and that's one way to do it
[12:54:15] maybe in the future one host can host multiple ro sections (?)
[12:54:37] that's the kind of decision I cannot comment on, but it would be a wise thing to consider
[12:55:18] surprisingly or not, larger disks also become cheaper per GB over time
[12:55:36] at the cost of a little bit less redundancy
[12:55:49] command> I think it was a request (e.g. I had a discussion with dc-ops, and we decided not to make swift nodes denser)
[12:56:03] i checked the virtual drives on all the new hosts: identical across all of them, so i guess we are good to go
[12:56:13] yep, I didn't mean to misrepresent it
[12:56:29] I mean it as a "dc strategy we were asked to contemplate"
[12:56:35] *meant
[12:56:51] and it is ok, especially for ro hosts
[12:57:13] this also means that we could potentially consolidate ES read-only sections *if* CPU and network bandwidth allow it
[12:57:28] federico3: if the only thing that is weird is the install recipe, maybe create a ticket about considering changing it for future batches
[12:57:59] jynus: you mean regarding the need to grow the partition manually?
[12:58:03] or if some other thing was broken/needs fixing
[12:58:08] yeah, whatever is needed
[12:58:29] yes, that's the only glitch in the provisioning it seems
[12:58:44] so I would file it as a potential fix (too late now) and move on
[12:59:44] BTW I was looking for a disk health dashboard on grafana with raid info etc, is there one?
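A sketch of the two checks that came out of this exchange: confirming the virtual-drive layout is identical on the new hosts, and growing /srv by hand where the install recipe left free space. The VG/LV names (tank/data) are taken from the lsblk output earlier in the log and the controller index is assumed:

    # Confirm every new host has the same RAID10 / write-back virtual drive layout
    sudo perccli64 /c0/vall show all

    # If the install recipe left the volume group with unused extents, grow /srv in place
    sudo vgs tank                                  # check free extents in the VG
    sudo lvextend -r -l +100%FREE /dev/tank/data   # -r also resizes the filesystem on the LV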
[13:00:52] similar to https://grafana.wikimedia.org/d/eak7BovZk/smart-disk-data?orgId=1 but for prod
[13:05:55] The prometheus data is probably there, but not a lot of love for the io side of things, in terms of graphs
[13:06:22] I would like to see bandwidth and latency graphs at least on the generic host ones
[13:20:41] jynus: https://grafana-rw.wikimedia.org/d/eak7BovZu/raid-and-smart-disk-drive-health?orgId=1&from=now-90d&to=now&timezone=browser&var-instance=%24__all&var-disk=%24__all&refresh=1m&editIndex=0&viewPanel=panel-8 I stole a copy of the dashboard and look, we have SSD wear level monitoring
[13:23:24] remember not to get lost in the details, the important goal is to have those hosts healthy and provisioned
[14:46:39] Amir1: can I run the schema change on s6 *DC master* in codfw?
[14:47:15] (the schema change on s5 replicas in eqiad is ongoing)
[15:09:08] the change in s5 is still running and I'm not seeing other schema changes running
[15:10:50] but this is recent https://phabricator.wikimedia.org/T402925#11157979 so I'll wait
[15:18:35] Sorry, on phone now. Let me check
[23:10:25] PROBLEM - MariaDB sustained replica lag on s3 on db1223 is CRITICAL: 418.3 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1223&var-port=9104
[23:14:25] RECOVERY - MariaDB sustained replica lag on s3 on db1223 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1223&var-port=9104
[23:16:21] The reason this alerts is that "systemctl restart prometheus-mysqld-exporter.service" was not run
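For the record, the fix hinted at in the last line is just a service restart on the affected replica; a minimal sketch, where the metric name is the standard mysqld_exporter one rather than something confirmed in the log:

    # Restart the exporter on db1223 so the replication-lag metric is reported again
    sudo systemctl restart prometheus-mysqld-exporter.service

    # Verify it is serving fresh data on the port used by the alert (9104)
    curl -s localhost:9104/metrics | grep -i seconds_behind_master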