[01:06:25] FIRING: [5x] SystemdUnitFailed: check-private-data.service on db2186:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:25] FIRING: [5x] SystemdUnitFailed: check-private-data.service on db2186:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:45] ^ I am checking that
[05:08:34] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on db2186:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[05:08:36] I also fixed this ^
[05:11:25] RESOLVED: [5x] SystemdUnitFailed: check-private-data.service on db2186:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:23:07] I am switching m2 master
[06:45:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2186:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:05:17] marostegui: I'm trying to compare memory usage across es* - they all fill up the innodb cache to the max in a couple of days, but some have a much higher number and rate of tmp tables in RAM https://grafana-rw.wikimedia.org/d/a62fa0c4-c1b1-4a8c-b53b-1d015de9049f/federico-s-mariadb-memory-overview?orgId=1&from=now-6h&to=now&timezone=utc
[07:06:32] federico3: it could be many, many things, from connection spikes, to small leaks, to who knows. But we have to get that restarted
[07:07:21] is this difference in tmp tables expected? Also, as a quick workaround we could just lower the innodb cache size without a restart, shaving off a handful of GB just to avoid the immediate risk of saturating the RAM
[07:10:39] please don't ever touch the innodb size dynamically, with our load it will crash
[07:10:53] especially at such large sizes
[07:12:31] it can be lowered if needed, but a restart is much safer. For large buffer pools, in my experience, it leads to the dbs going unresponsive due to the load it generates
[07:17:37] jynus: do you know why some instances get a much higher tmp table creation rate? See https://grafana-rw.wikimedia.org/d/a62fa0c4-c1b1-4a8c-b53b-1d015de9049f/federico-s-mariadb-memory-overview?orgId=1&from=now&to=now&timezone=utc
[07:18:00] before looking: usually it's the kind of queries being run
[07:18:27] tmp tables are not a huge issue unless they hinder performance
[07:18:46] for example, some sorting always creates tmp tables, and many times that's ok
[07:19:23] e.g. maybe backups are doing that
[07:19:35] or monitoring
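A quick way to compare where those tmp tables come from, beyond the dashboard above, is to read the raw counters MariaDB keeps; a minimal check, assuming socket access as root on the host, with the PromQL line assuming the standard mysqld-exporter metric naming:

    # counters since the last restart: in-memory vs spilled-to-disk temp tables
    mysql -e "SHOW GLOBAL STATUS LIKE 'Created_tmp%'"
    # the per-session thresholds that decide when an implicit tmp table goes to disk
    mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('tmp_table_size','max_heap_table_size')"
    # roughly the rate the Grafana panel plots, if querying Prometheus directly:
    #   rate(mysql_global_status_created_tmp_tables[5m])

Comparing Created_tmp_disk_tables against Created_tmp_tables across the es hosts shows whether the higher creation rate also means more disk spills, which is when it actually starts to hurt performance.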
[07:21:16] my suggestion: before trying to think about how to resolve a problem, think about whether it is worth resolving
[07:22:21] I'd say you should restart the hosts asap, without overthinking it - as other SREs outside of our team are already complaining about the ongoing alerts
[07:22:32] not acting on that would be quite rude
[07:23:02] only after that, if a memory leak is detected again, then start researching
[07:26:15] thanks, I'll do the restart as soon as Amir1 can show me how
[07:27:57] I can show you if you want - but for a replica it's depool, restart, repool (that should be easy and no different from other sections)
[07:28:11] a primary will require a switchover
[08:18:52] jynus: ok I can plan the switchover using the tool, I'm pinging you in query for details
[09:19:56] federico3: as Amir mentioned yesterday you need to first stop writes on that section. I'd suggest generating the ticket using the tool and it will tell you all the steps
[09:47:40] marostegui: I created the switchover tasks for the 2 master hosts. I think it would be safer to start with depooling and restarting the replica (es2039). Is there documentation on how to depool it?
[09:49:16] (I asked jynus in query but he's not sure if it requires a mediawiki deploy)
[09:49:38] yeah, I may have outdated info, as that is how it was done before
[09:50:20] federico3: the task itself tells you how to do it
[09:50:22] but I would create a ticket first so it can be reviewed
[09:50:56] federico3: I'd suggest you wait for Amir1 to be on standby, I cannot do it now.
[09:51:47] marostegui: sorry, which task? The master switch tasks contain a runbook, but I'm asking about the replica :)
[09:52:17] no worries, I'll ask Amir1
[09:52:47] federico3: which tasks have you created?
[09:53:23] marostegui: the two switchover tasks related to https://phabricator.wikimedia.org/T395294
[09:53:29] You cannot depool a master, you first switch the host and then depool
[09:53:46] yeah, federico3, the task you created has the answer to your question: T395544
[09:53:47] T395544: Switchover es7 master (es1035 -> es1039) - https://phabricator.wikimedia.org/T395544
[09:53:51] It actually tells you how to do it step by step, including the depooling once the host is no longer a master
[09:54:09] Take a read through the steps and check if they make sense
[09:55:36] I'm not talking about the switchover process, I'm aware it's written in the two tasks :D I'm asking about the depooling of the *replica* host and the restart of mariadb
[09:56:12] federico3: dbctl instance HOST depool
[09:56:14] e.g. after depooling the replica, can we just restart the mysql process and bounce it? Does it need any checks?
[09:56:18] Like a normal database
[09:56:32] e.g. any specific steps for es? Just a normal repool after the restart?
[09:57:03] federico3: As I mentioned, if it is a replica, I would suggest you just run the upgrade cookbook, which upgrades the kernel, reboots, and upgrades mariadb
[09:57:18] So you also upgrade the kernel, as that's part of the other task too
[09:58:54] ok so to summarize for es2039 (currently a replica): 1) normal depool 2) upgrade cookbook 3) repool, nothing special about es. I can do it immediately or wait for Amir1, as you prefer
[09:59:51] You can do the replica anytime
[09:59:58] federico3: I'm out today, it's a public holiday here
[09:59:59] Repool in steps by the way
[10:00:02] thanks, starting now!
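As a rough sketch of the plan just summarized (depool, upgrade cookbook, repool in steps), the dbctl side might look like the following; the cookbook invocation is left out since it is not named above, and the -p percentage flag is quoted from memory, so it is worth double-checking against dbctl --help:

    # 1) depool the replica and commit the change
    dbctl instance es2039 depool
    dbctl config commit -m "Depool es2039 for kernel/mariadb upgrade"

    # 2) run the upgrade cookbook against es2039 here
    #    (kernel upgrade + reboot + mariadb restart; exact name/arguments per the runbook)

    # 3) repool in steps, committing each time and letting traffic settle in between
    for pct in 10 25 50 75 100; do
        dbctl instance es2039 pool -p "$pct"
        dbctl config commit -m "Repool es2039 at ${pct}%"
        sleep 300
    done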
[10:00:07] federico3: tiene sentido
[10:00:21] marostegui: --slow or just the "normal" speed?
[10:00:28] jynus: ??
[10:00:48] sorry, Spanish leaked, I meant that your plan seemed sane
[10:00:57] :D
[10:01:34] tiene sentido literally "it is sensible"
[10:01:58] è sensato
[10:02:01] (sounds very similar to "makes sense")
[10:02:09] federico3: 4 or 5 steps is fine
[10:12:38] federico3: can you check https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151829 I need to deploy that
[10:13:04] looking
[10:19:29] federico3: Will also need this https://gerrit.wikimedia.org/r/1152030
[10:19:29] marostegui: replied. I tried to describe what I see, as a check, if that helps
[10:19:39] yep thanks
[10:20:01] es2040 is lagging
[10:20:10] for that dns change, please check that the yaml for both dbproxies is the same
[10:20:32] es2038 is down?
[10:20:43] lots of traffic
[10:21:34] but write throughput is down
[10:21:53] I think semi sync is having issues with es2039 being down?
[10:22:06] :-(
[10:22:14] yes, es2039 has semi sync issues
[10:22:15] fixing
[10:22:22] I am so tired of this semi sync
[10:22:24] any task related to the DNS change?
[10:22:35] not touching anything but let me know if you need help
[10:22:42] federico3: please read above, we are having issues with es
[10:22:58] I saw it, anything I can do?
[10:23:10] es2040 is about to p*ge
[10:24:22] I don't see user impact so far (errors on log)
[10:24:42] (the wait for replication is running on es2039 - could it be that es2039 is causing impact due to the replication traffic?)
[10:26:34] got the page
[10:26:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance es2040:9104 has too large replication lag (11m 1s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=es2040&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[10:26:51] are these pages expected?
[10:27:34] Emperor: dbas are handling it, not expected
[10:27:50] OK, cool.
[10:27:55] I'll keep out of the way :)
[10:28:38] jynus: what do you make of this? https://grafana.wikimedia.org/goto/-aWf1AfHR?orgId=1
[10:29:18] I don't have full context, let's support Manuel in what we can first, debug later
[10:30:49] the host is fried
[10:30:58] :-(
[10:31:09] my point is that es2039 is not catching up with replication and I wonder if getting it back in prod could alleviate load on 2040
[10:33:12] once the deploy happens we will have a more comfortable time to act on that
[10:36:48] RESOLVED: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance es2040:9104 has too large replication lag (20m 2s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=es2040&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[10:37:21] nice
[10:38:13] I think we are out of the woods now
[10:38:50] https://phabricator.wikimedia.org/T395551 for details
[10:39:05] thanks, I will comment there later
[10:39:22] The deployment is taking ages, but I will leave them disabled for a bit to make sure everything is stable
[10:40:01] yep, aviate, navigate, communicate in that order :-D
[10:40:31] exactly!
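Since es2039's absence is being blamed on semi-sync above, the relevant state can be read directly from the servers; a minimal check, again assuming root socket access (these are stock MariaDB status variables, nothing WMF-specific):

    # on the es7 master: is semi-sync enabled, how many semi-sync replicas are
    # connected, and is the no_tx counter (commits that gave up waiting) growing
    mysql -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master%'"
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_timeout'"

    # on the lagging replica: semi-sync slave status plus plain replication health
    mysql -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_slave%'"
    mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master|Last_.*Error'

If the master shows semi-sync enabled but few or no connected semi-sync clients, commits can stall for up to the configured timeout before falling back to async, which would line up with the write-throughput dip mentioned above.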
[10:40:38] Writes are disabled
[10:42:51] let me know how I can help
[10:45:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2186:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:45:28] es6 seems healthy btw
[10:46:37] yeah we should be good now
[10:46:49] I am going to enable writes again, I don't want to create much imbalance between es7 and es6
[10:46:53] 2039 is not pooled in at this time, but it caught up with replication
[10:47:09] please start pooling it now
[10:47:12] did you restart the fried intermediate? what was the fix?
[10:47:15] I can restart the upgrade script to do the pool-in and remove the puppet silence
[10:47:20] jynus: I did, and upgraded to 10.6.22
[10:47:25] thanks
[10:47:26] https://phabricator.wikimedia.org/T395551#10866996
[10:47:43] federico3: Just pool in please
[10:47:47] ok pooling in
[10:48:14] federico3: es2039 was restarted already, so it just needs to be pooled in
[10:50:13] I am handling db2186 so you don't need to do anything there
[10:50:16] pooling in immediately while I remove the icinga silence
[10:50:32] thanks jynus
[10:50:36] I appreciate your help a lot
[10:50:45] I am going to enable writes
[10:50:52] sounds good? or should we wait a bit?
[10:51:17] I'm seeing this error "ERROR ferm input drop default policy not set, ferm might not have been started correctly" from es2039
[10:51:26] restart it
[10:51:31] (ferm)
[10:51:32] es7 seems fine, if you are ok, I am ok
[10:51:45] ok I will go for it
[10:52:06] wait for ferm to confirm it's ok, not sure if it will reset connections
[10:52:13] federico3: did it?
[10:52:25] ok, I will hold when it asks me to confirm the deployment
[10:52:30] but the revert can start
[10:52:34] yep I started
[10:52:35] just not the deploy
[10:53:22] I'm not seeing fern as a systemctl service, how do I start it?
[10:53:29] ferm
[10:53:35] let me check
[10:53:40] oh sorry, misread
[10:53:41] federico3: root@es2039:~# systemctl restart ferm.service
[10:53:42] ok starting it
[10:53:58] confirm when done / check it is ok
[10:53:59] it's up
[10:54:09] ok
[10:55:03] it's complaining about an error in the config file but running
[10:55:05] db2186 is weird because it references s3, but that is an x1 host
[10:55:22] jynus: yeah because it was an ex-sanitarium host
[10:55:24] running puppet but if I cannot fix it I will leave it as is
[10:55:25] ah!
[10:55:28] so maybe it still has some leftovers
[10:55:36] ok, so not a priority for now
[10:55:39] jynus: Don't spend much time on it, I can always reimage and start fresh (keeping the data)
[10:55:44] yep
[10:55:53] jynus: mind creating a task?
[10:56:02] Deploying
[10:56:03] I will
[11:03:22] writes enabled
[11:08:25] es2039 is at 25% and pooling in
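Since ferm had to be restarted on es2039 right before repooling it, a quick way to confirm that the default DROP policy the earlier error complained about is actually in place might be (generic systemd/iptables commands, not a WMF-specific check):

    # did the ferm unit come up, and what was the config-file complaint mentioned above?
    systemctl is-active ferm
    journalctl -u ferm -n 20 --no-pager

    # the error was about the INPUT chain's default policy; this should print "-P INPUT DROP"
    iptables -S INPUT | head -1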
[11:31:07] back
[11:33:00] I will now handle db2186 with more time, unless you need me for something else urgent
[11:36:56] can I start https://phabricator.wikimedia.org/T395544 ? the switchover for the remaining master? or do we think it's not safe at this time?
[11:40:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2186:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:41:16] federico3: Let's not touch es anymore today
[11:45:24] better to ack the alert for a few days
[11:45:39] federico3: ^ can you do that?
[11:45:40] so it doesn't annoy the sres
[11:45:51] and continue the work later on
[11:46:12] note I will finish my week soon, so I won't be around until Monday
[11:46:27] enjoy your time off jynus
[11:46:54] ok
[12:02:08] I can extend the alert silence for a week
[12:03:02] actually, I realized I have a last meeting so cannot go still he he
[12:03:08] but is there something we want to tweak in monitoring or just a silence?
[12:03:10] s/still/yet/
[12:04:13] IMHO monitoring worked as expected: WARN for a week before it is too late, and ALERT when it is close to having perf issues
[12:04:44] if we waited longer (e.g. until actual swapping) the host would be unrecoverable without a restart
[12:11:03] when hosts start drifting away from the expected values I'd rather have an alert that triggers a highlight on IRC
[12:25:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db2187:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:36:05] marostegui: we found a permission missing for federico3
[12:36:27] I am working with him on that, FYI
[12:36:29] which is?
[12:36:40] icinga write rights
[12:37:00] interesting
[12:37:03] meanwhile I'm setting the silence on alertmanager
[12:37:33] yep, I can set silences using the cookbook or alertmanager
[12:38:04] I will ask someone else if I cannot fix it immediately
[13:35:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db2187:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:42:23] marostegui: can I go on with the reboots of more sections?
[13:43:10] federico3: yes
[13:43:13] I just finished rebooting the proxies
[13:43:21] I'd appreciate it if you could check the CR I sent you
[13:43:23] So I can push that
[13:44:17] sure, 1 min
[14:35:15] federico3: How are you rebooting? Are you running the upgrade cookbook?
[14:35:56] the rolling_restart.py from Amir
[14:36:09] federico3: Does that run an apt full-upgrade?
[14:36:25] It would be interesting to upgrade the mariadb package too, as we have to depool the hosts and stop mariadb anyway
[14:36:36] So we roll out 10.6.22 which has the sync fix
[14:36:41] let me check
[14:42:29] good question - is the update triggered by puppet during the restart? Anyhow, yes, I can open a task for integrating this
[14:45:16] No, the update will be picked up once apt full-upgrade is done
[14:45:18] otherwise it won't
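To check in advance whether an apt full-upgrade on these hosts really would pick up the pending mariadb package (and the new kernel), a dry run on one depooled host is enough; these are standard apt commands, nothing specific to the reboot script:

    # list what is currently upgradable, filtered to the interesting packages
    apt list --upgradable 2>/dev/null | grep -Ei 'mariadb|linux-image'

    # simulate the full upgrade without changing anything (-s) and show which
    # packages would be installed (Inst), configured (Conf) or removed (Remv)
    apt-get -s full-upgrade | grep -E '^(Inst|Conf|Remv)' | grep -Ei 'mariadb|linux-image'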
[14:45:25] How does the script install the new kernel?
[14:57:28] I think kernels are being rolled out by the security team and installed in advance, the reboot only switches to the latest kernel
[15:02:54] I see the packages for older kernels being left installed
[15:04:00] I think we should do a full upgrade
[15:04:03] If we are restarting the hosts
[15:04:13] anyhow, if we were to switch to the upgrade cookbook it would be easier
[15:05:00] we get the kernel update as a bonus, and have the new tooling that can update phabricator if needed and so on
[15:05:12] Yeah, I hoped we were using that cookbook
[15:05:52] Is the reboot script easy to modify to include an apt full-upgrade -y?
[15:07:00] no, unfortunately not easy at all, but perhaps we can instead use the upgrade cookbook to run it on multiple hosts
[15:09:56] federico3: The upgrade cookbook depools and then repools the host?
[15:10:11] actually there's another benefit! IIRC the upgrade cookbook is unaware of the DB role. If we made it able to detect the role it could also refuse to upgrade a live master, and gain other safety features
[15:10:45] That is for later, let's focus on the kernels then
[15:10:54] yes
[15:11:06] I was hoping there was an easy way to include the mariadb upgrade too
[15:11:29] federico3: The upgrade cookbook depools and then repools the host? --> yes
[15:11:52] host.run_on_host('service mariadb stop') I guess we can include there the upgrade of the mariadb package maybe?
[15:11:57] ah maybe there's an easier way
[15:13:30] rolling_restart.py is scanning all replicas, filtering out the ones being used for backup, staging, etc.
[15:13:45] probably we can just make it call the existing cookbook when needed
[15:14:39] A script calling a cookbook and both will try to repool a host? I think this is getting very complex
[15:14:43] Did you see my comment above?
[15:16:51] no, something easier: we don't touch anything in the upgrade cookbook and we just put an "if" in rolling_restart.py like:
[15:16:51] if do_full_upgrade: execute the upgrade cookbook that does depool/upgrade/repool
[15:16:51] otherwise: execute its own code without change (so it does depool/reboot/repool as usual)
[15:17:25] host.run_on_host('service mariadb stop') I guess we can include there the upgrade of the mariadb package maybe? --> ah, only one package? sure
[15:17:59] albeit we would be entering new territory, running the OS with only mariadb updated and not the other security fixes... do we want that?
[15:18:23] See above, I've suggested apt full-upgrade
[15:19:40] you mean we call apt-get upgrade ourselves without the cookbook? That also works
[15:21:09] we can just add a flag, it should be pretty quick
[15:24:26] [17:11:52] host.run_on_host('service mariadb stop') I guess we can include there the upgrade of the mariadb package maybe?
[15:27:16] I'm saying that, yes, we can call apt-get directly and that will update all packages including mysql
[15:27:26] so that was my whole question XD
[15:28:21] does mariadb require special handling to specify which version we want, or is a full OS upgrade always good?
[15:30:34] ah, we have packages with the version in the package name, e.g. wmf-mariadb106 wmf-mariadb1011
[15:31:39] marostegui: I haven't done mariadb version changes, what is the process? Do you remove wmf-mariadb106 and then install wmf-mariadb1011?
[15:32:55] federico3: No, as I said, apt full-upgrade will do it
[15:33:29] ah, probably there's some Depends/Conflicts involved, fair enough
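Spelled out as shell, the extra branch discussed for rolling_restart.py would boil down to something like this on an already depooled replica; this is a sketch of the idea above rather than the script's actual code, and it assumes the wmf-mariadb106 -> wmf-mariadb1011 transition is handled by apt's own Depends/Conflicts, as stated:

    # stop mariadb first, as the script already does
    service mariadb stop

    # pick up everything pending (new kernel, wmf-mariadb*, other security fixes) in one go
    DEBIAN_FRONTEND=noninteractive apt-get -y full-upgrade

    # reboot to switch to the new kernel; mariadb is started again as part of the
    # usual post-reboot checks before the host is repooled
    systemctl reboot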
[16:40:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db2187:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:40:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db2187:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed