[06:20:18] Going to switch over pc3 master, I don't expect lots of impact on the hit rate, as I have had the spare replicating in pc3 for a few days
[09:53:45] Amir1: once you are done with es3, do you want to do es5? (I am going to do es4). es4 and es5 do require a proper master swap as they are not RO
[09:53:58] so you can do your first DB switch
[09:54:11] sure sounds good to me
[09:54:21] ok, I will assign the task to you
[09:58:09] oh wow I just created task T300005!
[09:58:10] T300005: Upgrade es4 to Bullseye - https://phabricator.wikimedia.org/T300005
[09:58:16] Bye bye T2*!
[09:58:16] T2: Get salt logs into logstash - https://phabricator.wikimedia.org/T2
[09:59:02] lol
[09:59:09] XD
[10:00:08] I am glad that stashbot didn't parse "*" and send all the task titles from T2 onwards XD
[10:01:32] It should just say "I'll be back"
[10:01:53] 👏
[10:08:49] * Amir1 sighs
[10:08:55] It's so bad it's good
[10:18:06] es2029 seems to be getting reimaged easily
[11:43:26] kormat: thank you for the quick action on T300013 <3
[11:43:26] T300013: Zarcillo access for Prometheus new hardware - https://phabricator.wikimedia.org/T300013
[11:44:21] random/idle thought I had when thinking about decom, sometimes I codesearch for the hosts' addresses I'm decom'ing and wonder if grants are available somewhere for grep?
[12:06:21] godog: Amir has plans to work on a user-friendly solution for this problem sometime in the future
[12:40:40] yup ^^
[12:41:00] marostegui: lingering connection to depooled es3 host in eqiad by dumper
[12:41:44] Amir1: yeah, sometimes it takes a while
[12:42:55] Amir1: If it takes a long time, you can ping ariel to see if it is ok to kill that specific connection so the snapshots will hit a non depooled host
[12:47:15] yeah
[13:10:12] sobanski Amir1 thank you!
sounds great
[13:54:04] db2086 will stay down overnight for https://phabricator.wikimedia.org/T299882#7648915
[14:04:11] it just needs some rest for once 🥺
[14:23:10] when you upgrade es4 and 5, things will be easier for both if you avoid Mondays (backup day), but we can work around that if needed (just talk to me) CC Amir1
[14:23:25] sorry, that would be Tuesdays, not Mondays
[14:24:26] afk
[14:24:28] jynus: sure, no problem! As we are going to "close" a section for the master's reimage, that should be fine
[14:25:42] thanks
[14:25:45] I am going to remove logpager from wikidata btw
[14:25:51] Let me know if there're spikes or something
[15:44:19] back - so to expand on the es backups: because currently they take almost 20h or so, it is easy to move them if necessary (earlier or later), just ping a day in advance and I can make sure they don't block restarts/upgrades
[15:51:17] * kormat glares at mysql using VARCHAR(3) to hold a binary value (information_schema.COLUMNS.IS_NULLABLE) ಠ_ಠ
[15:52:11] it is to be able to store "tru" or "fal", which is the actual sql-92 standard
[15:52:32] they store `'YES'` or `'NO'`, from what i can see
[15:52:54] I was making that up, but reality seems not far away
[15:53:04] easy: 0, 1, NaN
[15:55:33] this is a nice funny wiki section: https://en.wikipedia.org/wiki/Boolean_data_type#SQL
[15:55:44] "Access represents TRUE as −1, while it is 1 in SQL Server"
[15:56:52] * kormat runs away screaming
[15:57:24] That's one way to ensure you require an "Enterprise Migration Package 2010"
[15:59:27] we should blame Łukasiewicz, according to wiki
[16:01:38] in mw, we use tiny int which means it can go even further
[16:01:47] > PROBLEM - Host db2086.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:01:52] Should I be worried
[16:02:07] wasn't that the one manuel sent to papaul?
[16:02:25] Amir1: yes.
you should be worried you don't remember it being mentioned ~2h ago
[16:02:37] yeah: T299882
[16:02:37] T299882: Reset db2086's idrac - https://phabricator.wikimedia.org/T299882
[16:02:42] yes that is the one
[16:02:59] searching in my inbox didn't bring it up :/
[16:03:29] I assume someone forgot to downtime it :P
[16:04:29] Amir1: searching your email will only work if you're subbed to the task
[16:04:35] (which, in this case, you are not)
[16:04:41] clearly a moral failure
[16:05:22] I'm actually "soft-subscribed" to every ticket tagged with DBA
[16:05:24] one thing I do when I see an alert is to search for the host on phab, and 99% of the time there is a task about it
[16:05:57] e.g. if you see randomhost1002 alerting
[16:06:48] yeah, better than searching in my inbox :D
[16:08:04] to be fair, there is a lot of room for improvement on alerting workflow, which is why there is a dedicated team working on it (tooling, awareness)
[16:18:46] sigh, writing sh feels like scripting with one hand tied behind your back
[16:20:46] (how can you live without `set -E` and `trap ... ERR`???)
[16:35:06] I'm about to do a failover of es3 in eqiad, it's a noop as it's RO
[16:35:51] * kormat looks at https://github.com/wikimedia/puppet/blob/production/modules/install_server/files/autoinstall/scripts/reuse-parts.sh, and nods solemnly at Emperor
[17:09:49] db2086 is back
[19:13:05] now I did a failover of s3 master back to es1028
[23:15:59] PROBLEM - MariaDB sustained replica lag on s4 on db2090 is CRITICAL: 6.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2090&var-port=9104
[23:17:47] PROBLEM - MariaDB sustained replica lag on s4 on db2140 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2140&var-port=9104
[23:18:21] RECOVERY - MariaDB sustained replica lag on s4 on db2090 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2090&var-port=9104
[23:20:07] RECOVERY - MariaDB sustained replica lag on s4 on db2140 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2140&var-port=9104