[01:09:47] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 33.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:37] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 19.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:13:03] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:15:31] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[05:51:12] I have upgraded one of the eqiad sanitarium hosts to 10.6
[05:51:18] We'll see how long it takes to break XD
[05:51:28] Hello, dbstore1005 "staging" instance seems to be unreachable from the stat hosts, probably related to the reboots yesterday and similar to T321464 cc marostegui
[05:51:29] T321464: dbstore1005 "staging" instance is down - https://phabricator.wikimedia.org/T321464
[05:51:42] stevemunene: let me check
[05:51:52] Amir1: ^ for what it's worth, does your script check for "non standard" dbs?
[05:52:26] stevemunene: should be back up now
[05:53:15] Yes it is, thanks marostegui
[05:53:26] thanks - sorry for the inconvenience
[05:57:05] jynus: tomorrow I'll probably be ready for https://gerrit.wikimedia.org/r/c/operations/puppet/+/942652 but it looks like that needs a manual rebase
[06:39:58] Greetings! I'm shopping around to find a home for a ~3.4GB data set we've created by scraping and analyzing the HTML page dumps. I'm happy to break it into subsets, create the metadata for publishing, etc. but I'm looking for guidance about where to host it to make it available to researchers and Wikimedia teams. More details:
[06:40:04] https://lists.wikimedia.org/hyperkitty/list/wiki-research-l@lists.wikimedia.org/thread/M3IDFYT44O2NDGKKU7FG5Q25YTY4KGCS/
[06:42:05] awight: How about https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Toolsdb or is it private data?
[06:48:27] marostegui: It's public data, but currently stored in flat files, and my hunch is that consumers would prefer it that way rather than in a database. But this is very much where my experience ends... I imagine that a "feature store" is actually the ideal way to distribute these metrics, but we don't have something like that yet?
[06:49:20] awight: I don't think storing JSON content in a mysql database is ideal, to be honest
[06:49:57] awight: I wonder if data analytics can provide some better places than a relational database for that
[06:51:50] +1 although these could be coerced into a db, I agree, and my plan is to just upload these flat files somewhere. If my post sounded like I was considering an RDBMS, that was accidental.
[06:52:14] Ah yeah, I thought you wanted to push that data into a database
[06:52:37] So yeah, I don't really know where you'd be able to push those files as they are, maybe on people.wikimedia.org?
[06:52:39] I don't really know
[06:55:14] If I understood elukey correctly, I think there was a suggestion to drop them in a location like https://analytics.wikimedia.org/published/datasets/one-off/ but I was supposed to check with SRE about whether pushing this quantity of data for medium-to-long-term storage was reasonable.
[06:55:57] awight: I think in that case I suggest you ask in wikimedia-sre because I am not exactly sure who owns that
[06:56:12] :-D thanks for the hint!
[06:56:17] Good luck!
[06:58:55] let me do it for you: marostegui
[06:59:26] jynus: Cool, not yet though. Probably tomorrow, I will let you know
[06:59:40] I mean the rebase 0:-D
[06:59:41] Ah you meant the rebase :)
[07:49:12] arnaudb: welcome (belatedly, it was a public holiday here yesterday)
[07:50:01] hello Emperor and thanks :-)
[07:52:09] marostegui: done https://gerrit.wikimedia.org/r/c/operations/puppet/+/942652
[07:52:30] Emperor: how are you, are you ok?
[07:55:45] Still fed up about the cancelled holiday; otherwise definitely getting better (mostly just a lingering cough now)
[07:56:08] Also, I finally figured out what SRE really means: Sysadmin Running Emacs ;-)
[07:57:44] I thought it meant the (Sucre, Bolivia) airport code, at least that's what marostegui told me!
[07:58:41] lol
[07:59:32] Emperor: I am sorry about your vacation, but happy it is not something major
[07:59:59] stupid Covid :(
[08:00:01] You got a couple of mentions in yesterday's meeting, feel free to ping me later on
[08:00:18] later as in, any time this or next week
[08:01:00] also, even if it is not anything major, shouldn't you be on sick leave?
[08:01:21] Nah, I'm well enough to work, I think
[08:02:58] jynus: Nothing obvious in the notes, so a brief summary would be helpful, or I'm speaking to kwaku.ofori tomorrow in any case
[08:03:12] let me pm you
[08:07:35] marostegui: sorry, I was asleep. yeah, I think the script did it.
[08:07:41] stevemunene: my apologies
[08:09:02] awight: if you put it on the stat machines in /srv/published/ or something like that, it'll end up in datasets
[08:09:23] I've done it multiple times
[08:42:36] all dumps succeeded - except es4 and es5, which are ongoing
[08:42:46] I am now going to restart the backup1- dbs
[09:16:24] relocating
[09:18:25] Is MariaDB 10.6.14 safe/wise to install or should not touch that (for backup1-)?
[09:18:31] *I
[09:53:27] PROBLEM - MariaDB sustained replica lag on s1 on db1132 is CRITICAL: 4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1132&var-port=9104
[09:54:45] RECOVERY - MariaDB sustained replica lag on s1 on db1132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1132&var-port=9104
[09:55:55] weird glitch
[10:00:35] jynus: 10.6.14 is safe
[10:00:55] thanks
[11:19:40] Emperor: I made a mistake and made many backups fail - I am going to fix this (swift access rate should not change much, but 404 ratio will)
[11:20:16] ack
[11:20:45] I need to change some code to prevent that from happening again
[11:59:29] given the background amount of 404s, it is not even noticeable on the graphs
[12:40:29] marostegui: okay if i schedule ten minutes of time to talk? i was thinking 100 minutes from now, so :20-:30 at the end of the day
[12:40:37] for the querysampler stuff
[12:42:29] dr0ptp4kt: I don't really know much about it. wmcs probably knows a lot more than I do, in fact I only know what you told me a few days ago :)
[12:45:08] yeah, i just mainly wanted to coordinate the id stuff (i think only you can help me there) and talk about a couple of approaches i had in mind.
[14:50:47] Emperor: I think I've asked you this before, but for the life of me I can't remember the specifics. What is the process for dealing with an rclone sync failure (https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=team%3Ddata-persistence)?
[14:52:29] The error: ... Failed to copy: failed to open source object: Object Not Found
[14:56:15] I assume "source" means the active data center; I guess this is a ghost object scenario? I also assume that it otherwise performed the sync (it ran for quite some time after the specific error). So... are there steps that should be taken for the reported errors? And presumably we `reset-failed` afterward, one way or another?
[15:14:51] urandom: depends, you can just a) wait 'til next week and see if it works next time b) re-run the sync by hand and see if that goes through c) try the ghost objects cookbook on the affected container(s); the problem is that rclone is inevitably racing with a bunch of other things (expiry, writes/deletes from mw, ...), and sometimes it'll lose
[15:15:21] yes, in that error it was trying to copy an object that was in the container listing (when rclone did the listing) but not present (when rclone tried to copy it)
[15:15:31] Gotcha.
[15:15:41] I'll reset-failed and have a look next week
[15:15:51] thanks.
[15:23:29] personally I have been delaying the priority of those as, AFAIK, it is just the catch-up, out-of-band method and I know it is an ongoing issue that happens, but correct me if I am wrong (it should normally not affect availability)
[15:23:40] delaying attending to them, I mean
[15:25:28] yeah, if it were persistently failing over a number of weeks it'd become more worrying
[15:25:40] Some day, this will all be done Differently & Better (TM)
[15:25:43] that was indeed my understanding, thanks
[15:45:20] jynus: I'm trying to do my part to improve the signal-to-noise ratio at alerts.w.o
[15:45:22] :)
[15:45:43] and I thank you for it!
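
For context on the race Emperor describes at 15:15:21 (an object present in the container listing but already deleted by the time the copy is attempted), the following is a minimal Python sketch of why a sync driven by a point-in-time listing can hit "Object Not Found" during the copy phase, and why skipping the missing object and letting the next scheduled run catch up is a reasonable response. This is an illustration only, not the actual rclone or WMF sync tooling; the function and class names (sync_container, fetch_object, ObjectGone, ...) are hypothetical.

from typing import Callable, Iterable


class ObjectGone(Exception):
    """Raised by fetch_object when the source object no longer exists."""


def sync_container(
    list_objects: Callable[[], Iterable[str]],
    fetch_object: Callable[[str], bytes],
    store_object: Callable[[str, bytes], None],
) -> list[str]:
    """Copy every listed object to the destination; skip (and report) objects
    that vanished between the listing and the copy instead of aborting."""
    skipped: list[str] = []
    for name in list_objects():
        try:
            data = fetch_object(name)
        except ObjectGone:
            # Equivalent of "failed to open source object: Object Not Found":
            # the listing was already stale when we tried to read the object,
            # e.g. because expiry or a MediaWiki delete won the race.
            skipped.append(name)
            continue
        store_object(name, data)
    return skipped


if __name__ == "__main__":
    # Simulate the race: the "source" loses an object after the listing
    # has been taken but before the copy loop reaches it.
    source = {"a.json": b"{}", "b.json": b"{}", "c.json": b"{}"}
    dest: dict[str, bytes] = {}

    stale_listing = list(source)   # point-in-time listing
    del source["b.json"]           # b.json is deleted before the copy runs

    def fetch(name: str) -> bytes:
        if name not in source:
            raise ObjectGone(name)
        return source[name]

    missed = sync_container(lambda: stale_listing, fetch, dest.__setitem__)
    print("copied:", sorted(dest), "skipped:", missed)   # skipped: ['b.json']

As the conversation notes, the practical operator response matches this behaviour: the failed copy is only the out-of-band catch-up path, so acknowledging the error (reset-failed) and letting the next run or the ghost objects cookbook reconcile the container is usually enough.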