[05:53:16] marostegui: morning
[05:54:31] morning
[05:55:13] ahh, you do have a secondary
[05:55:21] i wasn't sure, so i alarm-clocked just in case
[05:55:42] it is good to have more eyes
[05:56:11] i brought these ones: ಠ_ಠ
[06:06:01] marostegui: for future cal entries, could you put the phab ticket in the 'description' field instead of the title? that way it's clickable.
[06:06:12] ok
[06:06:58] ty <3
[07:28:47] How can I have so much email after an only slightly-long weekend? 😿
[07:52:00] Emperor: people Really missed you
[08:35:14] marostegui: just set db2093 to read_only (and deployed the monitoring change). nothing _should_ break, but just in case.
[08:35:25] oki
[08:36:52] icinga is green again, removing the DT.
[09:22:58] hey there, tegola is hitting a rate limit on requests to swift and maps is not happy - is this the right place to ask for an increase?
[09:24:51] it's certainly the right place to be disappointed
[09:24:53] Emperor: ^
[09:38:16] * Emperor doesn't know what tegola is, but would expect the rate limits to be sensibly-specified...
[09:39:29] godog will know more of the details, I expect
[09:39:44] I'm currently looking at what swiftrepl does and doesn't try to replicate
[09:40:15] tegola is our maps tile serving service
[09:40:45] it runs in k8s and currently is crashing because we're getting "ServiceUnavailable: Please reduce your request rate"
[09:42:42] afaict we're subject to the default ratelimits and temporarily increasing those would help us get out of the hole
[09:44:48] does this correspond to some change / is there some monitoring? looking at https://grafana.wikimedia.org/d/kcAMMw4Wk/maps-performances-filippo-t184942?orgId=1 (which I found by browsing, might not be correct) tiles requests basically stopped at around 5pm yesterday
[09:45:09] my prior would be that the rate-limiting is done above the swift layer
[09:45:29] We're trying to figure that drop out atm
[09:45:51] but the failure to serve started at 9:15 today which appears to be when the service itself started failing
[09:46:48] that could be a red herring though
[09:47:19] there's a related spike in requests to swift at the same time https://grafana.wikimedia.org/d/NDWQoBiGk/thanos-swift?orgId=1&from=1650273941957&to=1650360281957&viewPanel=24
[09:48:09] hi, checking too
[09:48:20] that spike starts when tegola starts doing a batch of pregeneration work
[09:49:24] hnowlan: what's the url(s) getting rate limited atm ?
[09:50:51] hnowlan: that shows an increase in requests yesterday, not today?
[09:51:03] (have to jump in a meeting in 10)
[09:51:46] Emperor: yep, I was wrong, service has been having issues since yesterday
[09:52:10] Apr 19 00:01:48 thanos-fe1001 proxy-server: ERROR 500 b'' From Container Server 10.64.16.177:6001/sdb3 (txn: tx7ec9f4f1ebf84f95a37d9-00625dfbec)
[09:52:14] not very enlightening
[09:53:23] Apr 19 00:03:37 thanos-be1002 container-server: ERROR __call__ error with PUT /sdb3/47330/AUTH_tegola/tegola-swift-container/eqiad-v0.0.1/osm/15/9700/21699 : [Errno 28] FALLOCATE_RESERVE fail 0.981685 <= 1 (txn: txcec5326dffe248d2ab66f-00625a3df7)
[09:54:15] godog: trying to figure out atm - something along the lines of /tegola-test-map/0/0/0
[09:54:53] some smartd notes on thanos-be1002
[09:55:35] hnowlan: ack, thanks
[09:55:48] Emperor: yeah that 500 and container-server error looks like a good lead to me
[09:55:50] Apr 19 09:36:06 thanos-be1002 smartd[42122]: Device: /dev/bus/0 [megaraid_disk_08] [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 84 to 68
[09:55:50] Apr 19 09:36:06 thanos-be1002 smartd[42122]: Device: /dev/bus/0 [megaraid_disk_09] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 76 to 77
[09:56:16] not really sure if that's noise or signal
[09:56:59] can be around for another 5 min, I'm seeing the same errors on thanos-be1001 container server but no obvious failures to disks
[10:00:34] So container-updater is also logging e.g.
[10:00:41] Apr 19 00:04:22 thanos-be1002 container-updater: Error processing container /srv/swift-storage/sdb3/containers/47330/963/b8e2680a04e5e21073a69d7f59366963/b8e2680a04e5e21073a69d7f59366963.db: [Errno 28] FALLOCATE_RESERVE fail 0.981685 <= 1: [Errno 28] FALLOCATE_RESERVE fail 0.981685 <= 1
[10:01:08] which is ENOSPC
[10:01:24] but none of the fs are >71% full
[10:03:54] Emperor: any chance the requested space is truly humongous?
[10:06:31] hnowlan might know (I don't think the logs expose that)
[10:08:04] "By default, fallocate_reserve is set to 1%. In the object server, this blocks PUT requests that would leave the free disk space below 1% of the disk. In the account and container servers, this blocks operations that will increase account or container database size once the free disk space falls below 1%."
[10:08:44] So if I'm reading the errors correctly, swift thinks these requests would do so [we don't set fallocate_reserve anywhere, so we get the default behaviour]
[10:09:14] Ah, and sdb3 is a small-fast disk
[10:09:22] /dev/sdb3 94G 47G 47G 50% /srv/swift-storage/sdb3
[10:09:44] Tegola afaik should not be making single large requests - many smaller ones, yes but nothing enormous
[10:09:55] Usually less than a megabyte
[10:09:57] but still, you'd need to be pushing ~46G to hit that
[10:10:15] plenty of inodes on the fs too
[10:10:27] Emperor: is there any way to query swift to confirm that it has defaulted to 1% for that variable?
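The FALLOCATE_RESERVE numbers above can be reproduced by hand. A minimal sketch of the kind of check the log message implies (free space after the requested allocation, as a percentage of the filesystem, compared against the default 1% reserve); this is not swift's actual fs_has_free_space code, just the arithmetic, and the mountpoint and allocation size are taken from the df output above:

```python
import os

def fallocate_reserve_check(mountpoint, request_bytes, reserve_percent=1.0):
    """Back-of-the-envelope version of the check behind the
    'FALLOCATE_RESERVE fail 0.98... <= 1' errors: would free space, after
    allocating request_bytes, drop to or below reserve_percent of the fs?"""
    st = os.statvfs(mountpoint)          # note: the mountpoint, not /dev/sdb3
    free = st.f_bavail * st.f_frsize     # bytes available to unprivileged users
    total = st.f_blocks * st.f_frsize    # total filesystem size in bytes
    pct_free_after = (free - request_bytes) * 100.0 / total
    return pct_free_after, pct_free_after <= reserve_percent

# e.g. a ~46G allocation on the half-full 94G sdb3 filesystem from the df above
pct, blocked = fallocate_reserve_check("/srv/swift-storage/sdb3", 46 * 1024**3)
print(f"{pct:.6f}% free after allocation, blocked by 1% reserve: {blocked}")
```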
[10:11:29] kormat: I don't know, but the log message is quite suggestive of aiming for 1%
[10:12:36] indeed
[10:18:16] also, UTSL suggests similar
[10:18:34] config_fallocate_value(conf.get('fallocate_reserve', '1%'))
[10:22:03] my current hunch is that the tegola container got quite big and its underlying db can't be allocated/copied again
[10:22:25] -rw------- 1 swift swift 47604858880 Apr 15 22:34 /srv/swift-storage/sda3/containers/47330/963/b8e2680a04e5e21073a69d7f59366963/b8e2680a04e5e21073a69d7f59366963.db
[10:22:39] on thanos-be1001 for example, I suspect that's the tegola container db
[10:22:56] Ah, I smell a different rat
[10:23:53] no, ignore me, that's wrong
[10:27:03] https://phabricator.wikimedia.org/P25307 is concerning, but in a different direction
[10:27:56] (that's the calculation done by fs_has_free_space , called by check_free_space in swift/container/server.py)
[10:28:27] Emperor: you probably want to use the mountpoint instead of the block device
[10:28:35] oh, duh, yes, I had a brain once
[10:29:22] (and the result is more sensible then, doh)
[10:31:29] hnowlan: were there recent changes to tegola that might result in a whole lot of new tiles/objects being created? I see the initial prefix was tegola-cache/osm/... and now it is eqiad-v0.0.1/osm/...
[10:31:35] godog: that would be consistent with all the errors coming from container-server rather than object-server
[10:33:07] godog: mbsantos or nemo-yiannis will know
[10:34:04] this sounds like a very old change
[10:34:09] let me check deployment charts
[10:34:34] thank you nemo-yiannis
[10:36:29] This was introduced a couple of months ago: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/759938
[10:37:34] mbsantos: I can't remember if we had live traffic going to `eqiad-v0.0.1` do you remember ?
[10:37:46] i think yes
[10:38:27] yes, eqiad should have been using this bucket at least for organic traffic for a while now
[10:39:09] in terms of getting back in service, how easy is it on your end to switch to a different bucket i.e. away from tegola-swift-container ?
[10:39:56] that seems the simplest option to me ATM, start writing to a new container/bucket
[10:40:14] Emperor: ^ what do you think ?
[10:40:28] presumably if there's a bunch of stuff under tegola-cache/osm, that could all be deleted?
[10:40:51] this means that we are going to run the service without any tiles pregenerated which might cause issues
[10:40:52] [I don't know how container db size changes with add/delete]
[10:41:08] also pregenerating all the tiles on each region will take some time
[10:41:17] (as in weeks)
[10:41:18] removing tegola-cache/osm looks good to me
[10:41:22] (as in days)
[10:41:58] yeah tegola-cache can be deleted i think
[10:42:24] mmhh I doubt the delete will be effective immediately in shrinking the db, since tombstones are written
[10:43:19] nemo-yiannis mbsantos ok, in that case it would make sense to disable caching on swift (saw your patches in -operations) then clean up the bucket
[10:44:30] or start pregen to a new bucket that is
[10:44:35] just for clarity, are we expecting files to be deleted after the cleanup?
[10:45:20] the tegola-cache/osm that can be deleted yes
[10:45:32] ok
[10:45:37] 👍
[10:46:31] ok
[10:49:33] To be clear - the plan is i) temporarily disable caching on swift ii) delete everything under tegola-cache/osm iii) re-enable caching? And who is doing the deletion thing (Not me :) )
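For step ii) of that plan, one way to enumerate and delete everything under the tegola-cache/osm prefix is python-swiftclient. A minimal sketch with placeholder auth details; as noted above, deletes can still be refused while the fallocate reserve is being hit, and tombstones mean the container db will not shrink immediately:

```python
from swiftclient.client import Connection

# auth details are placeholders for the AUTH_tegola account
conn = Connection(authurl="https://thanos-swift.example.org/auth/v1.0",
                  user="tegola:user", key="SECRET", retries=3)

container = "tegola-swift-container"
prefix = "tegola-cache/osm/"

# full_listing pages through the listing; for a prefix this large you would
# want to batch / parallelise rather than hold it all in memory
_headers, objects = conn.get_container(container, prefix=prefix, full_listing=True)
print(f"{len(objects)} objects under {prefix}")

for obj in objects:
    # each delete writes a tombstone row, so the container db grows before it
    # shrinks, which is exactly what the fallocate reserve was objecting to
    conn.delete_object(container, obj["name"])
```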
[10:50:19] [presumably with a side-order of "hope there is enough disk space left to actually delete things under tegola-cache/osm" :-/ ]
[10:51:05] yes that's correct Emperor
[10:51:31] although I am trying to delete e.g. one object under tegola-cache/osm and as expected it doesn't seem to work
[10:52:38] but yes first order of things is to get back in service even with caching temporarily disabled
[11:00:42] ok updated T306424
[11:00:43] T306424: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424
[11:09:42] 45G is an awful lot for a container db...
[11:10:16] how many objects are there in there? should it get sharded or somesuch?
[11:11:25] yeah ~200M objects AFAICS, definitely sharding is in order
[11:11:25] seeing recovery in tegola
[11:11:27] [and does it really copy-update-rename on the database file?]
[11:12:35] swift docs suggest O(10M) is where you start to see perf issues, this thing notwithstanding
[11:12:59] still a bit surprised it wants to allocate the entire DB again for a delete operation
[11:14:00] * jynus is surprised to see minio scaling better than swift
[11:15:43] ok since maps/tegola seems to be back I'm going to grab some lunch and will be back later for followups on the swift side
[11:23:42] just out of curiosity: can we temporarily use another bucket/container for the interim state just to get some caching
[11:23:44] ?
[11:56:42] nemo-yiannis: yes I think that'd be optimal
[11:58:00] can you create one for us with access to the same users so we can quickly add some caching at least to our current state?
[11:58:54] nemo-yiannis: for sure, what name would you like ?
[11:59:30] `tegola-swift-fallback`
[12:00:35] ok, sec
[12:02:10] nemo-yiannis: {{done}}
[12:02:16] thanks!
[12:25:46] godog: is that one sharded?
[12:26:02] godog: and any joy with removing things from the 200M-entry container?
[12:29:33] Emperor: not sharded no, we don't have the container-sharder deployed, and no joy yet with the delete straight away but I'm preparing a patch to disable fallocate temporarily
[12:29:44] Amir1: can you refresh www-data 7549 27.7 0.3 445560 260280 pts/1 S+ Mar31 7520:40 php -ddisplay_errors=On /srv/mediawiki/multiversion/MWScript.php maintenance/migrateLinksTable.php --wiki=frwiki --table templatelinks
[12:29:55] or does it refresh itself?
[12:37:41] worries me that the maps container use is only seemingly a factor of 2 away from breaking thanos-swift :-/
[12:42:33] godog: just a heads up we started using `tegola-swift-callback`
[12:42:38] *fallback
[12:44:13] nemo-yiannis: ack! thanks for the heads up
[12:44:40] Emperor: there's a bunch of followups to do for sure to ease the single container usage
[12:45:36] Amir1: No need to do it anymore, I ran the change manually
[12:46:55] godog: Mmm. I'm going to go back to looking at swiftrepl for now (since that's blocking bullseye upgrades), shout if you need more input from me?
[12:48:29] Emperor: I will! thank you
[12:48:47] Emperor: actually before you go, https://gerrit.wikimedia.org/r/c/operations/puppet/+/784250
[13:04:57] 👀
[13:13:19] So where I was before this blew up was looking at what containers swiftrepl does and doesn't replicate. It replicates 37979 out of 43012. Of the remaining 5303, nearly all are things containing 'local-temp' in the name. The four exceptions are: monitoring/ root/ wikipedia-commons-gwtoolset-metadata/ wikipedia-test-testing/
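Coming back to the container-size question above (roughly 200M objects in a ~45G sqlite db), a HEAD on the container returns the object count and byte usage swift tracks, without listing anything. A sketch with the same placeholder credentials as before:

```python
from swiftclient.client import Connection

# placeholder credentials, as in the earlier sketch
conn = Connection(authurl="https://thanos-swift.example.org/auth/v1.0",
                  user="tegola:user", key="SECRET")

headers = conn.head_container("tegola-swift-container")
objects = int(headers["x-container-object-count"])
bytes_used = int(headers["x-container-bytes-used"])
print(f"{objects} objects, {bytes_used / 1024**3:.1f} GiB reported by the container")

# the upstream guidance quoted above puts the pain threshold around O(10M)
if objects > 10_000_000:
    print("candidate for container sharding, or for splitting writes across containers")
```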
[13:15:42] I have two questions (probably for godog unless you want me to instead ask the original author ;-) ) - are we sure we don't want to be replicating wikipedia-commons-gwtoolset-metadata ? and that notwithstanding, presumably we still want to only continue to replicate a subset of the containers in ms ?
[13:16:29] (I'm wondering whether an exclusion-rule based approach might be better if we're not entirely sure we won't end up with odd missed containers in future)
[13:21:29] Emperor: yes a subset AIUI is what we want, I don't know what's up with wikipedia-commons-gwtoolset-metadata though e.g. if it is in active use or a relic
[13:22:43] asking mediawiki for the list of containers is probably the most future proof solution I can think of, though I don't know what that entails in practice
[13:26:40] godog: I'm blocking on os upgrades with this, so I'd rather not try and make the perfect the enemy of the good IYSWIM; rclone lsd suggests it's in current use
[13:26:46] 4649205964 2022-04-19 13:26:07 2982 wikipedia-commons-gwtoolset-metadata
[13:27:17] oh, except they all have that date.
[13:33:28] looking in the contents, there are elements from this year (only a few mind), so I think it should be replicated. Any idea who I would ask about what this is for, though, rather than just "it seems to be in use..."?
[13:35:14] (4 entries out of 2982)
[13:36:01] Emperor: no idea who to ask tbh, and +1 to a good tradeoff for now. Perhaps Amir1 might know who to poke, I'd say it does need to be replicated
[13:36:46] It's presumably related to https://www.mediawiki.org/wiki/Extension:GWToolset
[13:39:06] Sorry I'm slowly waking up
[13:42:22] godog: Tegola is currently using the fallback container in both eqiad and codfw k8s deployments. Codfw looks like it's working OK but eqiad cache operations look like they have a lot of latency
[13:42:26] Is this expected ?
[13:53:22] nemo-yiannis: I'd expect performance to be basically the same in eqiad and codfw, I'm deploying a change to swift to be able to delete old tegola objects tho, there might be further latency spikes as the roll restarts happen
[13:53:31] ok
[13:53:33] good to know
[13:53:45] hnowlan: ^
[13:53:51] nemo-yiannis: which dashboard(s) are you looking at ?
[13:54:45] https://grafana.wikimedia.org/goto/2xLrwWQnz?orgId=1
[13:54:51] this is the app metrics for tegola
[13:56:57] ack, thanks
[13:59:32] Amir1: maintenance/migrateLinksTable.php - is that your script?
[13:59:46] kormat: yup, which section?
[14:00:11] I need to make it update from time to time
[14:00:32] Amir1: s2
[14:00:57] host db1129
[14:01:09] kormat: it should be reset now
[14:01:26] Amir1: cheers
[14:09:50] Amir1: taavi says that toolsdb replication is currently not working, which bodes ill for the upcoming move. Can you investigate with him? I have to go into a meeting in just a moment.
[14:10:20] sure, what is the db, etc.?
[14:10:39] master is clouddb1001.clouddb-services.eqiad1.wikimedia.cloud, replica is clouddb1002.clouddb-services.eqiad1.wikimedia.cloud
[14:13:03] andrewbogott: can I be added to the project? I can't login there
[14:13:15] yep, shell name 'ladsgroup'?
[14:14:34] Amir1: ^ ?
[14:14:42] yup
[14:14:45] ok, done
[14:14:51] Thanks
[14:17:46] andrewbogott: maybe I need to give it time but I can't login and it's not showing up in https://openstack-browser.toolforge.org/project/clouddb-services
[14:20:00] Amir1: how about now?
[14:20:01] try now? I just tried to flush the caches?
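On the exclusion-rule idea above: instead of an allow-list, a swiftrepl-style job could replicate everything except containers matching a few patterns. An illustrative sketch only; the single 'local-temp' pattern comes from the counts above, and the example container names are made up:

```python
import fnmatch

# only the 'local-temp' pattern comes from the discussion above; anything
# added here would need the same kind of review as gwtoolset-metadata got
EXCLUDE_PATTERNS = ["*local-temp*"]

def containers_to_replicate(all_containers, exclude=EXCLUDE_PATTERNS):
    """Split a container listing into (replicate, skip) using exclusion rules,
    so newly created containers are replicated by default."""
    keep, skipped = [], []
    for name in all_containers:
        if any(fnmatch.fnmatch(name, pat) for pat in exclude):
            skipped.append(name)
        else:
            keep.append(name)
    return keep, skipped

# made-up example names
keep, skipped = containers_to_replicate([
    "wikipedia-commons-local-public.aa",
    "wikipedia-commons-local-temp.aa",       # skipped by the rule
    "wikipedia-commons-gwtoolset-metadata",  # replicated by default
])
print(len(keep), "to replicate,", len(skipped), "skipped")
```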
[14:20:25] I can login now
[14:20:26] thanks
[14:20:54] godog: FWIW, the two versions of wikipedia-commons-gwtoolset-metadata are mostly in-sync - rclone would copy 4 files eqiad->codfw (and likewise delete them if run codfw->eqiad)
[14:22:09] Emperor: ack! thanks that's good to know
[14:23:28] taavi: the read master log position is advancing in 1002: 51414697
[14:24:08] Amir1: is the 'Last_Error: Could not execute Update_rows_v1 event on table s54518__mw.online; Can't find record in 'online', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log log.282696, end_log_pos 68409765' unrelated then?
[14:24:47] and where are you seeing that error?
[14:24:58] you can also simply make it skip a transaction
[14:25:59] `SHOW SLAVE STATUS\g` on clouddb1002
[14:28:13] taavi: I can't say for sure but that didn't seem to have caused replication to stop. I can make a test change and see if it replicates
[14:28:28] have you tried that?
[14:29:39] not yet, let me try it
[14:31:36] hmm, I created a db called foooooo and it's not replicated
[14:31:55] nope
[14:34:30] I set the Skip_Counter to 1
[14:34:36] https://dev.mysql.com/doc/refman/8.0/en/replication-administration-skip.html#replication-administration-skip-nogtid
[14:34:52] I have a feeling it has a lot to catch up
[14:35:01] it's advancing but waaay behind
[14:37:17] :/
[14:39:09] aah, I think I found it
[14:39:22] let me see if it finally catches up
[14:41:12] do you know what went wrong with it?
[14:43:10] taavi: the data didn't match in that particular db/table and since it's row based replication (which I think should be SBR instead), it broke replication
[14:43:28] I just made it skip that transaction and restarted replication
[14:44:30] the other mistake was that I was looking at Read_Master_Log_Pos and thinking the replication is happening but I should have looked at Exec_Master_Log_Pos (I need coffee)
[14:47:19] Is it possible to come up with an ETA for when it will catch up? I'm nervous about shutting down the primary if the secondary isn't up to date
[14:47:35] We could also switch the primary to read-only for a bit before the downtime in order to help with replication
[14:47:59] it's really hard, depends on what is being passed, I see it choke multiple times
[14:48:06] (in large transactions)
[14:48:16] making it read-only would definitely help
[14:54:19] OK, I'm out of my meeting now. Can I get some idea of the scope of what is/isn't replicated? Is it only one table that's behind, or a whole lot of things?
[14:55:35] I'm not following, currently it's just behind from what I'm seeing
[14:55:53] and it's serial so any change in the past ten minutes or so
[14:56:04] ok, so we're minutes behind, not hours or days?
[14:58:30] Amir1: ?
[14:58:42] I don't think so, at least from the log position and log file
[14:58:45] but I can be wrong
[14:58:55] how long was the replication broken?
[14:59:10] No idea
[14:59:44] estimated time? Days, months?
[14:59:47] Amir1: the question 'so we're minutes behind, not hours or days' was not a yes or no question so I'm not sure what 'I don't think so' means
[15:00:20] I meant I don't think we are days behind
[15:00:25] oh good :)
[15:00:42] OK, I'm going to send an email saying that the window is starting. Then let's switch to read/only and see if it catches up.
[15:00:49] Do you know how to do the read-only switch?
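A sketch of the replica check being described above: read SHOW SLAVE STATUS and compare Exec_Master_Log_Pos (what the SQL thread has applied) with Read_Master_Log_Pos (what the IO thread has fetched), which is the distinction that caused the confusion. pymysql and the credentials are assumptions; skipping the broken event stays the manual step it was:

```python
import pymysql

# hostname from the log; the user/password are placeholders
conn = pymysql.connect(host="clouddb1002.clouddb-services.eqiad1.wikimedia.cloud",
                       user="repl_check", password="SECRET",
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    s = cur.fetchone()

print("IO thread fetched up to:  ", s["Master_Log_File"], s["Read_Master_Log_Pos"])
print("SQL thread executed up to:", s["Relay_Master_Log_File"], s["Exec_Master_Log_Pos"])
print("Seconds_Behind_Master:    ", s["Seconds_Behind_Master"])
if s["Last_Errno"]:
    # a non-zero Last_Errno (1032 above) means the SQL thread has stopped;
    # the fix here was a manual sql_slave_skip_counter=1 plus START SLAVE
    print("replication broken:", s["Last_Error"])
```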
[15:01:05] andrewbogott: ClouddbServicesToolsDBReplication first fired at 7am from what i can see
[15:01:11] 9 hours ago
[15:01:22] FLUSH TABLES WITH READ LOCK;
[15:01:22] SET GLOBAL read_only = 1;
[15:01:29] let me know when I can run it
[15:01:43] Amir1: go ahead and do it now, we're in the outage window. Thanks!
[15:02:28] started
[15:02:41] the first command is taking long
[15:03:05] Seems reasonable, it's likely very busy
[15:07:26] clouddb1002 slave status shows this btw: 'Seconds_Behind_Master: 37080'
[15:08:36] ok, let's hope that catches up fast once we're r/o
[15:09:12] last time we had replication broken it was catching up roughly real time (1 second per second), but that was without r/o
[15:09:24] yeah, it looks like 1:1 right now too :(
[15:10:03] Amir1: are we r/o now or is it still hanging on the flush?
[15:10:17] no
[15:11:12] the master pos is not advancing as far as I see so it's RO
[15:11:29] 1:1 is really bad tbh
[15:11:44] it should be something like 1:5, at worst 1:2
[15:12:09] I'm trying to determine the rate now...
[15:12:21] Amir1: it would actually advance at something like 2:1, if it wasn't because we skip some heavy write dbs
[15:14:39] jynus: es?
[15:14:44] It's currently worse than one to one. It advanced 35 seconds in 1:42
[15:15:19] Amir1: sorry, I didn't get the "es" part
[15:15:21] it might be that there is a massive write
[15:15:28] jynus: External Storage
[15:15:44] andrewbogott: from what I'm seeing it's gonna take a while
[15:15:50] no, I am talking about toolsdb
[15:15:55] toolsdb uses local storage
[15:16:10] I mean, if the replication rate is less than 1:1 then it will never catch up, just be more behind the longer we wait
[15:16:12] in r/w
[15:16:22] yeah
[15:16:50] context is: https://phabricator.wikimedia.org/T127164
[15:17:03] ratio was less than 1:1- it would never catch up
[15:17:27] the thing is from what I'm seeing the read of binlog is done, the files have been moved but it has not got to execute them
[15:17:33] it was achieved to be higher than 1:1 after 3 dbs were filtered out, with consent and awareness from users
[15:17:52] so I think shutting down and re-imaging the db would be fine?
[15:18:00] yeah, that is normal behaviour if there is a long running transaction
[15:18:10] e.g. an import
[15:18:28] ok, so we need to make a decision now. We can delay this upgrade window by 2 days and hope that it's caught up by Thursday. Or we can just cross our fingers and upgrade the HV without up-to-date replication. OR, we can
[15:20:33] andrewbogott: so what if you just show them the replica? From what I'm seeing the data is moved over, just not replayed yet
[15:20:45] i.e. how terrible would that be
[15:21:19] Failover is difficult -- also, I'm not clear on how/what that helps with
[15:22:06] I mean you just go ahead with the reimage as the data is moved over to the replica, just not executed yet
[15:22:55] So you're saying that we wouldn't actually lose data since the replica can continue to catch up after the primary is down? (due to just needing to execute things?)
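The catch-up arithmetic above ("advanced 35 seconds in 1:42") generalises: two lag samples give a catch-up ratio, and only a ratio above 1:1 yields a finite ETA. A sketch of just that arithmetic:

```python
def catch_up_eta(lag_then, lag_now, wall_seconds):
    """Given replica lag (seconds) at two points wall_seconds apart, return
    (catch_up_ratio, eta_seconds); a ratio <= 1.0 means it never catches up."""
    replayed = (lag_then - lag_now) + wall_seconds  # master-seconds applied
    ratio = replayed / wall_seconds                 # 1:1 == 1.0
    if ratio <= 1.0:
        return ratio, None
    return ratio, lag_now / (ratio - 1.0)

# the example above: 35 master-seconds replayed in 1:42 of wall clock,
# i.e. the replica fell a further 67 seconds behind
ratio, eta = catch_up_eta(37080, 37080 + (102 - 35), 102)
print(f"catch-up ratio {ratio:.2f}:1")
print("ETA:", f"{eta / 3600:.1f}h" if eta else "never at this rate")
```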
[15:23:45] yeah but I need to double check
[15:24:06] If that turns out to be right then that would be great, could move ahead
[15:24:15] my suggestion, first shut down mysql in master and see if the replication advances
[15:24:28] replag is increasing now :(
[15:24:36] andrewbogott: yes, in theory the replica can replicate now without the primary if all binlogs have been sent already
[15:24:41] I don't even see how that's possible with a r/o primary
[15:25:24] you really should tell some users to write this heavily :/
[15:25:27] as long as the io_thread is up to date with current master position, the sql_thread can run independently
[15:25:37] I can't say SBR is faster but it might help?
[15:25:48] Amir1: did you confirm that binlogs are all copied?
[15:25:53] Amir1: SBR broke replication every week
[15:25:59] lovely
[15:26:05] users don't do safe queries
[15:26:09] we NEED row replication
[15:26:20] note they mostly use MyISAM
[15:26:36] or a significant number of them do
[15:26:55] andrewbogott: I can't find the binlog files in the replica but Read_Master_Log_Pos is updated
[15:27:08] maybe mysql stores them somewhere else
[15:27:16] ok
[15:27:28] Amir1: I seem to remember it was in a binlog dir in a different location
[15:27:53] I did a find on /srv/
[15:28:00] nothing showed up
[15:28:04] I think Yuvi set them like that, different from how production was
[15:28:14] I'm going to stop the mariadb server on that host and then shut down clouddb1001
[15:28:18] Unless someone tells me not to :)
[15:28:35] andrewbogott: my only request, wait a bit after shutting down mysql
[15:28:47] so I can check if the replication is advancing
[15:29:02] Amir1: /srv/labsdb/binlogs I think
[15:29:07] Amir1: clouddb1002:/srv/labsdb/binlogs
[15:29:14] ok, it's stopped
[15:29:35] lmk when you're ready for me to shut down that server
[15:29:54] the relay logs are still on: /srv/labsdb/data/clouddb1002-relay*
[15:30:07] the replication is advancing
[15:30:28] andrewbogott: the files in clouddb1001 won't be deleted, just in case?
[15:30:56] Amir1: If all goes well the VM won't be damaged in the operation
[15:31:42] let me copy paste the binlogs to my home dir, just to be safe
[15:32:09] Amir1: AFAICS, everything looks good- except probably we are updating a table without primary keys
[15:32:12] your home dir on clouddb1001? That's on the same storage as the db
[15:32:24] which may take forever
[15:32:40] jynus: sigh
[15:32:49] andrewbogott: ok then
[15:33:02] ready for me to stop the VM?
[15:33:29] go ahead
[15:35:14] the other thing I mentioned to dcaro in the past is enabling gtid on the replica would help against corruption
[15:35:22] ok! HV reimage is starting
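On the table-without-primary-keys point above: with row-based replication the replica may have to scan such a table for every row it applies, which is why it "may take forever". A sketch of an information_schema query to find them (placeholder credentials; a NOT NULL unique key would also help, this is only a first pass):

```python
import pymysql

NO_PK_QUERY = """
SELECT t.table_schema, t.table_name, t.engine
FROM information_schema.tables AS t
LEFT JOIN information_schema.statistics AS s
  ON s.table_schema = t.table_schema
 AND s.table_name   = t.table_name
 AND s.index_name   = 'PRIMARY'
WHERE t.table_type = 'BASE TABLE'
  AND s.index_name IS NULL
  AND t.table_schema NOT IN
      ('mysql', 'information_schema', 'performance_schema', 'sys')
"""

# placeholder credentials for the toolsdb primary
conn = pymysql.connect(host="clouddb1001.clouddb-services.eqiad1.wikimedia.cloud",
                       user="audit", password="SECRET")
with conn.cursor() as cur:
    cur.execute(NO_PK_QUERY)
    for schema, table, engine in cur.fetchall():
        # row-based events against these may scan the whole table on the replica
        print(f"{schema}.{table} ({engine}) has no PRIMARY KEY")
```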
[15:35:28] as it enables innodb-based replication control
[15:35:40] will take 20 minutes or so
[15:35:41] probably the breakage comes from a past crash on the replica
[15:35:56] and that would minimize it- but not sure it will work well with replication filters
[15:36:04] I honestly think first we need to find a way to at least make the biggest users do something better
[15:36:08] I definitely stopped the replica yesterday but as far as I know it was graceful
[15:36:24] Amir1: ideally, split the db into smaller chunks
[15:36:34] once the replication is in a healthier state, you can add more checks on it
[15:36:40] This needs an almost total redesign, possibly with a private database server for each heavy user
[15:36:41] as in, several instances in parallel
[15:36:49] we have some (long-term) plans for doing that with openstack trove, but that's blocked on several different things atm :(
[15:36:52] but who can blame cloud, when they are already understaffed
[15:37:12] yeah
[15:37:18] Now I have to worry about whether or not pxe-booting works :)
[15:37:46] I was wondering if the heavy users can use some DB expertise to redesign their system. I'm sure most of them don't really want to bring clouddb to its knees
[15:37:58] andrewbogott: lol fun stuff
[15:38:07] well, moving out the heavy hitters from the task should be the priority
[15:38:23] that way you can have normal replication, without filters
[15:38:33] \o/ pxe worked
[15:38:48] I think T236101 is already filed about that
[15:38:49] T236101: Find a way to remove non-replicated tables from ToolsDB - https://phabricator.wikimedia.org/T236101
[15:39:00] btw, the replication log file is at ...700, need to reach ...742
[15:40:31] I confirmed they are long running transactions- or poor replicating ones (what I mentioned about lack of pks)
[15:40:53] as exec coord jumped like 1 million positions in a blink of an eye
[15:47:19] 1019 is puppetizing now, which takes forever but is a good sign
[15:54:15] Emperor: godog I was looking for this. GWT is unmaintained and should be undeployed. Here is the ticket T270911 but I don't know why it's stuck. OTOH, it's probably used by users until we remove it so that file is probably used. OTOOH (I have three hands), I don't understand why on earth this needs a file in swift, it probably can be replaced with a redis lockmanager or any other sort of lock manager we have
[15:54:16] T270911: Remove GWToolset extension from Wikimedia Commons - https://phabricator.wikimedia.org/T270911
[15:55:01] HTH
[15:58:05] If it's a hassle to replicate it, let me know and I can spend some time making it not use swift but if it's easy, I suggest replicating
[16:00:38] Amir1: thank you <3 <3 super useful, agreed might as well replicate for now (I think, not necessarily my call)
[16:03:34] the code for it is quite weird and complicated and I have never done work on it
[16:08:53] clouddb1001 is back up! Going to start mariadb
[16:11:39] Amir1: can we switch that back to r/w? Or are there other things you'd like to chase down with replication?
[16:11:49] 1002 says 'Seconds_Behind_Master: 33938'
[16:12:00] which is a little bit better
[16:12:19] although now it seems to be increasing again :/
[16:13:53] andrewbogott: sorry, in the middle of an outage :(
[16:13:58] the replication is catching up
[16:14:36] Amir1: can you switch it back to r/w?
[16:14:56] andrewbogott: when needed? yes
[16:15:16] I mean, can we do that now? so that tools start working again?
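On the GTID point above (see also T301993 later in the log): on MariaDB the switch is made on the replica with CHANGE MASTER TO master_use_gtid, which can track the replication position transactionally, the "innodb-based replication control" mentioned here. A sketch with placeholder credentials; as noted above, the interaction with the replication filters needs checking first:

```python
import pymysql

# placeholder credentials for the replica
conn = pymysql.connect(host="clouddb1002.clouddb-services.eqiad1.wikimedia.cloud",
                       user="repl_admin", password="SECRET",
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    print("Using_Gtid:", status.get("Using_Gtid"))   # MariaDB: No / Slave_Pos / Current_Pos
    cur.execute("SELECT @@gtid_slave_pos")
    print("gtid_slave_pos:", cur.fetchone())

    if status.get("Using_Gtid") == "No":
        # MariaDB syntax; only run after confirming behaviour with the
        # replication filters in place on this replica
        cur.execute("STOP SLAVE")
        cur.execute("CHANGE MASTER TO master_use_gtid = slave_pos")
        cur.execute("START SLAVE")
```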
[16:17:13] replag doesn't affect users but r/o sure does
[16:18:49] jynus: a final check re https://phabricator.wikimedia.org/T304461, is it ok if i run the script now from backup POV?
[16:18:59] urbanecm: sure
[16:19:04] go ahead
[16:19:06] thanks!
[16:19:08] * andrewbogott runs 'SET GLOBAL read_only = 0;'
[16:19:14] * urbanecm hides
[16:22:01] andrewbogott: yeah, that'd work
[16:22:07] sorry. We are having an outage atm
[16:24:12] Understood :) Things seem to be back to the not-great-but-intended state that they were in before the upgrade, so I think you can stand down. Thank you for your support!
[16:40:42] andrewbogott: thanks. fwiw, it's advancing and the lag is decreasing
[16:40:56] that seems good
[16:42:16] did we have any other immediate followups other than gtid (T301993)?
[16:42:19] T301993: [toolsdb] Enable gtid to help replication recovery - https://phabricator.wikimedia.org/T301993
[16:42:57] taavi: I don't think so -- we didn't learn anything bad about our system today that we didn't already know.
[16:43:39] (thanks to you, taavi, for noticing that replication issue before upgrade)
[16:47:07] taavi: probably checking if there are any myisam tables still left and if so, try to check if they can be migrated to innodb or if there's a specific reason why they are on myisam
[16:50:11] taavi: replication should be alerted on- maybe not paged for this specific service
[16:50:34] but it should be noted more proactively (maybe it is already an alarm for cloud, just I don't notice it)
[16:54:36] That's a good point -- taavi did you notice because of an alert email or because you were looking at a dashboard?
[16:55:24] I was looking at https://grafana-labs.wikimedia.org/d/000000273/tools-mariadb?orgId=1&var-dc=Tools%20Prometheus&var-server=clouddb1002.clouddb-services.eqiad.wmflabs&var-port=9104
[16:56:18] I threw together a quick prometheus alert based on `mysql_slave_status_last_errno` this morning, but I'm sure it could be improved somehow
[16:57:12] something is a lot better than nothing!
[16:58:15] do I have a patch about that in my ever-growing tbr list?
[16:58:36] no, that's in the metricsinfra instance where I have access to do it myself
[16:59:26] I guess the main problem with that is you didn't notice it, it only does irc alerts to -feed and emails (it's in the cloud realm so it can't send actual pages)
[16:59:54] an email should've been good enough, let me see if I got it...
[17:00:31] I see "[FIRING:1] ClouddbServicesToolsDBReplication ..." on cloud-admin-feed@
[17:01:07] yep, got it and it got filtered into a place i don't look very often. Will fix that on my end.
[17:09:59] filed a few tasks for follow-up: T306453 T306455 thank you!
[17:09:59] T306455: toolsdb: reduce number of myisam tables - https://phabricator.wikimedia.org/T306455
[17:10:00] T306453: toolsdb: review alerting - https://phabricator.wikimedia.org/T306453
[17:16:52] taavi- check production repo- there are some quite reusable alerting options
[17:18:21] (the dbas have memory, read only, event, disk space, replication (including heartbeat) alerts ready to use
[17:19:19] will definitely check those out, thanks! the problem is that we currently don't have icinga on the cloud realm, so we'll need to rely on the information in prometheus
[17:19:43] that is ok, some are prometheus based
[17:20:03] and some are easily changeable
[19:31:22] Looks like the annual survey review overlaps our weekly meeting. I'm going to delay the start of our weekly if no one objects.
[19:37:49] wrong channel sorry ^
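For the T306455 follow-up above (reduce the number of MyISAM tables), the same information_schema approach gives a per-schema starting point. A sketch with placeholder credentials:

```python
import pymysql

MYISAM_QUERY = """
SELECT table_schema,
       COUNT(*) AS tables,
       ROUND(SUM(data_length + index_length) / 1024 / 1024) AS mb
FROM information_schema.tables
WHERE engine = 'MyISAM'
  AND table_schema NOT IN
      ('mysql', 'information_schema', 'performance_schema', 'sys')
GROUP BY table_schema
ORDER BY tables DESC
"""

# placeholder credentials for the toolsdb primary
conn = pymysql.connect(host="clouddb1001.clouddb-services.eqiad1.wikimedia.cloud",
                       user="audit", password="SECRET")
with conn.cursor() as cur:
    cur.execute(MYISAM_QUERY)
    for schema, tables, mb in cur.fetchall():
        # candidates for ALTER TABLE ... ENGINE=InnoDB, unless the tool
        # has a specific reason to stay on MyISAM
        print(f"{schema}: {tables} MyISAM tables, ~{mb} MB")
```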