[05:03:24] https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance
[05:25:14] cool
[05:49:14] nice work
[05:55:42] I'll write something to ops@ later
[06:45:36] Amir1: https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance the entry for codfw is showing the wrong list, as it has s1, s7 and s8 but pointing to the same task (which is about s8)
[06:46:11] marostegui: that is actually correct
[06:46:22] when you upgrade the host, you have to depool all instances
[06:46:44] which means you depool in s2/s1/s...
[06:46:51] ah, it takes the multi-instance hosts into account?
[06:47:45] yup
[06:47:51] ah ok cool
[07:11:45] Re: new packages, are they bullseye only?
[07:11:56] yes
[07:12:01] you want me to do it for buster too?
[07:12:13] for now I just wanted to know :-)
[07:12:23] oki, if you need them just let me know
[07:12:46] as long as you are ok with waiting a few weeks for source upgrades, it should be ok
[07:12:53] yeah, absolutely
[07:17:02] marostegui: I think this forces us to wake up sooner so we can call dibs on sections before other DBAs
[08:17:39] I changed the grants of prometheus on localhost on a big set of dbs (tested on a small set first), let me know if things go bad
[08:28:44] We have our first 10.6 host, the testing slave: https://phabricator.wikimedia.org/T301879#7727116
[08:29:01] \o/
[08:33:54] so how did compression get affected - is it uncompressed, or does it convert to the new system automatically?
[08:35:05] ah, I just read "This plan has been scrapped, and from MariaDB 10.6.6, COMPRESSED tables are no longer read-only by default."
[08:35:25] so they undid the change, interesting
[08:35:43] jynus: yeah, that has changed after the feedback at: https://jira.mariadb.org/browse/MDEV-22367
[08:36:13] jynus: I provided quite a bunch of reasons why it wouldn't work for us
[08:36:52] but do you know what the state is - is it backwards compatible but mostly unsupported and recommended to convert? do you know the details?
[08:37:22] no, as far as I know the main reason to remove support (a bug) was fixed, so things should be as they are now
[08:37:51] https://jira.mariadb.org/browse/MDEV-22367?focusedCommentId=213009&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-213009
[08:39:29] still, on the wiki they call page compression "superior"
[08:41:16] (PrometheusMysqldExporterFailed) firing: (7) Prometheus-mysqld-exporter failed (db1154:13311) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[08:41:17] for context, my worry is making sure backups are in the most compatible format for use in the future - so please tell me if there are any news or changes in what wmf dbs should do
[08:41:35] (in the future)
[08:42:13] yeah, will do, no worries
[08:42:32] but cool, it is not an immediate worry!
[08:46:16] (PrometheusMysqldExporterFailed) firing: (7) Prometheus-mysqld-exporter failed (db1154:13311) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[08:50:46] it is weird, otrs backups on eqiad are taking a lot of time, will keep it monitored but give it a few more hours, as things seem to be happening
[08:51:16] (PrometheusMysqldExporterFailed) firing: (67) Prometheus-mysqld-exporter failed (db1099:13311) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[08:52:10] Amir1: ^ related to the grants removal perhaps?
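For context on the grants the exporter relies on: a quick way to see what the monitoring user ended up with after a grants change, plus the baseline privileges the upstream prometheus mysqld_exporter documentation suggests. This is only a sketch - the 'prometheus'@'localhost' user name is taken from the conversation, and the actual WMF grant set (and authentication method, which may rely on a unix socket) is not shown in the log and may differ.

    -- Inspect the current privileges of the monitoring user:
    SHOW GRANTS FOR 'prometheus'@'localhost';
    -- Baseline from the upstream mysqld_exporter docs, for comparison only;
    -- not necessarily the grant set used on WMF hosts:
    GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'prometheus'@'localhost';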
[08:52:22] I'll check
[08:52:26] very very likely
[08:53:29] marostegui: it should be recovered after I moved everything back
[08:53:40] and grafana is saying that it's recovered
[08:54:48] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1099&var-port=13311
[08:59:22] and it broke again :/
[08:59:28] I didn't change anything
[09:00:00] aah, false alarm, I'm actually stopping that host for the bullseye upgrade
[09:01:16] (PrometheusMysqldExporterFailed) resolved: (2) Prometheus-mysqld-exporter failed (db1099:13311) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[09:01:37] I resolved it, I think it will recover, if not I will look into it
[09:01:45] yeah, I saw a few complaints at the end of https://grafana.wikimedia.org/d/000000278/mysql-aggregated but they are only worth looking into if they are there for a long time (otherwise, they may be just temporary)
[09:02:07] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1101&var-port=13318
[09:47:19] <_joe_> good morning folks
[09:47:35] <_joe_> I'd like to get an ETA on the removal of the blockers for upgrading cumin1001
[09:47:42] <_joe_> because now that's blocking my OKRs
[09:48:03] _joe_: Is the blocker migrating to bullseye?
[09:48:08] I don't recall them
[09:48:08] <_joe_> yes
[09:48:10] <_joe_> T276589
[09:48:11] T276589: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589
[09:48:25] <_joe_> marostegui: one thing was tendril, which is gone
[09:49:01] _joe_: So https://phabricator.wikimedia.org/T298585 is the main task for our migration to bullseye
[09:49:14] We are going at a good pace, but we still have to switch all the masters
[09:49:44] <_joe_> marostegui: sorry, I'm talking about the cumin host
[09:49:47] <_joe_> not the databases
[09:50:09] _joe_: But migrating cumin to bullseye is blocked on us migrating the databases?
[09:50:25] is it?
[09:50:28] <_joe_> wat?
[09:50:33] <_joe_> that makes zero sense
[09:50:44] I don't know, I am asking XD
[09:50:49] I don't know what the blockers are
[09:51:09] https://phabricator.wikimedia.org/T276589#7420124
[09:51:11] <_joe_> yeah, not the upgrades of the databases to bullseye
[09:51:15] I don't know if that happened already or not
[09:51:45] <_joe_> I assume it hasn't then
[09:51:52] <_joe_> can I help?
[09:52:05] <_joe_> I really got to get the kubernetes library for cookbooks merged
[09:52:51] <_joe_> marostegui: to be clear, this is a question to the team, not to you
[09:53:04] ok
[09:53:28] I don't know what the status of that is, sorry!
[09:58:05] on my side, we have to test db backups on bullseye, which is happening this quarter
[09:58:18] but I expect no issues from that
[09:59:46] <_joe_> ok, given it's been what, 1 year that all of SRE is held back by this process, can we upgrade cumin1001 to bullseye and just leave cumin2001 on buster for the db tooling while you get to the bottom of it?
[10:01:06] worst case scenario we could also split the functionality across different hosts, but that would be quite a pain to set up and maintain
[10:01:39] <_joe_> yeah, no.
[10:01:53] I think for coordination you should talk to the managers, they decide what we focus on each quarter
[10:02:05] and given we have a shared one, that should be easy :-D
[10:02:13] <_joe_> jynus: 1) no we don't
[10:02:30] <_joe_> 2) please don't answer technical questions with "talk to managers", thanks
[10:03:18] <_joe_> anyways, sure I will, don't worry, talk to the managers
[10:06:25] _joe_: hi, i'm probably the one you want to talk to. (well, "want" is a strong word)
[10:07:56] <_joe_> kormat: ok, so my 2 questions are - 1) when can we upgrade the cumin hosts to bullseye
[10:08:04] or, at least, i'm responsible for supporting wmfmariadbpy + wmfdb, which are part of our db tooling. the other parts, wmfbackups/transferpy, are managed by jynus.
[10:08:25] <_joe_> 2) if that's later than "in a couple weeks", can we upgrade 1001 in the meantime and leave just 2001 on buster?
[10:09:25] _joe_: i'll need to check with lukasz about the priority ordering of stuff, but assuming he's ok with it, we can probably do 'in a couple of weeks'
[10:11:34] <_joe_> kormat: an update on T276589 would be really appreciated once you have an answer
[10:11:35] T276589: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589
[10:11:41] <_joe_> I'm ok with both options btw
[10:11:51] _joe_: understood
[10:17:41] <_joe_> thanks!
[10:30:45] marostegui, Amir1: there's a job running on mwmaint1002 against db1110 (s5) that's been running since feb 16th.
[10:30:50] `extensions/FlaggedRevs/maintenance/pruneRevData.php --wiki=dewiki`
[10:30:58] this seems like the sort of thing that's unlikely to finish soon?
[10:31:02] that's Amir1's clean-up, I believe
[10:31:07] yeah
[10:31:17] let me check, it should have some sort of restarts
[10:31:57] gone now, I'll make it loop
[10:32:59] great, thanks
[11:30:33] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 30 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[11:31:47] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[12:55:35] Amir1: your favourite script is holding a connection open to db1127 (s7) now :)
[14:12:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (db2093:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:14:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (matomo1002:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:17:16] (PrometheusMysqldExporterFailed) firing: (5) Prometheus-mysqld-exporter failed (es1022:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:20:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-test-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:22:16] (PrometheusMysqldExporterFailed) firing: (8) Prometheus-mysqld-exporter failed (es1021:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:25:16] (PrometheusMysqldExporterFailed) firing: (2) Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:27:16] (PrometheusMysqldExporterFailed) firing: (11) Prometheus-mysqld-exporter failed (es1020:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:29:16] kormat: ^ these look related to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/b84c1fa2b8fe1c3b65b423933b493b1859645875 :( sorry
[14:30:16] (PrometheusMysqldExporterFailed) firing: (3) Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:32:16] (PrometheusMysqldExporterFailed) firing: (16) Prometheus-mysqld-exporter failed (es1020:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:33:37] taavi: oh! having a look
[14:34:12] `Feb 22 14:29:10 es1020 prometheus-mysqld-exporter[1254184]: time="2022-02-22T14:29:10Z" level=error msg="Error scraping for collect.heartbeat: strconv.ParseFloat: parsing \"\": invalid syntax" source="exporter.go:171"`
[14:34:16] ok. rolling back.
[14:37:13] ohh. and also grants issues.
[14:37:16] yeah, this is a mess.
[14:37:16] (PrometheusMysqldExporterFailed) firing: (17) Prometheus-mysqld-exporter failed (db1105:13311) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:39:22] :(
[14:42:14] taavi: not your fault, to be clear!
[14:42:16] (PrometheusMysqldExporterFailed) firing: (17) Prometheus-mysqld-exporter failed (db1105:13311) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:44:16] (PrometheusMysqldExporterFailed) resolved: Prometheus-mysqld-exporter failed (matomo1002:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:44:17] taavi: yeeah, and some hosts don't have a heartbeat database at all
[14:44:38] (e.g. matomo1002)
[14:45:10] taavi: so.. i'm afraid the only way to make this work would be to refactor the puppet classes a bit to allow you to provide an additional list of flags/collectors
[14:45:41] yeah.. although that doesn't seem like a bad idea overall
[14:45:54] I'll take a look at that later today / this week
[14:45:58] 👍
[14:46:09] sorry that the chaos in production ruined your nice quick fix :)
[14:46:41] heh.. no worries, I still need to get pt-heartbeat actually running on the cluster I want to monitor
[14:47:16] (PrometheusMysqldExporterFailed) firing: (17) Prometheus-mysqld-exporter failed (db1105:13311) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:47:23] taavi: note that prod runs a _modified_ form of pt-heartbeat. it shouldn't matter for your purposes, but i wanted to mention it. our version adds a couple of extra flags/columns
[14:52:16] (PrometheusMysqldExporterFailed) resolved: (17) Prometheus-mysqld-exporter failed (db1105:13311) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[14:55:16] (PrometheusMysqldExporterFailed) firing: (3) Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[15:00:16] (PrometheusMysqldExporterFailed) resolved: (3) Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[15:00:52] ok. that's maybe the last of the alerts.
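The collect.heartbeat error quoted above (strconv.ParseFloat on an empty string) comes from the exporter failing to turn the heartbeat timestamp it read into a number. A manual check along the lines of what the collector queries, assuming the conventional heartbeat.heartbeat database/table layout that pt-heartbeat writes to; the exact collector query and the extra columns added by the WMF-modified pt-heartbeat are not shown in the log, so treat this as a sketch only.

    -- Roughly what the heartbeat collector reads (illustrative column list):
    SELECT UNIX_TIMESTAMP(ts) AS heartbeat_ts, server_id
    FROM heartbeat.heartbeat;
    -- No rows, a NULL ts, or the heartbeat database not existing at all
    -- (as on matomo1002) would all leave the collector with nothing to parse.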
[15:16:30] I'm back from lunch
[15:33:40] kormat: that script restarts every ten minutes
[15:34:18] (if you mean my scripts; if you mean the dumper, then that's also another mess)
[15:34:56] Amir1: i'm referring to pid 9270 on mwmaint1002
[15:35:07] it's been running since feb 16th, and has had a connection open to db1127 for some hours now
[15:35:55] aaah
[15:37:18] yeah, that's when it reached huwiki, it'll take a while, for most wikis it should be quick but flaggedtemplates is a mess
[15:37:44] e.g. it was 3B rows in dewiki, now it's around 800M
[15:37:55] afk
[15:39:26] kormat: let me know if another host still keeps a connection
[15:41:07] Amir1: nothing has changed? the process is still running on mwmaint1002, and still has a connection open
[15:41:31] yeah, I think it should ignore this host for now
[15:41:41] which "it"?
[15:42:05] db1127
[15:42:20] the schema change script
[15:42:20] you think db1127 should "ignore" mwmaint1002?
[15:42:30] no
[15:42:39] that's a different story altogether
[15:42:48] T298485
[15:42:49] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485
[15:42:58] ok. can you please state in a full sentence what you mean? :)
[15:43:06] which is unfortunately not easy to fix
[15:43:38] what I'm saying is that the schema change script should ignore db1127 for now, until the mwmaint script finishes on huwiki
[15:43:48] which hopefully will be soon, let me check how far it is
[15:44:35] the schema change script is _only_ targeting db1127 right now, because depooling it has failed twice.
[15:44:54] hmm, okay
[15:45:11] then let me try something
[15:46:41] kormat: can you see if it's going through
[15:46:50] yes, finally.
[15:47:49] Amir1: thanks!
[15:50:58] I need to focus on connection handling in mw soon, once I'm done with some of my current work
[15:51:08] it's hindering a lot of our work
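For reference, the long-held connection discussed above can also be spotted from the database side using the standard processlist view; the one-hour cutoff and the mwmaint host filter below are arbitrary illustrative choices, not what was actually run.

    -- Connections that have been open in the same state for over an hour,
    -- filtered to maintenance hosts (cutoff and filter are examples only):
    SELECT id, user, host, db, command, time, state
    FROM information_schema.processlist
    WHERE time > 3600
      AND host LIKE 'mwmaint%'
    ORDER BY time DESC;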