[01:08:59] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 17.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:09:21] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 26.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:09:33] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 13.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:11:59] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:12:31] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:13:49] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[04:55:03] brennen: I'm going to start replication on that last host
[04:55:35] And also failover phabricator proxy, which should be transparent to all this
[05:35:03] done
[15:43:28] * dr0ptp4kt marostegui: still around? if so, got a moment on the querysampler id? if not, or if you're wrapping for the week, understood! (lmk if i should file a task or if i should ping here monday instead)
[15:50:14] dr0ptp4kt: Yeah sorry, confirmed, there is no such user
[15:50:30] We probably didn't even migrate it back in 2018 or something when we switched from the old replicas to the new infra
[16:03:09] marostegui: yeah, i'm guessing after brooke was done with https://phabricator.wikimedia.org/T272723 (?).
[16:03:09] dr0ptp4kt@clouddb-wikireplicas-query-1:/srv/queries$ ls -aldrst $PWD/*
[16:03:09] 326904 -rw-rw---- 1 root project-clouddb-services 334749696 May 10 2021 /srv/queries/sampler.db
[16:04:24] what would be the best way to restore the id? one TODO in my analysis is to profile usage a bit
[17:05:18] dr0ptp4kt: I don't know, it's been a few years, I'd need to check what the original goal for that was and how it worked
[19:41:55] A daemon was polling every 10 to 4000 seconds with https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/files/wmcs/db/wikireplicas/querysampler.py . The purpose was to check if multi-instance queries would break; it looks like the data crunching maybe never happened - https://phabricator.wikimedia.org/T267989#6966417 .
[19:45:15] In my case, if we just get the same daemon back up and running with the same `querysampler` user, the data it produces would do the trick nicely, as it'd have the DB username, target database, query, and datetime for each sample taken from information_schema.processlist. That'd let me aggregate on the interesting things.
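The sampling loop described above works roughly as follows. This is a minimal sketch, not the actual querysampler.py from operations-puppet: it assumes pymysql for the replica connections and a local SQLite file like /srv/queries/sampler.db, and the table schema, column names, and environment variables are illustrative only.

```python
#!/usr/bin/env python3
"""Minimal sketch of the sampling loop discussed in the log above.

NOT the real operations-puppet querysampler.py: pymysql, the SQLite schema,
the env var names, and the randomized 10-4000 second sleep are assumptions.
"""
import os
import random
import sqlite3
import time

import pymysql

HOSTS = ["dbproxy1018.eqiad.wmnet", "dbproxy1019.eqiad.wmnet"]  # dedup'd from the .erb
LOCALDB = "/srv/queries/sampler.db"


def sample_once(host, user, password, local):
    """Grab the current processlist from one host and append it to SQLite."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           database="information_schema", connect_timeout=5)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT USER, DB, INFO FROM PROCESSLIST WHERE INFO IS NOT NULL")
            rows = cur.fetchall()
    finally:
        conn.close()
    local.executemany(
        "INSERT INTO samples (host, db_user, target_db, query, seen_at) "
        "VALUES (?, ?, ?, ?, datetime('now'))",
        [(host, r[0], r[1], r[2]) for r in rows],
    )
    local.commit()


def main(user, password):
    local = sqlite3.connect(LOCALDB)
    local.execute("CREATE TABLE IF NOT EXISTS samples "
                  "(host TEXT, db_user TEXT, target_db TEXT, query TEXT, seen_at TEXT)")
    while True:
        for host in HOSTS:
            sample_once(host, user, password, local)
        time.sleep(random.randint(10, 4000))  # randomized poll interval, per the log above


if __name__ == "__main__":
    main(os.environ["SAMPLER_USER"], os.environ["SAMPLER_PASS"])  # hypothetical env vars
```

Sampling only rows with non-empty INFO keeps idle connections out of the local database, which is what makes per-user and per-database aggregation of actual query traffic possible later.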
[19:48:17] The grant needed would be https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/templates/mariadb/grants/wiki-replicas.sql$76 on each replica behind the hosts listed in querysampler-config.yaml.erb:
[19:48:20] localdb: /srv/queries/sampler.db
[19:48:20] password: <%= @replicapass %>
[19:48:20] user: <%= @replicauser %>
[19:48:20] hosts:
[19:48:20] - dbproxy1018.eqiad.wmnet
[19:48:20] - dbproxy1019.eqiad.wmnet
[19:48:20] - dbproxy1018.eqiad.wmnet
[22:13:03] (I'm not sure why dbproxy1018.eqiad.wmnet is listed twice in the .erb; maybe it was an attempt to force a weighting or something, since https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/files/wmcs/db/wikireplicas/querysampler.py$75-77 just iterates through the 'hosts' entries. So that is something I'd need to change, and I'd also need to make it iterate over the ports for all shards; a sketch of that follows below.)
[22:14:53] (I'm thinking I could just run a manually executed -d invocation of this script inside screen after making the modifications, unless there's some reason we really need to daemonize it again.)
[22:25:29] I guess another approach could be enabling server_audit, but that seems more involved, at least in the short run. Heading out, have a good weekend.
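For the iteration change mentioned at 22:13:03, a rough sketch of looping over every host/section port pair is below. The deduplicated host list comes from the .erb excerpt above; the SECTION_PORTS mapping is a placeholder assumption, not the actual clouddb shard-to-port layout, and would need to be filled in from puppet.

```python
"""Sketch of the host/port iteration change described at 22:13:03.

Assumption: the real querysampler.py only loops over the 'hosts' entries, and
the SECTION_PORTS mapping below is a placeholder, not the actual shard layout.
"""
from itertools import product

HOSTS = ["dbproxy1018.eqiad.wmnet", "dbproxy1019.eqiad.wmnet"]  # deduplicated from the .erb
SECTION_PORTS = {"s1": 3311, "s2": 3312}  # placeholder values; fill in per shard


def targets():
    """Yield every (host, section, port) combination the sampler should poll."""
    for host, (section, port) in product(HOSTS, SECTION_PORTS.items()):
        yield host, section, port


if __name__ == "__main__":
    for host, section, port in targets():
        print(f"would sample {section} via {host}:{port}")
```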