[01:08:59] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 17.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:09:21] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 26.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:09:33] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 13.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:11:59] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:12:31] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:13:49] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[04:55:03] brennen: I'm going to start replication on that last host
[04:55:35] And also failover phabricator proxy, which should be transparent to all this
[05:35:03] done
[15:43:28] * dr0ptp4kt marostegui: still around? if so, got a moment on the querysampler id? if not, or if you're wrapping for the week, understood! (lmk if i should file a task or if i should ping here monday instead)
[15:50:14] dr0ptp4kt: Yeah sorry, confirmed, there is no such user
[15:50:30] We probably didn't even migrate it back in 2018 or something when we switched from the old replicas to the new infra
[16:03:09] marostegui: yeah, i'm guessing after brooke was done with https://phabricator.wikimedia.org/T272723 (?).
[16:03:09] dr0ptp4kt@clouddb-wikireplicas-query-1:/srv/queries$ ls -aldrst $PWD/*
[16:03:09] 326904 -rw-rw---- 1 root project-clouddb-services 334749696 May 10 2021 /srv/queries/sampler.db
[16:04:24] what would be the best way to restore the id? one TODO in my analysis is to profile usage a bit
[17:05:18] dr0ptp4kt: I don't know, it's been a few years, I'd need to check what the original goal for that was and how it worked
[19:41:55] A daemon was polling every 10 to 4000 seconds with https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/files/wmcs/db/wikireplicas/querysampler.py . The purpose was to check if multi-instance queries would break; it looks like the data crunching maybe never happened - https://phabricator.wikimedia.org/T267989#6966417 .
[19:45:15] In my case, if we just get the same daemon back up and running with the same `querysampler` user, the data it produces would do the trick nicely, as it'd have the DB username, target database, query, and datetime for each sample taken from information_schema.processlist. That'd let me aggregate on the interesting things.
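The sampling loop described above works roughly as follows. This is a minimal sketch, not the actual querysampler.py from operations-puppet: it assumes pymysql for the replica connections and a local SQLite file like /srv/queries/sampler.db, and the table schema, column names, and environment variables are illustrative only.

```python
#!/usr/bin/env python3
"""Minimal sketch of the sampling loop discussed in the log above.

NOT the real operations-puppet querysampler.py: pymysql, the SQLite schema,
the env var names, and the randomized 10-4000 second sleep are assumptions.
"""
import os
import random
import sqlite3
import time

import pymysql

HOSTS = ["dbproxy1018.eqiad.wmnet", "dbproxy1019.eqiad.wmnet"]  # dedup'd from the .erb
LOCALDB = "/srv/queries/sampler.db"


def sample_once(host, user, password, local):
    """Grab the current processlist from one host and append it to SQLite."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           database="information_schema", connect_timeout=5)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT USER, DB, INFO FROM PROCESSLIST WHERE INFO IS NOT NULL")
            rows = cur.fetchall()
    finally:
        conn.close()
    local.executemany(
        "INSERT INTO samples (host, db_user, target_db, query, seen_at) "
        "VALUES (?, ?, ?, ?, datetime('now'))",
        [(host, r[0], r[1], r[2]) for r in rows],
    )
    local.commit()


def main(user, password):
    local = sqlite3.connect(LOCALDB)
    local.execute("CREATE TABLE IF NOT EXISTS samples "
                  "(host TEXT, db_user TEXT, target_db TEXT, query TEXT, seen_at TEXT)")
    while True:
        for host in HOSTS:
            sample_once(host, user, password, local)
        time.sleep(random.randint(10, 4000))  # randomized poll interval, per the log above


if __name__ == "__main__":
    main(os.environ["SAMPLER_USER"], os.environ["SAMPLER_PASS"])  # hypothetical env vars
```

Sampling only rows with non-empty INFO keeps idle connections out of the local database, which is what makes per-user and per-database aggregation of actual query traffic possible later.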
[19:48:17] The grant needed would be https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/templates/mariadb/grants/wiki-replicas.sql$76 on each replica behind the hosts listed in querysampler-config.yaml.erb:
[19:48:20] localdb: /srv/queries/sampler.db
[19:48:20] password: <%= @replicapass %>
[19:48:20] user: <%= @replicauser %>
[19:48:20] hosts:
[19:48:20] - dbproxy1018.eqiad.wmnet
[19:48:20] - dbproxy1019.eqiad.wmnet
[19:48:20] - dbproxy1018.eqiad.wmnet
[22:13:03] (I'm not sure why dbproxy1018.eqiad.wmnet is listed twice in the .erb; maybe it was an attempt to force a weighting or something, since https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/files/wmcs/db/wikireplicas/querysampler.py$75-77 just iterates through the 'hosts' entries. So that is something I'd need to change, and I'd also need to make it iterate over the ports for all shards; a sketch of that follows below.)
[22:14:53] (I'm thinking I could just run a manually executed -d invocation of this script inside screen after making the modifications, unless there's some reason we really need to daemonize it again.)
[22:25:29] I guess another approach could be enabling server_audit, but that seems more involved, at least in the short run. Heading out, have a good weekend.
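For the iteration change mentioned at 22:13:03, a rough sketch of looping over every host/section port pair is below. The deduplicated host list comes from the .erb excerpt above; the SECTION_PORTS mapping is a placeholder assumption, not the actual clouddb shard-to-port layout, and would need to be filled in from puppet.

```python
"""Sketch of the host/port iteration change described at 22:13:03.

Assumption: the real querysampler.py only loops over the 'hosts' entries, and
the SECTION_PORTS mapping below is a placeholder, not the actual shard layout.
"""
from itertools import product

HOSTS = ["dbproxy1018.eqiad.wmnet", "dbproxy1019.eqiad.wmnet"]  # deduplicated from the .erb
SECTION_PORTS = {"s1": 3311, "s2": 3312}  # placeholder values; fill in per shard


def targets():
    """Yield every (host, section, port) combination the sampler should poll."""
    for host, (section, port) in product(HOSTS, SECTION_PORTS.items()):
        yield host, section, port


if __name__ == "__main__":
    for host, section, port in targets():
        print(f"would sample {section} via {host}:{port}")
```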