[00:00:57] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:10:57] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:11:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:15:57] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:20:57] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:20:59] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:21:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2040.codfw.wmnet with OS bullseye
[00:21:07] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2040.codfw.wmnet with OS bullseye completed: - kubernetes2040 (**PASS*...
[00:21:15] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Papaul)
[00:25:57] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:35:57] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:38:21] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951869
[00:38:27] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951869 (owner: TrainBranchBot)
[00:40:57] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:42:22] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:57] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:50:57] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[01:24:50] <wikibugs>	 (PS4) TTO: Enable PageNotice on enwiktionary beta [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245)
[01:25:57] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[01:26:02] <tto>	 hi all!
[01:26:02] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951869 (owner: TrainBranchBot)
[01:26:04] <tto>	 Long time no see
[01:26:15] <tto>	 Any chance of a look at https://gerrit.wikimedia.org/r/668156/ ?
[01:26:23] <tto>	 This affects beta only - does it need to be added to a deployment window?
[01:26:28] <tto>	 Or can it be merged ad hoc?
[01:30:57] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[01:35:57] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[01:50:57] <jinxer-wm>	 (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[01:51:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:55:57] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[01:56:03] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:58] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[02:07:27] <wikibugs>	 (CR) Ladsgroup: [C: +1] ores-extension: replace thresholds with numeric values [mediawiki-config] - https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) (owner: Ilias Sarantopoulos)
[02:10:58] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[02:11:03] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:57] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[02:17:14] <icinga-wm>	 PROBLEM - Check systemd state on backup2003 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:20:57] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[02:21:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:30:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[02:31:03] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[02:34:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[02:34:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T344589)', diff saved to https://phabricator.wikimedia.org/P51111 and previous config saved to /var/cache/conftool/dbconfig/20230824-023407-ladsgroup.json
[02:35:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1178.eqiad.wmnet with reason: Host needs maint
[02:35:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1178.eqiad.wmnet with reason: Host needs maint
[02:35:57] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[02:39:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T344589)', diff saved to https://phabricator.wikimedia.org/P51112 and previous config saved to /var/cache/conftool/dbconfig/20230824-023924-ladsgroup.json
[02:40:57] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[02:42:54] <icinga-wm>	 RECOVERY - Check systemd state on backup2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:45:57] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[02:48:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[02:48:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[02:54:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P51113 and previous config saved to /var/cache/conftool/dbconfig/20230824-025431-ladsgroup.json
[02:55:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[03:05:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[03:09:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P51114 and previous config saved to /var/cache/conftool/dbconfig/20230824-030937-ladsgroup.json
[03:09:38] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:09:42] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:10:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[03:12:30] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 6.847 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:13:46] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.266 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:15:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[03:20:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[03:21:05] <wikibugs>	 (CR) Krinkle: "Feel free to schedule for Backport deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/927279 (https://phabricator.wikimedia.org/T338183) (owner: TheDJ)
[03:24:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T344589)', diff saved to https://phabricator.wikimedia.org/P51115 and previous config saved to /var/cache/conftool/dbconfig/20230824-032443-ladsgroup.json
[03:24:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[03:25:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[03:25:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51116 and previous config saved to /var/cache/conftool/dbconfig/20230824-032508-ladsgroup.json
[03:25:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[03:25:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[03:25:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T343718)', diff saved to https://phabricator.wikimedia.org/P51117 and previous config saved to /var/cache/conftool/dbconfig/20230824-032545-ladsgroup.json
[03:25:50] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[03:25:58] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[03:26:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P51118 and previous config saved to /var/cache/conftool/dbconfig/20230824-032633-ladsgroup.json
[03:30:58] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[03:32:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[03:32:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[03:32:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51119 and previous config saved to /var/cache/conftool/dbconfig/20230824-033240-ladsgroup.json
[03:39:11] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/951870 (https://phabricator.wikimedia.org/T344881)
[03:39:16] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/951871 (https://phabricator.wikimedia.org/T344881)
[03:40:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[03:40:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[03:40:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T344589)', diff saved to https://phabricator.wikimedia.org/P51120 and previous config saved to /var/cache/conftool/dbconfig/20230824-034056-ladsgroup.json
[03:45:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[03:47:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P51121 and previous config saved to /var/cache/conftool/dbconfig/20230824-034747-ladsgroup.json
[03:48:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T344589)', diff saved to https://phabricator.wikimedia.org/P51122 and previous config saved to /var/cache/conftool/dbconfig/20230824-034815-ladsgroup.json
[03:55:58] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[03:57:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[03:57:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[04:00:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:01:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance
[04:01:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance
[04:01:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T344589)', diff saved to https://phabricator.wikimedia.org/P51123 and previous config saved to /var/cache/conftool/dbconfig/20230824-040139-ladsgroup.json
[04:02:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P51124 and previous config saved to /var/cache/conftool/dbconfig/20230824-040253-ladsgroup.json
[04:03:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P51125 and previous config saved to /var/cache/conftool/dbconfig/20230824-040321-ladsgroup.json
[04:05:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:06:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T343718)', diff saved to https://phabricator.wikimedia.org/P51126 and previous config saved to /var/cache/conftool/dbconfig/20230824-040656-ladsgroup.json
[04:07:02] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[04:08:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T344589)', diff saved to https://phabricator.wikimedia.org/P51127 and previous config saved to /var/cache/conftool/dbconfig/20230824-040808-ladsgroup.json
[04:10:58] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:14:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[04:14:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[04:14:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T343718)', diff saved to https://phabricator.wikimedia.org/P51128 and previous config saved to /var/cache/conftool/dbconfig/20230824-041421-ladsgroup.json
[04:14:26] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[04:15:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2022.codfw.wmnet with reason: Maintenance
[04:15:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2022.codfw.wmnet with reason: Maintenance
[04:15:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2022 (T344589)', diff saved to https://phabricator.wikimedia.org/P51129 and previous config saved to /var/cache/conftool/dbconfig/20230824-041537-ladsgroup.json
[04:15:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:18:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51130 and previous config saved to /var/cache/conftool/dbconfig/20230824-041759-ladsgroup.json
[04:18:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P51131 and previous config saved to /var/cache/conftool/dbconfig/20230824-041827-ladsgroup.json
[04:21:22] <wikibugs>	 (PS1) Ladsgroup: db1178: Disable notification [puppet] - https://gerrit.wikimedia.org/r/952036 (https://phabricator.wikimedia.org/T344880)
[04:22:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P51132 and previous config saved to /var/cache/conftool/dbconfig/20230824-042202-ladsgroup.json
[04:23:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P51133 and previous config saved to /var/cache/conftool/dbconfig/20230824-042314-ladsgroup.json
[04:27:40] <wikibugs>	 ops-eqiad, DBA, Patch-For-Review: db1178 didn't come back online after reboot - https://phabricator.wikimedia.org/T344880 (Marostegui)
[04:27:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022 (T344589)', diff saved to https://phabricator.wikimedia.org/P51134 and previous config saved to /var/cache/conftool/dbconfig/20230824-042740-ladsgroup.json
[04:28:23] <wikibugs>	 (CR) Ladsgroup: [C: +2] db1178: Disable notification [puppet] - https://gerrit.wikimedia.org/r/952036 (https://phabricator.wikimedia.org/T344880) (owner: Ladsgroup)
[04:33:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T344589)', diff saved to https://phabricator.wikimedia.org/P51135 and previous config saved to /var/cache/conftool/dbconfig/20230824-043334-ladsgroup.json
[04:36:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P51136 and previous config saved to /var/cache/conftool/dbconfig/20230824-043619-ladsgroup.json
[04:37:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P51137 and previous config saved to /var/cache/conftool/dbconfig/20230824-043709-ladsgroup.json
[04:38:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P51138 and previous config saved to /var/cache/conftool/dbconfig/20230824-043820-ladsgroup.json
[04:39:09] <wikibugs>	 (CR) Ladsgroup: [C: +1] "IMO, it's good to go." [mediawiki-config] - https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) (owner: Ilias Sarantopoulos)
[04:42:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022', diff saved to https://phabricator.wikimedia.org/P51139 and previous config saved to /var/cache/conftool/dbconfig/20230824-044247-ladsgroup.json
[04:51:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P51140 and previous config saved to /var/cache/conftool/dbconfig/20230824-045125-ladsgroup.json
[04:51:57] <wikibugs>	 (PS2) Ladsgroup: Stop writing to the old columns of extlinks in enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/951436 (https://phabricator.wikimedia.org/T342683)
[04:52:12] <wikibugs>	 (CR) Ladsgroup: [C: +2] Stop writing to the old columns of extlinks in enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/951436 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup)
[04:52:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T343718)', diff saved to https://phabricator.wikimedia.org/P51141 and previous config saved to /var/cache/conftool/dbconfig/20230824-045215-ladsgroup.json
[04:52:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[04:52:20] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[04:52:26] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/951436 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup)
[04:52:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[04:52:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T343718)', diff saved to https://phabricator.wikimedia.org/P51142 and previous config saved to /var/cache/conftool/dbconfig/20230824-045236-ladsgroup.json
[04:52:54] <wikibugs>	 (Merged) jenkins-bot: Stop writing to the old columns of extlinks in enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/951436 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup)
[04:53:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T344589)', diff saved to https://phabricator.wikimedia.org/P51143 and previous config saved to /var/cache/conftool/dbconfig/20230824-045326-ladsgroup.json
[04:53:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[04:53:38] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:951436|Stop writing to the old columns of extlinks in enwiki (T342683)]]
[04:53:42] <stashbot>	 T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683
[04:53:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[04:53:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T344589)', diff saved to https://phabricator.wikimedia.org/P51144 and previous config saved to /var/cache/conftool/dbconfig/20230824-045352-ladsgroup.json
[04:54:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T343718)', diff saved to https://phabricator.wikimedia.org/P51145 and previous config saved to /var/cache/conftool/dbconfig/20230824-045447-ladsgroup.json
[04:55:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T343718)', diff saved to https://phabricator.wikimedia.org/P51146 and previous config saved to /var/cache/conftool/dbconfig/20230824-045504-ladsgroup.json
[04:55:16] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:951436|Stop writing to the old columns of extlinks in enwiki (T342683)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[04:56:13] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[04:57:22] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:57:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022', diff saved to https://phabricator.wikimedia.org/P51147 and previous config saved to /var/cache/conftool/dbconfig/20230824-045753-ladsgroup.json
[04:58:48] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:01:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T344589)', diff saved to https://phabricator.wikimedia.org/P51148 and previous config saved to /var/cache/conftool/dbconfig/20230824-050137-ladsgroup.json
[05:01:54] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:951436|Stop writing to the old columns of extlinks in enwiki (T342683)]] (duration: 08m 16s)
[05:01:59] <stashbot>	 T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683
[05:06:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P51149 and previous config saved to /var/cache/conftool/dbconfig/20230824-050632-ladsgroup.json
[05:09:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P51150 and previous config saved to /var/cache/conftool/dbconfig/20230824-050953-ladsgroup.json
[05:10:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P51151 and previous config saved to /var/cache/conftool/dbconfig/20230824-051010-ladsgroup.json
[05:13:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022 (T344589)', diff saved to https://phabricator.wikimedia.org/P51152 and previous config saved to /var/cache/conftool/dbconfig/20230824-051259-ladsgroup.json
[05:16:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P51153 and previous config saved to /var/cache/conftool/dbconfig/20230824-051644-ladsgroup.json
[05:19:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344881
[05:19:22] <stashbot>	 T344881: Switchover s4 master (db1160 -> db1138) - https://phabricator.wikimedia.org/T344881
[05:19:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344881
[05:19:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1138 with weight 0 T344881', diff saved to https://phabricator.wikimedia.org/P51154 and previous config saved to /var/cache/conftool/dbconfig/20230824-051951-ladsgroup.json
[05:21:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P51155 and previous config saved to /var/cache/conftool/dbconfig/20230824-052138-ladsgroup.json
[05:21:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[05:21:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[05:21:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:22:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:22:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T344589)', diff saved to https://phabricator.wikimedia.org/P51156 and previous config saved to /var/cache/conftool/dbconfig/20230824-052208-ladsgroup.json
[05:25:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P51157 and previous config saved to /var/cache/conftool/dbconfig/20230824-052459-ladsgroup.json
[05:25:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P51158 and previous config saved to /var/cache/conftool/dbconfig/20230824-052517-ladsgroup.json
[05:25:46] <wikibugs>	 SRE, SRE-swift-storage, Beta-Cluster-Infrastructure: Captchas are broken in the beta cluster - https://phabricator.wikimedia.org/T344834 (Tgr) According to the [[https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/ad097dd6ef45fad3612ca33371f5c478870fbaa6/modules/swift/templates/proxy-s...
[05:28:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T344589)', diff saved to https://phabricator.wikimedia.org/P51159 and previous config saved to /var/cache/conftool/dbconfig/20230824-052829-ladsgroup.json
[05:30:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[05:31:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P51160 and previous config saved to /var/cache/conftool/dbconfig/20230824-053150-ladsgroup.json
[05:35:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:40:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T343718)', diff saved to https://phabricator.wikimedia.org/P51161 and previous config saved to /var/cache/conftool/dbconfig/20230824-054005-ladsgroup.json
[05:40:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[05:40:12] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[05:40:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[05:40:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:40:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T343718)', diff saved to https://phabricator.wikimedia.org/P51162 and previous config saved to /var/cache/conftool/dbconfig/20230824-054023-ladsgroup.json
[05:40:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[05:40:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:40:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T343718)', diff saved to https://phabricator.wikimedia.org/P51163 and previous config saved to /var/cache/conftool/dbconfig/20230824-054033-ladsgroup.json
[05:40:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[05:40:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T343718)', diff saved to https://phabricator.wikimedia.org/P51164 and previous config saved to /var/cache/conftool/dbconfig/20230824-054044-ladsgroup.json
[05:40:58] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[05:42:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T343718)', diff saved to https://phabricator.wikimedia.org/P51165 and previous config saved to /var/cache/conftool/dbconfig/20230824-054244-ladsgroup.json
[05:43:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P51166 and previous config saved to /var/cache/conftool/dbconfig/20230824-054335-ladsgroup.json
[05:46:23] <wikibugs>	 (PS2) Ladsgroup: mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/951870 (https://phabricator.wikimedia.org/T344881) (owner: Gerrit maintenance bot)
[05:46:28] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/951870 (https://phabricator.wikimedia.org/T344881) (owner: Gerrit maintenance bot)
[05:46:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T344589)', diff saved to https://phabricator.wikimedia.org/P51167 and previous config saved to /var/cache/conftool/dbconfig/20230824-054656-ladsgroup.json
[05:47:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[05:47:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[05:47:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[05:47:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[05:47:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T344589)', diff saved to https://phabricator.wikimedia.org/P51168 and previous config saved to /var/cache/conftool/dbconfig/20230824-054726-ladsgroup.json
[05:48:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:50:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:50:58] <jinxer-wm>	 (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[05:51:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:53:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:54:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:55:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T344589)', diff saved to https://phabricator.wikimedia.org/P51169 and previous config saved to /var/cache/conftool/dbconfig/20230824-055511-ladsgroup.json
[05:57:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P51170 and previous config saved to /var/cache/conftool/dbconfig/20230824-055750-ladsgroup.json
[05:58:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2023.codfw.wmnet with reason: Maintenance
[05:58:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2023.codfw.wmnet with reason: Maintenance
[05:58:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P51171 and previous config saved to /var/cache/conftool/dbconfig/20230824-055842-ladsgroup.json
[05:58:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2023 (T344589)', diff saved to https://phabricator.wikimedia.org/P51172 and previous config saved to /var/cache/conftool/dbconfig/20230824-055846-ladsgroup.json
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T0600).
[06:00:11] <Amir1>	 o/
[06:00:15] <Amir1>	 about to switchover s4
[06:00:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[06:01:47] <Amir1>	 !log Starting s4 eqiad failover from db1160 to db1138 - T344881
[06:01:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:52] <stashbot>	 T344881: Switchover s4 master (db1160 -> db1138) - https://phabricator.wikimedia.org/T344881
[06:01:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T344881', diff saved to https://phabricator.wikimedia.org/P51173 and previous config saved to /var/cache/conftool/dbconfig/20230824-060157-ladsgroup.json
[06:02:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1138 to s4 primary and set section read-write T344881', diff saved to https://phabricator.wikimedia.org/P51174 and previous config saved to /var/cache/conftool/dbconfig/20230824-060245-ladsgroup.json
[06:04:16] <wikibugs>	 (CR) Ladsgroup: [C: +2] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/951871 (https://phabricator.wikimedia.org/T344881) (owner: Gerrit maintenance bot)
[06:04:44] <wikibugs>	 SRE, SRE-swift-storage, Beta-Cluster-Infrastructure: Captchas are broken in the beta cluster - https://phabricator.wikimedia.org/T344834 (Urbanecm_WMF) Thanks for the info, @Tgr! I [fixed](https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/bf3e99977c63c4b65bfd211d3fd960e7700f5d5f%...
[06:05:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[06:06:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1160 T344881', diff saved to https://phabricator.wikimedia.org/P51175 and previous config saved to /var/cache/conftool/dbconfig/20230824-060647-ladsgroup.json
[06:06:54] <stashbot>	 T344881: Switchover s4 master (db1160 -> db1138) - https://phabricator.wikimedia.org/T344881
[06:08:16] <wikibugs>	 SRE, SRE-swift-storage, Beta-Cluster-Infrastructure: Captchas are broken in the beta cluster - https://phabricator.wikimedia.org/T344834 (Urbanecm_WMF) Open→Resolved p:Triage→High a:Urbanecm_WMF Boldly resolving.
[06:09:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance
[06:09:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance
[06:09:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1024 (T344589)', diff saved to https://phabricator.wikimedia.org/P51176 and previous config saved to /var/cache/conftool/dbconfig/20230824-060924-ladsgroup.json
[06:09:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[06:09:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[06:10:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P51177 and previous config saved to /var/cache/conftool/dbconfig/20230824-061017-ladsgroup.json
[06:12:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P51178 and previous config saved to /var/cache/conftool/dbconfig/20230824-061256-ladsgroup.json
[06:13:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T344589)', diff saved to https://phabricator.wikimedia.org/P51179 and previous config saved to /var/cache/conftool/dbconfig/20230824-061348-ladsgroup.json
[06:13:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[06:14:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[06:14:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T344589)', diff saved to https://phabricator.wikimedia.org/P51180 and previous config saved to /var/cache/conftool/dbconfig/20230824-061413-ladsgroup.json
[06:14:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:15:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[06:17:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023 (T344589)', diff saved to https://phabricator.wikimedia.org/P51181 and previous config saved to /var/cache/conftool/dbconfig/20230824-061748-ladsgroup.json
[06:18:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[06:18:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[06:18:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T344589)', diff saved to https://phabricator.wikimedia.org/P51182 and previous config saved to /var/cache/conftool/dbconfig/20230824-061813-ladsgroup.json
[06:20:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[06:21:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T344589)', diff saved to https://phabricator.wikimedia.org/P51183 and previous config saved to /var/cache/conftool/dbconfig/20230824-062127-ladsgroup.json
[06:21:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T343718)', diff saved to https://phabricator.wikimedia.org/P51184 and previous config saved to /var/cache/conftool/dbconfig/20230824-062143-ladsgroup.json
[06:21:48] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[06:24:31] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:25:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P51185 and previous config saved to /var/cache/conftool/dbconfig/20230824-062523-ladsgroup.json
[06:26:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T344589)', diff saved to https://phabricator.wikimedia.org/P51186 and previous config saved to /var/cache/conftool/dbconfig/20230824-062645-ladsgroup.json
[06:27:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[06:27:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[06:28:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T343718)', diff saved to https://phabricator.wikimedia.org/P51187 and previous config saved to /var/cache/conftool/dbconfig/20230824-062802-ladsgroup.json
[06:28:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[06:28:07] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[06:28:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[06:28:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51188 and previous config saved to /var/cache/conftool/dbconfig/20230824-062824-ladsgroup.json
[06:29:31] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:30:51] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - https://gerrit.wikimedia.org/r/951873 (https://phabricator.wikimedia.org/T344883)
[06:31:03] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:31:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344883
[06:31:24] <stashbot>	 T344883: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T344883
[06:31:36] <wikibugs>	 (PS1) Gergő Tisza: multi-dc: Fix central autologin URL pattern [puppet] - https://gerrit.wikimedia.org/r/952045
[06:31:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344883
[06:32:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2179 with weight 0 T344883', diff saved to https://phabricator.wikimedia.org/P51189 and previous config saved to /var/cache/conftool/dbconfig/20230824-063240-ladsgroup.json
[06:32:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023', diff saved to https://phabricator.wikimedia.org/P51190 and previous config saved to /var/cache/conftool/dbconfig/20230824-063255-ladsgroup.json
[06:36:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P51191 and previous config saved to /var/cache/conftool/dbconfig/20230824-063633-ladsgroup.json
[06:36:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P51192 and previous config saved to /var/cache/conftool/dbconfig/20230824-063649-ladsgroup.json
[06:40:22] <Amir1>	 !log killed mwscript updateSpecialPages.php metawiki --override --only=Mostlinked blocking db depool
[06:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T344589)', diff saved to https://phabricator.wikimedia.org/P51193 and previous config saved to /var/cache/conftool/dbconfig/20230824-064030-ladsgroup.json
[06:40:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[06:40:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[06:40:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51194 and previous config saved to /var/cache/conftool/dbconfig/20230824-064044-ladsgroup.json
[06:40:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[06:41:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P51195 and previous config saved to /var/cache/conftool/dbconfig/20230824-064152-ladsgroup.json
[06:42:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P51196 and previous config saved to /var/cache/conftool/dbconfig/20230824-064205-ladsgroup.json
[06:42:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:42:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast1003.wikimedia.org
[06:48:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023', diff saved to https://phabricator.wikimedia.org/P51197 and previous config saved to /var/cache/conftool/dbconfig/20230824-064801-ladsgroup.json
[06:48:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast1003.wikimedia.org
[06:48:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51198 and previous config saved to /var/cache/conftool/dbconfig/20230824-064830-ladsgroup.json
[06:51:04] <wikibugs>	 (CR) Muehlenhoff: Make nftables::service types more compatible (1 comment) [puppet] - https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[06:51:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P51199 and previous config saved to /var/cache/conftool/dbconfig/20230824-065140-ladsgroup.json
[06:51:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P51200 and previous config saved to /var/cache/conftool/dbconfig/20230824-065155-ladsgroup.json
[06:52:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:55:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install1004.wikimedia.org
[06:56:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P51201 and previous config saved to /var/cache/conftool/dbconfig/20230824-065658-ladsgroup.json
[06:57:43] <wikibugs>	 (CR) JMeybohm: [C: +2] wikifunctions: Fix networkpolicies [deployment-charts] - https://gerrit.wikimedia.org/r/950188 (https://phabricator.wikimedia.org/T344177) (owner: JMeybohm)
[06:57:46] <wikibugs>	 (CR) JMeybohm: [C: +2] modules/base: networkpolicy_1.0.1 Add support for extraRules [deployment-charts] - https://gerrit.wikimedia.org/r/950187 (https://phabricator.wikimedia.org/T344177) (owner: JMeybohm)
[06:57:48] <wikibugs>	 (CR) JMeybohm: [C: +2] modules/base: Copy networkpolicy_1.0.0 to networkpolicy_1.0.1 [deployment-charts] - https://gerrit.wikimedia.org/r/950186 (https://phabricator.wikimedia.org/T344177) (owner: JMeybohm)
[06:59:06] <wikibugs>	 (Merged) jenkins-bot: modules/base: Copy networkpolicy_1.0.0 to networkpolicy_1.0.1 [deployment-charts] - https://gerrit.wikimedia.org/r/950186 (https://phabricator.wikimedia.org/T344177) (owner: JMeybohm)
[06:59:08] <wikibugs>	 (Merged) jenkins-bot: modules/base: networkpolicy_1.0.1 Add support for extraRules [deployment-charts] - https://gerrit.wikimedia.org/r/950187 (https://phabricator.wikimedia.org/T344177) (owner: JMeybohm)
[06:59:10] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Fix networkpolicies [deployment-charts] - https://gerrit.wikimedia.org/r/950188 (https://phabricator.wikimedia.org/T344177) (owner: JMeybohm)
[06:59:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:59:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1004.wikimedia.org
[07:00:04] <jouncebot>	 Amir1, apergos, and jnuche: (Dis)respected human, time to deploy UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T0700). Please do the needful.
[07:00:04] <jouncebot>	 tto and kizule: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:17] <apergos>	 morning!
[07:00:19] <tto>	 Kizule hi there!
[07:00:24] <Kizule>	 Hi!
[07:00:24] <tto>	 apergos g'day!
[07:00:44] <tto>	 My patch is 2.5 years old, please treat it gently
[07:00:49] <apergos>	 we have a trainee signed up for today to learn how to deploy. I'll wait for them to show up in google meet (I don't have their irc nick to ping them here).
[07:01:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:01:15] <apergos>	 are either of you, tto and Kizule, self-deployers or will you need our assistance today?
[07:01:23] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[07:01:25] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[07:01:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install2004.wikimedia.org
[07:01:34] <tto>	 No, I'll be needing your assistance
[07:01:54] <Kizule>	 apergos: What do you mean by assistance? I don't have access to any of the servers. ;)
[07:02:19] <apergos>	 then we'll be doing the deployment, and asking you to test at a couple of points during the process. all good!
[07:02:35] <tto>	 Just fyi, I'm on a slightly unstable connection, so if I disappear I'll reconnect asap
[07:02:39] <Kizule>	 okay :)
[07:03:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023 (T344589)', diff saved to https://phabricator.wikimedia.org/P51202 and previous config saved to /var/cache/conftool/dbconfig/20230824-070307-ladsgroup.json
[07:03:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: Maintenance
[07:03:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: Maintenance
[07:03:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2025 (T344589)', diff saved to https://phabricator.wikimedia.org/P51203 and previous config saved to /var/cache/conftool/dbconfig/20230824-070332-ladsgroup.json
[07:03:33] <apergos>	 tto:  I notice that you have a cr -1 about an issue which I assume was addressed in the latest patchset. however if you could get a cr on that before I deploy, that would be good.
[07:03:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P51204 and previous config saved to /var/cache/conftool/dbconfig/20230824-070343-ladsgroup.json
[07:04:01] <tto>	 Yes, I did address that, the CR was asking me to add the extension to wmf-config/extension-list, which I did
[07:04:09] <apergos>	 Kizule: your patch looks good to go, as soon as our trainee arrives, or after 5 more minutes, whichever comes first :-)
[07:04:17] <tto>	 Who would one get CR from at this hour? I'm out of the loop on these things
[07:04:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Pool es2025', diff saved to https://phabricator.wikimedia.org/P51205 and previous config saved to /var/cache/conftool/dbconfig/20230824-070417-ladsgroup.json
[07:04:53] <Kizule>	 I can give +1. ;)
[07:05:02] <apergos>	 tto:  deployers running the window aren't really supposed to be doing cr, we would expect patches to come to us with +1 on them already, though there is some discussion as to whether that should apply to config patches, see here: https://phabricator.wikimedia.org/T344409
[07:05:25] <RhinosF1>	 Reedy: you awake yet?
[07:05:36] <apergos>	 it's pretty early for him I think
[07:05:47] <tto>	 A live discussion on that task, I see
[07:05:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2004.wikimedia.org
[07:05:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[07:06:30] <tto>	 The deploy is not urgent, I'm happy to wait for another time if you'd prefer. I note that this is a low-risk patch, as it only touches beta cluster, but your call in the end
[07:06:41] <tto>	 Reedy is in UK right? He'd likely be asleep
[07:06:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T344589)', diff saved to https://phabricator.wikimedia.org/P51206 and previous config saved to /var/cache/conftool/dbconfig/20230824-070646-ladsgroup.json
[07:06:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[07:07:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T343718)', diff saved to https://phabricator.wikimedia.org/P51207 and previous config saved to /var/cache/conftool/dbconfig/20230824-070702-ladsgroup.json
[07:07:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[07:07:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[07:07:07] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[07:07:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T344589)', diff saved to https://phabricator.wikimedia.org/P51208 and previous config saved to /var/cache/conftool/dbconfig/20230824-070710-ladsgroup.json
[07:07:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[07:07:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T343718)', diff saved to https://phabricator.wikimedia.org/P51209 and previous config saved to /var/cache/conftool/dbconfig/20230824-070723-ladsgroup.json
[07:08:18] <apergos>	 tto:  if no one comes along who can give a meaningful +1 (I couldn't, for example) in time for the morning window, then yes if you don't mind, I'd ask you to wait.  and thanks for being understanding about it.
[07:08:23] <RhinosF1>	 It's 8am. He's UK like me I think.
[07:08:36] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[07:08:45] <tto>	 As a general comment (maybe I should add it to the task), the experience of getting things done as a volunteer (in my case, getting a new extension deployed) is already incredibly difficult if you "don't know the right people", so I'd not support anything that would add hurdles to that experience
[07:08:59] <apergos>	 in two minutes if our trainee has not shown up, I'll proceed with your patch, Kizule
[07:09:01] <tto>	 Anyway if anyone is able to CR, great, otherwise, let's leave it for now
[07:09:02] <RhinosF1>	 Sadly I have to go straight into busy at work so can't help
[07:09:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:09:28] <apergos>	 tto:   I totally understand, and yes, you should comment right on the task where other people will see it
[07:09:36] <Kizule>	 apergos: Sounds good
[07:09:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T344589)', diff saved to https://phabricator.wikimedia.org/P51210 and previous config saved to /var/cache/conftool/dbconfig/20230824-070946-ladsgroup.json
[07:09:54] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[07:10:49] <apergos>	 *ding*  our trainee is late or has not got the date right, so I will proceed
[07:11:06] <wikibugs>	 (03PS2) 10JMeybohm: admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/950189 (https://phabricator.wikimedia.org/T344177)
[07:11:33] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] [enwiktionary] Remove the Index and Index_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951914 (https://phabricator.wikimedia.org/T344816) (owner: 10Zoranzoki21)
[07:12:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T344589)', diff saved to https://phabricator.wikimedia.org/P51211 and previous config saved to /var/cache/conftool/dbconfig/20230824-071204-ladsgroup.json
[07:12:15] <wikibugs>	 (03Merged) 10jenkins-bot: [enwiktionary] Remove the Index and Index_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951914 (https://phabricator.wikimedia.org/T344816) (owner: 10Zoranzoki21)
[07:12:44] <wikibugs>	 (03PS1) 10Ayounsi: Puppet: remove all mentions of knams [puppet] - 10https://gerrit.wikimedia.org/r/952046 (https://phabricator.wikimedia.org/T344579)
[07:13:05] <logmsgbot>	 !log ariel@deploy1002 Started scap: Backport for [[gerrit:951914|[enwiktionary] Remove the Index and Index_talk namespaces (T344816)]]
[07:13:10] <stashbot>	 T344816: Delete the Index namespace at English Wiktionary - https://phabricator.wikimedia.org/T344816
[07:13:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T344589)', diff saved to https://phabricator.wikimedia.org/P51212 and previous config saved to /var/cache/conftool/dbconfig/20230824-071323-ladsgroup.json
[07:13:29] <wikibugs>	 (03PS1) 10Aklapper: phabricator: Stop logging Bugzilla redirector misses [puppet] - 10https://gerrit.wikimedia.org/r/952047 (https://phabricator.wikimedia.org/T344884)
[07:14:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3003.wikimedia.org
[07:14:40] <logmsgbot>	 !log ariel@deploy1002 zoranzoki21 and ariel: Backport for [[gerrit:951914|[enwiktionary] Remove the Index and Index_talk namespaces (T344816)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:14:57] <apergos>	 Kizule:  your change is live on mwdebug1002, please test it there 
[07:14:57] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/950189 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm)
[07:15:15] <Kizule>	 apergos: Okay, I'll keep you updated.
[07:15:22] <apergos>	 great!
[07:15:38] <wikibugs>	 (03PS1) 10Ayounsi: Homer-public: remove mentions of knams [homer/public] - 10https://gerrit.wikimedia.org/r/952048 (https://phabricator.wikimedia.org/T344579)
[07:16:57] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] dragonfly::dfdaemon: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/951079 (owner: 10Muehlenhoff)
[07:16:59] <Kizule>	 apergos: Good to go
[07:17:19] <apergos>	 okay, proceeding. 
[07:17:24] <logmsgbot>	 !log ariel@deploy1002 zoranzoki21 and ariel: Continuing with sync
[07:17:28] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/950189 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm)
[07:17:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51213 and previous config saved to /var/cache/conftool/dbconfig/20230824-071757-ladsgroup.json
[07:18:04] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[07:18:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P51214 and previous config saved to /var/cache/conftool/dbconfig/20230824-071849-ladsgroup.json
[07:19:25] <wikibugs>	 (03PS1) 10Ayounsi: netbox reports: remove mentions of knams [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/952049 (https://phabricator.wikimedia.org/T344579)
[07:21:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3003.wikimedia.org
[07:21:05] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Homer-public: remove mentions of knams [homer/public] - 10https://gerrit.wikimedia.org/r/952048 (https://phabricator.wikimedia.org/T344579) (owner: 10Ayounsi)
[07:21:27] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] netbox reports: remove mentions of knams [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/952049 (https://phabricator.wikimedia.org/T344579) (owner: 10Ayounsi)
[07:21:57] <wikibugs>	 (03Merged) 10jenkins-bot: netbox reports: remove mentions of knams [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/952049 (https://phabricator.wikimedia.org/T344579) (owner: 10Ayounsi)
[07:22:43] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[07:22:56] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[07:23:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool es2025', diff saved to https://phabricator.wikimedia.org/P51215 and previous config saved to /var/cache/conftool/dbconfig/20230824-072301-ladsgroup.json
[07:23:07] <logmsgbot>	 !log ariel@deploy1002 Finished scap: Backport for [[gerrit:951914|[enwiktionary] Remove the Index and Index_talk namespaces (T344816)]] (duration: 10m 01s)
[07:23:12] <apergos>	 Kizule:  your change is now live in production, please test it there :-) 
[07:23:14] <stashbot>	 T344816: Delete the Index namespace at English Wiktionary - https://phabricator.wikimedia.org/T344816
[07:24:12] <Kizule>	 Looks good, thank you!
[07:24:18] <apergos>	 great!
[07:24:30] <tto>	 thanks Kizule and apergos! (this was actually a task I filed :) )
[07:24:39] <apergos>	 sweet!
[07:24:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P51216 and previous config saved to /var/cache/conftool/dbconfig/20230824-072453-ladsgroup.json
[07:25:11] <Kizule>	 You're welcome. apergos: Can you give me a link to the page where the training is mentioned? I can't find it.
[07:25:21] <apergos>	 sure!
[07:25:35] <apergos>	 https://wikitech.wikimedia.org/wiki/Deployments/Training
[07:25:39] <apergos>	 this talks about it
[07:26:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install4002.wikimedia.org
[07:26:15] <Kizule>	 apergos: Thanks, I wanted that one.
[07:26:24] <apergos>	 if you want to get trained (which I recommend, whether I'm doing it or someone else, we always want more deployers!), you can sign up by making a phab task here: https://phabricator.wikimedia.org/project/board/5265/
[07:26:49] <apergos>	 I mean, tag it with that and it will go right into the backlog for someone to set it up with you.
[07:27:21] <Kizule>	 Yes, that's why I asked. I'm already working on creating a task per the instructions from the page on Wikitech.
[07:27:24] <Kizule>	 Thanks!
[07:27:38] <apergos>	 excellent! maybe I'll see you at one of these sessions as a trainee.  
[07:28:04] <apergos>	 tto:  I'm happy to keep the window open for awhile yet, in case Reedy or someone else shows up who would do that +1
[07:28:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P51217 and previous config saved to /var/cache/conftool/dbconfig/20230824-072829-ladsgroup.json
[07:29:08] <tto>	 Thanks for offering apergos, but all good. Rather than waiting around, let's both go and enjoy our days!
[07:29:23] <apergos>	 ok!  see everyone next time, have a great rest of your day!
[07:30:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4002.wikimedia.org
[07:30:23] <apergos>	 !log UTC morning backport and config deployment window complete
[07:30:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install5002.wikimedia.org
[07:33:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P51218 and previous config saved to /var/cache/conftool/dbconfig/20230824-073304-ladsgroup.json
[07:33:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51219 and previous config saved to /var/cache/conftool/dbconfig/20230824-073355-ladsgroup.json
[07:35:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install5002.wikimedia.org
[07:36:56] <wikibugs>	 10SRE-Access-Requests: MediaWiki deployment shell access request for Kizule (aka Zoranzoki21) - https://phabricator.wikimedia.org/T344887 (10Kizule)
[07:37:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install6002.wikimedia.org
[07:38:46] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[07:39:05] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[07:39:14] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[07:39:36] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[07:39:50] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[07:39:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P51220 and previous config saved to /var/cache/conftool/dbconfig/20230824-073959-ladsgroup.json
[07:41:11] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[07:41:17] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[07:41:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install6002.wikimedia.org
[07:42:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P51221 and previous config saved to /var/cache/conftool/dbconfig/20230824-074216-ladsgroup.json
[07:42:28] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[07:42:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344883
[07:42:40] <stashbot>	 T344883: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T344883
[07:43:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344883
[07:43:04] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server: Add jaeger user to aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/951533 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[07:43:32] <wikibugs>	 (03PS1) 10Muehlenhoff: firewall::service: Create an nftables::service when using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497)
[07:43:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P51222 and previous config saved to /var/cache/conftool/dbconfig/20230824-074336-ladsgroup.json
[07:43:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] firewall::service: Create an nftables::service when using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[07:44:06] <wikibugs>	 (03PS1) 10JMeybohm: Move jaeger from admin_ng to aux services [deployment-charts] - 10https://gerrit.wikimedia.org/r/952052 (https://phabricator.wikimedia.org/T344253)
[07:45:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[07:46:36] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: ClusterConfig: also allow to return hostname (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951047 (owner: 10Giuseppe Lavagetto)
[07:46:45] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Use ClusterConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951046
[07:46:47] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: ClusterConfig: also allow to return hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951047
[07:46:49] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Replace calls to wfHostname with clusterconfig ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951048
[07:46:52] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Disable things that don't work on k8s when on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951049
[07:47:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T343718)', diff saved to https://phabricator.wikimedia.org/P51223 and previous config saved to /var/cache/conftool/dbconfig/20230824-074708-ladsgroup.json
[07:47:14] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[07:48:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P51224 and previous config saved to /var/cache/conftool/dbconfig/20230824-074810-ladsgroup.json
[07:48:28] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Move jaeger from admin_ng to aux services [deployment-charts] - 10https://gerrit.wikimedia.org/r/952052 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[07:49:08] <wikibugs>	 (03Abandoned) 10JMeybohm: Revert "aux: add grpc/http ports for jaeger collector" [deployment-charts] - 10https://gerrit.wikimedia.org/r/950822 (owner: 10Filippo Giunchedi)
[07:49:17] <wikibugs>	 (03PS2) 10Muehlenhoff: firewall::service: Create an nftables::service when using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497)
[07:50:08] <wikibugs>	 (03CR) 10Ladsgroup: ClusterConfig: also allow to return hostname (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951047 (owner: 10Giuseppe Lavagetto)
[07:50:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025 (T344589)', diff saved to https://phabricator.wikimedia.org/P51225 and previous config saved to /var/cache/conftool/dbconfig/20230824-075028-ladsgroup.json
[07:50:55] <wikibugs>	 (03Merged) 10jenkins-bot: Move jaeger from admin_ng to aux services [deployment-charts] - 10https://gerrit.wikimedia.org/r/952052 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[07:50:59] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[07:51:05] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede)
[07:53:17] <wikibugs>	 10sre-alert-triage, 10Release-Engineering-Team: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342755 (10fgiunchedi) >>! In T342755#9114368, @thcipriani wrote: > Hrm. We get an email from the systemd timer for this, so the alert is probably not necessary. >  > We're not very familiar...
[07:54:41] <wikibugs>	 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10fgiunchedi) +1 on my end FWIW
[07:54:51] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:55:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T344589)', diff saved to https://phabricator.wikimedia.org/P51226 and previous config saved to /var/cache/conftool/dbconfig/20230824-075505-ladsgroup.json
[07:55:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1025.eqiad.wmnet with reason: Maintenance
[07:55:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1025.eqiad.wmnet with reason: Maintenance
[07:55:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1025 (T344589)', diff saved to https://phabricator.wikimedia.org/P51227 and previous config saved to /var/cache/conftool/dbconfig/20230824-075529-ladsgroup.json
[07:56:36] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: ClusterConfig: also allow to return hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951047
[07:56:38] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Replace calls to wfHostname with clusterconfig ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951048
[07:56:40] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Disable things that don't work on k8s when on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951049
[07:57:06] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] ClusterConfig: also allow to return hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951047 (owner: 10Giuseppe Lavagetto)
[07:57:10] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: ClusterConfig: also allow to return hostname (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951047 (owner: 10Giuseppe Lavagetto)
[07:57:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P51228 and previous config saved to /var/cache/conftool/dbconfig/20230824-075722-ladsgroup.json
[07:57:44] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:57:50] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[07:58:06] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:58:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T344589)', diff saved to https://phabricator.wikimedia.org/P51229 and previous config saved to /var/cache/conftool/dbconfig/20230824-075842-ladsgroup.json
[07:58:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[07:59:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[07:59:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T344589)', diff saved to https://phabricator.wikimedia.org/P51230 and previous config saved to /var/cache/conftool/dbconfig/20230824-075906-ladsgroup.json
[08:00:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] data-engineering: flink: alert when TM is missing for 5m. [alerts] - 10https://gerrit.wikimedia.org/r/951959 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena)
[08:00:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Patch-For-Review: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10SLyngshede-WMF) The new version of the script have been deployed, but not ye...
[08:01:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet
[08:01:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Convert the monitoring/prometheus ferm rules to a firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[08:02:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P51231 and previous config saved to /var/cache/conftool/dbconfig/20230824-080214-ladsgroup.json
[08:02:34] <wikibugs>	 (03PS1) 10JMeybohm: jaeger: Fix path to helmfile-defaults secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/952111 (https://phabricator.wikimedia.org/T344253)
[08:02:54] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] jaeger: Fix path to helmfile-defaults secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/952111 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[08:03:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51232 and previous config saved to /var/cache/conftool/dbconfig/20230824-080316-ladsgroup.json
[08:03:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[08:03:22] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[08:03:23] <wikibugs>	 (03PS2) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361)
[08:03:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[08:03:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel)
[08:05:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T344589)', diff saved to https://phabricator.wikimedia.org/P51233 and previous config saved to /var/cache/conftool/dbconfig/20230824-080522-ladsgroup.json
[08:05:23] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[08:05:35] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[08:05:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025', diff saved to https://phabricator.wikimedia.org/P51234 and previous config saved to /var/cache/conftool/dbconfig/20230824-080534-ladsgroup.json
[08:06:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Puppet: remove all mentions of knams [puppet] - 10https://gerrit.wikimedia.org/r/952046 (https://phabricator.wikimedia.org/T344579) (owner: 10Ayounsi)
[08:06:23] <wikibugs>	 (03PS3) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361)
[08:07:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet
[08:07:11] <icinga-wm>	 PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos
[08:07:13] <wikibugs>	 (03CR) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel)
[08:07:27] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers thanos-fe1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:07:41] <icinga-wm>	 PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[08:07:51] <icinga-wm>	 PROBLEM - SSH on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:07:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers thanos-fe1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:08:11] <godog>	 sigh, that's me causing a thanos timeout by running a query :(
[08:08:20] <godog>	 should recover by itself
[08:08:28] <slyngs>	 Write better queries :-)
[08:09:21] <godog>	 slyngs: haha!
[08:09:26] * godog frantically hits refresh
[08:09:27] <icinga-wm>	 PROBLEM - thanos.wikimedia.org tls expiry on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[08:09:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2004.wikimedia.org
[08:10:05] <icinga-wm>	 RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 9.669 second response time https://wikitech.wikimedia.org/wiki/Thanos
[08:10:11] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:21] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:10:23] <slyngs>	 Seems funky that a query can cause a tls expiry alert :-)
[08:10:25] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe1004 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[08:10:35] <icinga-wm>	 RECOVERY - SSH on thanos-fe1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:10:39] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Replace calls to wfHostname with clusterconfig ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951048 (owner: 10Giuseppe Lavagetto)
[08:10:45] <icinga-wm>	 RECOVERY - thanos.wikimedia.org tls expiry on thanos-fe1004 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Mon 21 Jul 2025 03:04:56 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[08:10:49] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:11:26] <godog>	 more of a case of the silly check that fires on timeouts
[08:11:30] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/951873 (https://phabricator.wikimedia.org/T344883) (owner: 10Gerrit maintenance bot)
[08:11:53] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/951873 (https://phabricator.wikimedia.org/T344883) (owner: 10Gerrit maintenance bot)
[08:12:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P51235 and previous config saved to /var/cache/conftool/dbconfig/20230824-081229-ladsgroup.json
[08:13:22] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/935397 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert)
[08:14:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2004.wikimedia.org
[08:14:14] <Amir1>	 !log Starting s4 codfw failover from db2140 to db2179 - T344883
[08:14:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:19] <stashbot>	 T344883: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T344883
[08:14:20] <wikibugs>	 (03Merged) 10jenkins-bot: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/935397 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert)
[08:14:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2179 to s4 primary T344883', diff saved to https://phabricator.wikimedia.org/P51236 and previous config saved to /var/cache/conftool/dbconfig/20230824-081442-ladsgroup.json
[08:15:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1003.wikimedia.org
[08:16:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2140 T344883', diff saved to https://phabricator.wikimedia.org/P51237 and previous config saved to /var/cache/conftool/dbconfig/20230824-081654-ladsgroup.json
[08:17:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P51238 and previous config saved to /var/cache/conftool/dbconfig/20230824-081720-ladsgroup.json
[08:17:44] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply
[08:17:50] <wikibugs>	 (03PS1) 10Slyngshede: C:bigtop::hadoop ensure net-topology script is installed. [puppet] - 10https://gerrit.wikimedia.org/r/952112 (https://phabricator.wikimedia.org/T254480)
[08:18:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:bigtop::hadoop ensure net-topology script is installed. [puppet] - 10https://gerrit.wikimedia.org/r/952112 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede)
[08:19:13] <wikibugs>	 (03PS2) 10Filippo Giunchedi: sre: add bandaid alert for prometheus not reloading its k8s certs [alerts] - 10https://gerrit.wikimedia.org/r/951526 (https://phabricator.wikimedia.org/T343529)
[08:19:28] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[08:19:33] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply
[08:19:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: sre: add bandaid alert for prometheus not reloading its k8s certs (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/951526 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi)
[08:20:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1003.wikimedia.org
[08:20:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance
[08:20:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance
[08:21:37] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:22:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for Kizule (aka Zoranzoki21) - https://phabricator.wikimedia.org/T344887 (10taavi) Obligatory reading: {T282786} (2021)  Backports: https://gerrit.wikimedia.org/r/q/owner:Zoranzoki21+-branch:master Config changes: https://gerrit.wikimedia....
[08:22:47] <taavi>	 jouncebot: nowandnext
[08:22:48] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 37 minute(s)
[08:22:48] <jouncebot>	 In 1 hour(s) and 37 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1000)
[08:22:48] <jouncebot>	 In 1 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1000)
[08:23:20] <wikibugs>	 (03PS2) 10Slyngshede: C:bigtop::hadoop ensure net-topology script is installed. [puppet] - 10https://gerrit.wikimedia.org/r/952112 (https://phabricator.wikimedia.org/T254480)
[08:23:23] <wikibugs>	 (03PS3) 10Majavah: Set OATHAuth multiple devices WRITE_BOTH for all fishbowls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951367 (https://phabricator.wikimedia.org/T242031)
[08:24:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951367 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah)
[08:24:11] <wikibugs>	 (03PS1) 10Muehlenhoff: Fail over URL downloaders for reboot [dns] - 10https://gerrit.wikimedia.org/r/952117
[08:25:03] <wikibugs>	 (03Merged) 10jenkins-bot: Set OATHAuth multiple devices WRITE_BOTH for all fishbowls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951367 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah)
[08:25:31] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:951367|Set OATHAuth multiple devices WRITE_BOTH for all fishbowls (T242031)]]
[08:25:39] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43000/console" [puppet] - 10https://gerrit.wikimedia.org/r/952112 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede)
[08:25:40] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[08:26:40] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:bigtop::hadoop ensure net-topology script is installed. [puppet] - 10https://gerrit.wikimedia.org/r/952112 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede)
[08:27:05] <logmsgbot>	 !log taavi@deploy1002 taavi: Backport for [[gerrit:951367|Set OATHAuth multiple devices WRITE_BOTH for all fishbowls (T242031)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[08:27:36] <logmsgbot>	 !log taavi@deploy1002 taavi: Continuing with sync
[08:27:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Change db2179 groups', diff saved to https://phabricator.wikimedia.org/P51239 and previous config saved to /var/cache/conftool/dbconfig/20230824-082742-ladsgroup.json
[08:27:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P51240 and previous config saved to /var/cache/conftool/dbconfig/20230824-082748-ladsgroup.json
[08:27:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance
[08:27:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025', diff saved to https://phabricator.wikimedia.org/P51241 and previous config saved to /var/cache/conftool/dbconfig/20230824-082757-ladsgroup.json
[08:28:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance
[08:28:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T344589)', diff saved to https://phabricator.wikimedia.org/P51242 and previous config saved to /var/cache/conftool/dbconfig/20230824-082814-ladsgroup.json
[08:28:27] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] sre: add bandaid alert for prometheus not reloading its k8s certs [alerts] - 10https://gerrit.wikimedia.org/r/951526 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi)
[08:28:57] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/termbox: apply
[08:29:49] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply
[08:30:07] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_drmrs and A:cp
[08:30:18] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/termbox: apply
[08:30:33] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_drmrs and A:cp
[08:30:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: prometheus: Add recording rules for istio traffic on k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman)
[08:30:56] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply
[08:32:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T343718)', diff saved to https://phabricator.wikimedia.org/P51243 and previous config saved to /var/cache/conftool/dbconfig/20230824-083226-ladsgroup.json
[08:32:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[08:32:32] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[08:32:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[08:32:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T343718)', diff saved to https://phabricator.wikimedia.org/P51244 and previous config saved to /var/cache/conftool/dbconfig/20230824-083248-ladsgroup.json
[08:33:16] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:951367|Set OATHAuth multiple devices WRITE_BOTH for all fishbowls (T242031)]] (duration: 07m 45s)
[08:33:21] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[08:33:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:35:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T344589)', diff saved to https://phabricator.wikimedia.org/P51245 and previous config saved to /var/cache/conftool/dbconfig/20230824-083537-ladsgroup.json
[08:35:43] <wikibugs>	 (03CR) 10Klausman: prometheus: Add recording rules for istio traffic on k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman)
[08:35:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[08:36:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[08:36:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[08:36:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T343718)', diff saved to https://phabricator.wikimedia.org/P51246 and previous config saved to /var/cache/conftool/dbconfig/20230824-083644-ladsgroup.json
[08:37:22] <wikibugs>	 (03PS4) 10Klausman: prometheus: Add recording rules for istio traffic on k8s [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620)
[08:37:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Convert the monitoring/prometheus ferm rules to a firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[08:38:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:40:41] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43001/console" [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel)
[08:40:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T343718)', diff saved to https://phabricator.wikimedia.org/P51247 and previous config saved to /var/cache/conftool/dbconfig/20230824-084055-ladsgroup.json
[08:40:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[08:41:01] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[08:41:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance
[08:41:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance
[08:42:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Add recording rules for istio traffic on k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman)
[08:42:30] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1026 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025 (T344589)', diff saved to https://phabricator.wikimedia.org/P51248 and previous config saved to /var/cache/conftool/dbconfig/20230824-084303-ladsgroup.json
[08:43:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P51249 and previous config saved to /var/cache/conftool/dbconfig/20230824-084304-ladsgroup.json
[08:50:37] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Disable things that don't work on k8s when on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951049 (owner: 10Giuseppe Lavagetto)
[08:50:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P51250 and previous config saved to /var/cache/conftool/dbconfig/20230824-085044-ladsgroup.json
[08:51:37] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] Start Blazegraph from systemd unit, without runBlazegraph.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel)
[08:52:55] <wikibugs>	 (03PS1) 10Kamila Součková: benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095)
[08:53:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková)
[08:55:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025 (T344589)', diff saved to https://phabricator.wikimedia.org/P51251 and previous config saved to /var/cache/conftool/dbconfig/20230824-085551-ladsgroup.json
[08:56:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P51252 and previous config saved to /var/cache/conftool/dbconfig/20230824-085602-ladsgroup.json
[08:56:08] <wikibugs>	 (03PS2) 10Kamila Součková: benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095)
[08:56:14] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] Start Blazegraph from systemd unit, without runBlazegraph.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel)
[08:56:26] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1026 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:56:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková)
[08:58:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:58:08] <wikibugs>	 (03PS3) 10Kamila Součková: benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095)
[08:58:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T344589)', diff saved to https://phabricator.wikimedia.org/P51253 and previous config saved to /var/cache/conftool/dbconfig/20230824-085810-ladsgroup.json
[08:58:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[08:58:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[08:58:30] <wikibugs>	 (03CR) 10jenkins-bot: benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková)
[08:58:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1222 (T344589)', diff saved to https://phabricator.wikimedia.org/P51254 and previous config saved to /var/cache/conftool/dbconfig/20230824-085834-ladsgroup.json
[09:00:12] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[09:00:18] <wikibugs>	 (03PS4) 10Kamila Součková: benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095)
[09:00:50] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1130 days) https://wikitech.wikimedia.org/wiki/Logs
[09:03:03] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:03:42] <wikibugs>	 (03PS1) 10Slyngshede: C:bigtop::hadoop Fix script path [puppet] - 10https://gerrit.wikimedia.org/r/952125 (https://phabricator.wikimedia.org/T254480)
[09:04:44] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:04] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43003/console" [puppet] - 10https://gerrit.wikimedia.org/r/952125 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede)
[09:05:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P51255 and previous config saved to /var/cache/conftool/dbconfig/20230824-090550-ladsgroup.json
[09:05:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[09:05:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T344589)', diff saved to https://phabricator.wikimedia.org/P51256 and previous config saved to /var/cache/conftool/dbconfig/20230824-090559-ladsgroup.json
[09:06:56] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:bigtop::hadoop Fix script path [puppet] - 10https://gerrit.wikimedia.org/r/952125 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede)
[09:08:59] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "Looks good. I'm happy to +1 after we sort out the duplication (or decide to ignore it)." [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel)
[09:10:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025', diff saved to https://phabricator.wikimedia.org/P51257 and previous config saved to /var/cache/conftool/dbconfig/20230824-091057-ladsgroup.json
[09:10:58] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[09:11:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P51258 and previous config saved to /var/cache/conftool/dbconfig/20230824-091108-ladsgroup.json
[09:13:17] <wikibugs>	 (03PS1) 10Jbond: Revert "C:bigtop::hadoop move net-topology.py to files." [puppet] - 10https://gerrit.wikimedia.org/r/952129
[09:13:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "C:bigtop::hadoop move net-topology.py to files." [puppet] - 10https://gerrit.wikimedia.org/r/952129 (owner: 10Jbond)
[09:13:35] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43004/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková)
[09:15:21] <wikibugs>	 (03Abandoned) 10Jbond: Revert "C:bigtop::hadoop move net-topology.py to files." [puppet] - 10https://gerrit.wikimedia.org/r/952129 (owner: 10Jbond)
[09:15:37] <jbond>	 slyngs: fyi your patch is causing https://puppetboard.wikimedia.org/nodes?status=failed, working on fix now (cc btullis)
[09:16:04] <btullis>	 jbond: Thank you.
[09:17:39] <slyngs>	 I already fixed it
[09:17:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T343718)', diff saved to https://phabricator.wikimedia.org/P51259 and previous config saved to /var/cache/conftool/dbconfig/20230824-091741-ladsgroup.json
[09:17:47] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[09:17:50] <slyngs>	 jbond: https://gerrit.wikimedia.org/r/c/operations/puppet/+/952125/1/modules/bigtop/manifests/hadoop.pp
[09:18:20] <jbond>	 that explains why i can't see the problem
[09:18:43] <wikibugs>	 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10Bengali-Sites, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SCherukuwada) We're meeting with them in the next couple of weeks to troubleshoot our scraping problems. Will report back once w...
[09:19:20] <slyngs>	 To be fair neither could I, so I had to compare it to the beeline patch I did earlier
[09:19:22] <jbond>	 slyngs: good to run the following once you send a fix (running now)
[09:19:23] <jbond>	 sudo cumin -p0 -b 40 '*' 'run-puppet-agent --failed-only -q'
[09:19:28] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[09:20:06] <slyngs>	 Oh that's a lot of hosts
[09:20:15] <jbond>	 just needed to wait 5 more mins for the recovery :)
[09:20:32] <jbond>	 yes but it is a no op unless puppet has failed
[09:20:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T344589)', diff saved to https://phabricator.wikimedia.org/P51260 and previous config saved to /var/cache/conftool/dbconfig/20230824-092056-ladsgroup.json
[09:21:00] <jbond>	 in this case you could have used C:bigtop::hadoop to limit things
[09:21:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[09:21:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P51261 and previous config saved to /var/cache/conftool/dbconfig/20230824-092105-ladsgroup.json
[09:21:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[09:21:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T344589)', diff saved to https://phabricator.wikimedia.org/P51262 and previous config saved to /var/cache/conftool/dbconfig/20230824-092122-ladsgroup.json
[09:21:30] <slyngs>	 Oh, yeah, that would have been faster... I'll just let it run
[09:21:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51263 and previous config saved to /var/cache/conftool/dbconfig/20230824-092147-ladsgroup.json
[09:21:49] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe)
[09:23:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Add recording rules for istio traffic on k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman)
[09:25:51] <wikibugs>	 (03PS5) 10Klausman: prometheus: Add recording rules for istio traffic on k8s [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620)
[09:25:57] <wikibugs>	 (03CR) 10Klausman: prometheus: Add recording rules for istio traffic on k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman)
[09:26:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025', diff saved to https://phabricator.wikimedia.org/P51264 and previous config saved to /var/cache/conftool/dbconfig/20230824-092603-ladsgroup.json
[09:26:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T343718)', diff saved to https://phabricator.wikimedia.org/P51265 and previous config saved to /var/cache/conftool/dbconfig/20230824-092614-ladsgroup.json
[09:26:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[09:26:19] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[09:26:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[09:26:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T343718)', diff saved to https://phabricator.wikimedia.org/P51266 and previous config saved to /var/cache/conftool/dbconfig/20230824-092636-ladsgroup.json
[09:26:50] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1026 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:27:42] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: Add recording rules for istio traffic on k8s [puppet] - https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: Klausman)
[09:27:47] <slyngs>	 btullis: Sorry about that, should be all good now
[09:28:12] <btullis>	 All good, thanks <3
[09:28:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T343718)', diff saved to https://phabricator.wikimedia.org/P51267 and previous config saved to /var/cache/conftool/dbconfig/20230824-092846-ladsgroup.json
[09:28:48] <wikibugs>	 (CR) Muehlenhoff: [C: +2] aptrepo: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/951155 (owner: Muehlenhoff)
[09:30:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T344589)', diff saved to https://phabricator.wikimedia.org/P51268 and previous config saved to /var/cache/conftool/dbconfig/20230824-093008-ladsgroup.json
[09:32:09] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host karapace1001.eqiad.wmnet
[09:32:20] <wikibugs>	 (PS6) Klausman: prometheus: Add recording rules for istio traffic on k8s [puppet] - https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620)
[09:32:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P51269 and previous config saved to /var/cache/conftool/dbconfig/20230824-093247-ladsgroup.json
[09:33:23] <wikibugs>	 (PS1) Clément Goubert: envoy: Add concurrency control to envoy cmdline [docker-images/production-images] - https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814)
[09:35:13] <wikibugs>	 (PS2) Clément Goubert: envoy: Add concurrency control to envoy cmdline [docker-images/production-images] - https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814)
[09:35:33] <wikibugs>	 (CR) Filippo Giunchedi: "Nicely done! LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[09:36:10] <icinga-wm>	 PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P51270 and previous config saved to /var/cache/conftool/dbconfig/20230824-093611-ladsgroup.json
[09:36:22] <wikibugs>	 (PS3) Clément Goubert: envoy: Add concurrency control to envoy cmdline [docker-images/production-images] - https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814)
[09:36:31] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host karapace1001.eqiad.wmnet
[09:36:32] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet
[09:37:39] <wikibugs>	 (PS1) Mvolz: Update Zotero [deployment-charts] - https://gerrit.wikimedia.org/r/952149 (https://phabricator.wikimedia.org/T118773)
[09:40:23] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet
[09:41:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025 (T344589)', diff saved to https://phabricator.wikimedia.org/P51271 and previous config saved to /var/cache/conftool/dbconfig/20230824-094109-ladsgroup.json
[09:41:55] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet
[09:42:43] <wikibugs>	 (CR) Ayounsi: [C: +2] Puppet: remove all mentions of knams [puppet] - https://gerrit.wikimedia.org/r/952046 (https://phabricator.wikimedia.org/T344579) (owner: Ayounsi)
[09:42:43] <moritzm>	 !log removed stretch-wikimedia from apt.wikimedia.org (obsolete)
[09:42:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P51272 and previous config saved to /var/cache/conftool/dbconfig/20230824-094352-ladsgroup.json
[09:44:48] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] envoy: Add concurrency control to envoy cmdline (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[09:45:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P51273 and previous config saved to /var/cache/conftool/dbconfig/20230824-094515-ladsgroup.json
[09:45:27] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host karapace1002.eqiad.wmnet
[09:45:41] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test2001.codfw.wmnet
[09:45:43] <wikibugs>	 (CR) Alexandros Kosiaris: envoy: Add concurrency control to envoy cmdline (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[09:45:47] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] envoy: Add concurrency control to envoy cmdline [docker-images/production-images] - https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[09:45:50] <wikibugs>	 (CR) Klausman: [C: +2] prometheus: Add recording rules for istio traffic on k8s [puppet] - https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: Klausman)
[09:46:32] <wikibugs>	 (PS1) Muehlenhoff: profile::environment: Simplify environment variable export [puppet] - https://gerrit.wikimedia.org/r/952150
[09:47:07] <wikibugs>	 (CR) Clément Goubert: envoy: Add concurrency control to envoy cmdline (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[09:47:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P51274 and previous config saved to /var/cache/conftool/dbconfig/20230824-094753-ladsgroup.json
[09:47:57] <wikibugs>	 (PS1) Filippo Giunchedi: hieradata: add jaeger collector to service catalog [puppet] - https://gerrit.wikimedia.org/r/952151 (https://phabricator.wikimedia.org/T344253)
[09:49:49] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host karapace1002.eqiad.wmnet
[09:51:15] <wikibugs>	 (PS1) JMeybohm: jeager: Add networkpolicy support to es-index-cleaner [deployment-charts] - https://gerrit.wikimedia.org/r/952152 (https://phabricator.wikimedia.org/T344253)
[09:51:17] <wikibugs>	 (PS1) JMeybohm: jeager: Disable creation of service accounts, add networkPolicy [deployment-charts] - https://gerrit.wikimedia.org/r/952153 (https://phabricator.wikimedia.org/T344253)
[09:51:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T344589)', diff saved to https://phabricator.wikimedia.org/P51275 and previous config saved to /var/cache/conftool/dbconfig/20230824-095117-ladsgroup.json
[09:51:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[09:51:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[09:52:01] <fabfur>	 !log reboot lvs1020 to apply patch (T344587)
[09:52:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:11] <wikibugs>	 (CR) CI reject: [V: -1] jeager: Disable creation of service accounts, add networkPolicy [deployment-charts] - https://gerrit.wikimedia.org/r/952153 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[09:52:29] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952150 (owner: Muehlenhoff)
[09:53:04] <icinga-wm>	 RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:53:41] <wikibugs>	 (PS1) Muehlenhoff: Remove stretch Cumin alias [puppet] - https://gerrit.wikimedia.org/r/952154
[09:53:42] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet
[09:54:15] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1020.eqiad.wmnet
[09:56:31] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove stretch Cumin alias [puppet] - https://gerrit.wikimedia.org/r/952154 (owner: Muehlenhoff)
[09:56:41] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] jeager: Add networkpolicy support to es-index-cleaner [deployment-charts] - https://gerrit.wikimedia.org/r/952152 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[09:57:21] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1020.eqiad.wmnet
[09:57:25] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1001.eqiad.wmnet
[09:57:39] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:58:10] <wikibugs>	 (PS1) Effie Mouzeli: thanos-fe: switch to cfssl [puppet] - https://gerrit.wikimedia.org/r/952156 (https://phabricator.wikimedia.org/T343987)
[09:58:14] <wikibugs>	 (PS5) Kamila Součková: benthos: add instance for calculating MW latencies [puppet] - https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095)
[09:58:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P51276 and previous config saved to /var/cache/conftool/dbconfig/20230824-095858-ladsgroup.json
[09:59:07] <wikibugs>	 (PS2) JMeybohm: jeager: Add networkpolicy support to es-index-cleaner [deployment-charts] - https://gerrit.wikimedia.org/r/952152 (https://phabricator.wikimedia.org/T344253)
[09:59:09] <wikibugs>	 (PS2) JMeybohm: jeager: Disable creation of service accounts, add networkPolicy [deployment-charts] - https://gerrit.wikimedia.org/r/952153 (https://phabricator.wikimedia.org/T344253)
[09:59:45] <wikibugs>	 (CR) Kamila Součková: "Thank you Filippo!" [puppet] - https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[10:00:05] <jouncebot>	 mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1000).
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1000)
[10:00:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P51277 and previous config saved to /var/cache/conftool/dbconfig/20230824-100021-ladsgroup.json
[10:00:58] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host pybal-test2003.codfw.wmnet
[10:01:51] <wikibugs>	 (CR) Mvolz: [C: +2] Update Zotero [deployment-charts] - https://gerrit.wikimedia.org/r/952149 (https://phabricator.wikimedia.org/T118773) (owner: Mvolz)
[10:02:02] <fabfur>	 !log end reboot of lvs1020 (pybal service enabled) (T344587)
[10:02:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:32] <wikibugs>	 (Merged) jenkins-bot: Update Zotero [deployment-charts] - https://gerrit.wikimedia.org/r/952149 (https://phabricator.wikimedia.org/T118773) (owner: Mvolz)
[10:02:39] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on acmechief1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:03:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T343718)', diff saved to https://phabricator.wikimedia.org/P51278 and previous config saved to /var/cache/conftool/dbconfig/20230824-100259-ladsgroup.json
[10:03:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[10:03:04] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[10:03:13] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[10:03:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[10:03:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T343718)', diff saved to https://phabricator.wikimedia.org/P51279 and previous config saved to /var/cache/conftool/dbconfig/20230824-100321-ladsgroup.json
[10:03:33] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[10:03:53] <jinxer-wm>	 (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[10:04:29] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] jeager: Disable creation of service accounts, add networkPolicy [deployment-charts] - https://gerrit.wikimedia.org/r/952153 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[10:04:42] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2003.codfw.wmnet
[10:05:40] <wikibugs>	 (PS2) Effie Mouzeli: thanos-fe: switch to cfssl [puppet] - https://gerrit.wikimedia.org/r/952156 (https://phabricator.wikimedia.org/T343987)
[10:05:55] <wikibugs>	 (PS1) Muehlenhoff: Update various comments [puppet] - https://gerrit.wikimedia.org/r/952157
[10:06:13] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply
[10:06:43] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[10:07:32] <wikibugs>	 (CR) JMeybohm: [C: +2] jeager: Add networkpolicy support to es-index-cleaner [deployment-charts] - https://gerrit.wikimedia.org/r/952152 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[10:07:42] <wikibugs>	 (CR) JMeybohm: [C: +2] jeager: Disable creation of service accounts, add networkPolicy [deployment-charts] - https://gerrit.wikimedia.org/r/952153 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[10:07:49] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] thanos-fe: switch to cfssl [puppet] - https://gerrit.wikimedia.org/r/952156 (https://phabricator.wikimedia.org/T343987) (owner: Effie Mouzeli)
[10:08:15] <wikibugs>	 (Merged) jenkins-bot: jeager: Add networkpolicy support to es-index-cleaner [deployment-charts] - https://gerrit.wikimedia.org/r/952152 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[10:08:21] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[10:08:22] <wikibugs>	 (Merged) jenkins-bot: jeager: Disable creation of service accounts, add networkPolicy [deployment-charts] - https://gerrit.wikimedia.org/r/952153 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[10:08:53] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[10:08:56] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[10:09:34] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "PCC says yes https://puppet-compiler.wmflabs.org/output/952121/43005/centrallog1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[10:11:35] <wikibugs>	 (CR) Mvolz: [C: +2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/951867 (owner: PipelineBot)
[10:12:18] <wikibugs>	 (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/951867 (owner: PipelineBot)
[10:12:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Update various comments [puppet] - https://gerrit.wikimedia.org/r/952157 (owner: Muehlenhoff)
[10:13:29] <wikibugs>	 (CR) Kamila Součková: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43007/console" [puppet] - https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[10:14:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T343718)', diff saved to https://phabricator.wikimedia.org/P51280 and previous config saved to /var/cache/conftool/dbconfig/20230824-101405-ladsgroup.json
[10:14:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[10:14:11] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[10:14:31] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply
[10:14:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[10:14:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T343718)', diff saved to https://phabricator.wikimedia.org/P51281 and previous config saved to /var/cache/conftool/dbconfig/20230824-101437-ladsgroup.json
[10:14:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] sre: add bandaid alert for prometheus not reloading its k8s certs [alerts] - https://gerrit.wikimedia.org/r/951526 (https://phabricator.wikimedia.org/T343529) (owner: Filippo Giunchedi)
[10:14:52] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[10:15:16] <effie>	 !log Disable puppet on thanos-fe (eqiad), rollout cfssl on thanos-fe in codfw
[10:15:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T344589)', diff saved to https://phabricator.wikimedia.org/P51282 and previous config saved to /var/cache/conftool/dbconfig/20230824-101527-ladsgroup.json
[10:15:44] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] thanos-fe: switch to cfssl [puppet] - https://gerrit.wikimedia.org/r/952156 (https://phabricator.wikimedia.org/T343987) (owner: Effie Mouzeli)
[10:16:06] <wikibugs>	 (PS1) Clément Goubert: mesh: Add concurrency control for envoy workers [deployment-charts] - https://gerrit.wikimedia.org/r/952158 (https://phabricator.wikimedia.org/T344814)
[10:16:09] <wikibugs>	 (PS1) Clément Goubert: mediawiki: Remove limits for tls-proxy container [deployment-charts] - https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814)
[10:16:15] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply
[10:16:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T343718)', diff saved to https://phabricator.wikimedia.org/P51283 and previous config saved to /var/cache/conftool/dbconfig/20230824-101647-ladsgroup.json
[10:16:50] <wikibugs>	 (CR) CI reject: [V: -1] mediawiki: Remove limits for tls-proxy container [deployment-charts] - https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[10:17:21] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Update various comments [puppet] - https://gerrit.wikimedia.org/r/952157 (owner: Muehlenhoff)
[10:17:32] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: Dduvall)
[10:17:53] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[10:18:36] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[10:18:40] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:20] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[10:20:06] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:20:09] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[10:21:04] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[10:21:48] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[10:22:11] <effie>	 !log pool kartotherian on codfw 
[10:22:13] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[10:22:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:39] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw
[10:23:00] <wikibugs>	 (PS1) Btullis: Increase the kafka-jumbo maximum message size to 10 MB [puppet] - https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959)
[10:24:14] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/952003 (https://phabricator.wikimedia.org/T344868) (owner: JHathaway)
[10:24:43] <wikibugs>	 (PS1) Muehlenhoff: haproxy: Simplify systemd wrapper [puppet] - https://gerrit.wikimedia.org/r/952161
[10:25:17] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[10:25:31] <wikibugs>	 (CR) Jbond: [C: +1] Make nftables::service types more compatible (1 comment) [puppet] - https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[10:26:17] <wikibugs>	 (PS1) Muehlenhoff: statsite: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/952162
[10:26:22] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: Btullis)
[10:28:07] <wikibugs>	 (PS2) Clément Goubert: mesh: Add concurrency control for envoy workers [deployment-charts] - https://gerrit.wikimedia.org/r/952158 (https://phabricator.wikimedia.org/T344814)
[10:28:09] <wikibugs>	 (PS2) Clément Goubert: mediawiki: Remove limits for tls-proxy container [deployment-charts] - https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814)
[10:28:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51284 and previous config saved to /var/cache/conftool/dbconfig/20230824-102848-ladsgroup.json
[10:29:17] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[10:31:13] <wikibugs>	 (CR) Btullis: "I wonder about whether we need to notify any kafka-jumbo clients about the increase in maximum message size." [puppet] - https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: Btullis)
[10:31:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P51285 and previous config saved to /var/cache/conftool/dbconfig/20230824-103153-ladsgroup.json
[10:32:46] <fabfur>	 !log stopping pybal and rebooting lvs1019 (T344587) 
[10:32:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:53] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:34:19] <wikibugs>	 (CR) Btullis: "Do we need to apply this change in deployment-prep as well?" [puppet] - https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: Btullis)
[10:34:59] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[10:39:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T343718)', diff saved to https://phabricator.wikimedia.org/P51286 and previous config saved to /var/cache/conftool/dbconfig/20230824-103948-ladsgroup.json
[10:39:54] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[10:42:41] <wikibugs>	 (CR) Kamila Součková: [V: +1 C: +2] benthos: add instance for calculating MW latencies [puppet] - https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[10:43:53] <jinxer-wm>	 (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[10:43:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P51287 and previous config saved to /var/cache/conftool/dbconfig/20230824-104354-ladsgroup.json
[10:47:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P51288 and previous config saved to /var/cache/conftool/dbconfig/20230824-104659-ladsgroup.json
[10:48:44] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[10:49:16] <icinga-wm>	 PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[10:49:48] <fabfur>	 ^^ expected
[10:49:54] <fabfur>	 this is me
[10:51:41] <wikibugs>	 (PS1) Kamila Součková: benthos: fix missing quotes in config file [puppet] - https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095)
[10:51:46] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal
[10:53:53] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:54:00] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (cmooney) >>! In T344547#9108360, @ayounsi wrote: > Some downsides I can think of: additional config, more complex to troubleshoot (more prefixes in the routing t...
[10:54:31] <wikibugs>	 (PS2) Kamila Součková: benthos: fix missing quotes in config file [puppet] - https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095)
[10:54:41] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "I can't count the number of times this has bit me..." [puppet] - https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[10:54:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P51289 and previous config saved to /var/cache/conftool/dbconfig/20230824-105454-ladsgroup.json
[10:55:06] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] benthos: fix missing quotes in config file [puppet] - https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[10:55:48] <wikibugs>	 (CR) Kamila Součková: [C: +2] benthos: fix missing quotes in config file [puppet] - https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[10:56:23] <wikibugs>	 (CR) Kamila Součková: [V: +2 C: +2] benthos: fix missing quotes in config file [puppet] - https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[10:58:00] <wikibugs>	 (CR) Kamila Součková: [V: +1 C: +2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43008/console" [puppet] - https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[10:59:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P51290 and previous config saved to /var/cache/conftool/dbconfig/20230824-105900-ladsgroup.json
[11:02:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T343718)', diff saved to https://phabricator.wikimedia.org/P51291 and previous config saved to /var/cache/conftool/dbconfig/20230824-110206-ladsgroup.json
[11:02:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[11:02:12] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[11:02:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[11:02:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T343718)', diff saved to https://phabricator.wikimedia.org/P51292 and previous config saved to /var/cache/conftool/dbconfig/20230824-110226-ladsgroup.json
[11:02:29] <godog>	 kamila_ _joe_ I was convinced CI would validate yaml in /files/ by itself, clearly I was misremembering 
[11:02:41] <godog>	 lunch, bbl
[11:02:57] <kamila_>	 apparently not :D 
[11:03:30] <kamila_>	 (that's a config file that happens to be yaml, not a puppet yaml file though... should it validate in that case?)
[11:03:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:03:53] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:04:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:05:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T343718)', diff saved to https://phabricator.wikimedia.org/P51293 and previous config saved to /var/cache/conftool/dbconfig/20230824-110537-ladsgroup.json
[11:09:44] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1019.eqiad.wmnet
[11:10:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P51294 and previous config saved to /var/cache/conftool/dbconfig/20230824-111001-ladsgroup.json
[11:10:03] <jinxer-wm>	 (RedisMemoryFull) firing: (6) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[11:12:49] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1019.eqiad.wmnet
[11:12:52] <icinga-wm>	 PROBLEM - Host lvs1019 is DOWN: PING CRITICAL - Packet loss = 100%
[11:13:00] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp3074 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[11:13:02] <icinga-wm>	 RECOVERY - Host lvs1019 is UP: PING OK - Packet loss = 0%, RTA = 1.47 ms
[11:13:04] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[11:13:34] <icinga-wm>	 PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[11:14:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51295 and previous config saved to /var/cache/conftool/dbconfig/20230824-111407-ladsgroup.json
[11:14:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[11:14:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[11:14:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:14:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T344589)', diff saved to https://phabricator.wikimedia.org/P51296 and previous config saved to /var/cache/conftool/dbconfig/20230824-111432-ladsgroup.json
[11:15:00] <icinga-wm>	 RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[11:15:12] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[11:16:02] <fabfur>	 !log lvs1019 up and running (T344587) 
[11:16:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:31] <wikibugs>	 (PS3) Clément Goubert: mediawiki: Remove limits for tls-proxy container [deployment-charts] - https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814)
[11:17:52] <wikibugs>	 (PS1) Clément Goubert: mediawiki: Generalize tls-proxy limits removal [deployment-charts] - https://gerrit.wikimedia.org/r/952171 (https://phabricator.wikimedia.org/T344814)
[11:18:54] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 81 connections established with conf1007.eqiad.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal
[11:20:16] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/952150 (owner: Muehlenhoff)
[11:20:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P51297 and previous config saved to /var/cache/conftool/dbconfig/20230824-112043-ladsgroup.json
[11:20:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T344589)', diff saved to https://phabricator.wikimedia.org/P51298 and previous config saved to /var/cache/conftool/dbconfig/20230824-112052-ladsgroup.json
[11:23:22] <fabfur>	 varnishkafka-webrequest service is stopped on cp3074, is it something someone is working on? 
[11:23:58] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:25:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T343718)', diff saved to https://phabricator.wikimedia.org/P51299 and previous config saved to /var/cache/conftool/dbconfig/20230824-112507-ladsgroup.json
[11:25:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[11:25:13] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[11:25:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[11:25:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[11:25:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[11:25:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T343718)', diff saved to https://phabricator.wikimedia.org/P51300 and previous config saved to /var/cache/conftool/dbconfig/20230824-112532-ladsgroup.json
[11:26:02] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp3074 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[11:26:17] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/951877
[11:28:42] <btullis>	 fabfur: Thanks. As per #wikimedia-sre, I went ahead and started the varnishkafka-webrequest service on cp3074
[11:28:53] <jinxer-wm>	 (RedisMemoryFull) firing: (6) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[11:29:00] <fabfur>	 btullis: thank you!!
[11:31:35] <taavi>	 !log foreachwikiindblist fishbowl extensions/OATHAuth/maintenance/UpdateForMultipleDevicesSupport.php | tee oathauth-multiple-fishbowl.log # T242031
[11:31:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:40] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[11:32:46] <jinxer-wm>	 (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got better   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[11:35:04] <wikibugs>	 (PS1) Majavah: Set OATHAuth multiple devices WRITE_BOTH for all privates [mediawiki-config] - https://gerrit.wikimedia.org/r/952184 (https://phabricator.wikimedia.org/T242031)
[11:35:08] <wikibugs>	 (PS1) Majavah: Set OATHAuth multiple devices READ_NEW for checkuser, techconduct [mediawiki-config] - https://gerrit.wikimedia.org/r/952185 (https://phabricator.wikimedia.org/T242031)
[11:35:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P51301 and previous config saved to /var/cache/conftool/dbconfig/20230824-113550-ladsgroup.json
[11:35:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P51302 and previous config saved to /var/cache/conftool/dbconfig/20230824-113559-ladsgroup.json
[11:42:09] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952161 (owner: Muehlenhoff)
[11:43:08] <wikibugs>	 (CR) Muehlenhoff: firewall::service: Create an nftables::service when using the nft provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[11:48:30] <wikibugs>	 (PS3) Muehlenhoff: firewall::service: Create an nftables::service when using the nft provider [puppet] - https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497)
[11:48:32] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply
[11:49:11] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply
[11:50:14] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1614 days) https://wikitech.wikimedia.org/wiki/Logs
[11:50:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T343718)', diff saved to https://phabricator.wikimedia.org/P51303 and previous config saved to /var/cache/conftool/dbconfig/20230824-115056-ladsgroup.json
[11:50:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[11:51:01] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[11:51:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P51304 and previous config saved to /var/cache/conftool/dbconfig/20230824-115105-ladsgroup.json
[11:51:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[11:52:46] <jinxer-wm>	 (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got better   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[11:54:37] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Make nftables::service types more compatible [puppet] - https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[11:56:40] <wikibugs>	 (PS1) Jaime Nuche: doc: rename user for rsyncing docs [puppet] - https://gerrit.wikimedia.org/r/952189
[12:00:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance
[12:00:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance
[12:02:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T343718)', diff saved to https://phabricator.wikimedia.org/P51305 and previous config saved to /var/cache/conftool/dbconfig/20230824-120218-ladsgroup.json
[12:02:24] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[12:02:50] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_drmrs and A:cp
[12:03:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[12:03:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[12:03:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T344589)', diff saved to https://phabricator.wikimedia.org/P51306 and previous config saved to /var/cache/conftool/dbconfig/20230824-120352-ladsgroup.json
[12:04:28] <wikibugs>	 (PS1) Clément Goubert: mediawiki: Add egress rules for termbox-test [deployment-charts] - https://gerrit.wikimedia.org/r/952191 (https://phabricator.wikimedia.org/T344904)
[12:06:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T344589)', diff saved to https://phabricator.wikimedia.org/P51307 and previous config saved to /var/cache/conftool/dbconfig/20230824-120611-ladsgroup.json
[12:06:31] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_drmrs and A:cp
[12:06:34] <wikibugs>	 (PS1) Btullis: Fail over hive services to an-coord1002 [dns] - https://gerrit.wikimedia.org/r/952193 (https://phabricator.wikimedia.org/T344671)
[12:06:36] <wikibugs>	 (PS1) Btullis: Fail back hive to the primary coordinator [dns] - https://gerrit.wikimedia.org/r/952194 (https://phabricator.wikimedia.org/T344671)
[12:06:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[12:06:58] <wikibugs>	 (PS1) Muehlenhoff: os-reports: Remove Stretch, add stub entry for Bullseye (data updates still needed) [puppet] - https://gerrit.wikimedia.org/r/952195
[12:07:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[12:07:43] <wikibugs>	 (CR) Btullis: [C: +2] Fail over hive services to an-coord1002 [dns] - https://gerrit.wikimedia.org/r/952193 (https://phabricator.wikimedia.org/T344671) (owner: Btullis)
[12:08:21] <wikibugs>	 (PS2) Clément Goubert: mediawiki: Add egress rules for termbox-test [deployment-charts] - https://gerrit.wikimedia.org/r/952191 (https://phabricator.wikimedia.org/T344904)
[12:08:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:09:09] <wikibugs>	 (PS1) Muehlenhoff: mediawiki::php: Remove check [puppet] - https://gerrit.wikimedia.org/r/952197
[12:09:17] <wikibugs>	 (CR) CI reject: [V: -1] os-reports: Remove Stretch, add stub entry for Bullseye (data updates still needed) [puppet] - https://gerrit.wikimedia.org/r/952195 (owner: Muehlenhoff)
[12:09:59] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16378&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[12:10:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T344589)', diff saved to https://phabricator.wikimedia.org/P51308 and previous config saved to /var/cache/conftool/dbconfig/20230824-121024-ladsgroup.json
[12:10:30] <wikibugs>	 (PS1) Muehlenhoff: etcd: Remove obsolete check [puppet] - https://gerrit.wikimedia.org/r/952198
[12:11:21] <wikibugs>	 (CR) Joal: [C: +1] data-engineering: flink: alert when TM is missing for 5m. [alerts] - https://gerrit.wikimedia.org/r/951959 (https://phabricator.wikimedia.org/T340666) (owner: Gmodena)
[12:11:23] <wikibugs>	 (PS2) Muehlenhoff: os-reports: Remove Stretch, add stub entry for Bullseye [puppet] - https://gerrit.wikimedia.org/r/952195
[12:11:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[12:11:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[12:11:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T344589)', diff saved to https://phabricator.wikimedia.org/P51309 and previous config saved to /var/cache/conftool/dbconfig/20230824-121158-ladsgroup.json
[12:12:52] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] mediawiki: Add egress rules for termbox-test [deployment-charts] - https://gerrit.wikimedia.org/r/952191 (https://phabricator.wikimedia.org/T344904) (owner: Clément Goubert)
[12:13:48] <wikibugs>	 (PS1) Giuseppe Lavagetto: httpd: fix ecs logging event duration format [docker-images/production-images] - https://gerrit.wikimedia.org/r/952199
[12:13:53] <jinxer-wm>	 (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[12:13:56] <claime>	 jouncebot: nowandnext
[12:13:56] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 46 minute(s)
[12:13:57] <jouncebot>	 In 0 hour(s) and 46 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300)
[12:13:57] <jouncebot>	 In 0 hour(s) and 46 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300)
[12:14:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:14:21] <wikibugs>	 (CR) Clément Goubert: [C: +2] mediawiki: Add egress rules for termbox-test [deployment-charts] - https://gerrit.wikimedia.org/r/952191 (https://phabricator.wikimedia.org/T344904) (owner: Clément Goubert)
[12:15:21] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Add egress rules for termbox-test [deployment-charts] - https://gerrit.wikimedia.org/r/952191 (https://phabricator.wikimedia.org/T344904) (owner: Clément Goubert)
[12:15:52] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] statsite: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/952162 (owner: Muehlenhoff)
[12:16:38] <logmsgbot>	 !log cgoubert@deploy1002 Started scap: Redeploying mw-on-k8s - T344904
[12:16:43] <stashbot>	 T344904: Termbox SSR broken on Test Wikidata (since k8s migration? unclear) - https://phabricator.wikimedia.org/T344904
[12:17:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P51310 and previous config saved to /var/cache/conftool/dbconfig/20230824-121725-ladsgroup.json
[12:18:06] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:18:45] <wikibugs>	 (PS1) Btullis: Disable gobblin and refine jobs temporaily on an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/952200 (https://phabricator.wikimedia.org/T325132)
[12:18:46] <logmsgbot>	 !log cgoubert@deploy1002 Finished scap: Redeploying mw-on-k8s - T344904 (duration: 02m 07s)
[12:19:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T344589)', diff saved to https://phabricator.wikimedia.org/P51311 and previous config saved to /var/cache/conftool/dbconfig/20230824-121930-ladsgroup.json
[12:20:42] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43009/console" [puppet] - https://gerrit.wikimedia.org/r/952200 (https://phabricator.wikimedia.org/T325132) (owner: Btullis)
[12:21:57] <wikibugs>	 (CR) Joal: [C: +1] Disable gobblin and refine jobs temporaily on an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/952200 (https://phabricator.wikimedia.org/T325132) (owner: Btullis)
[12:22:21] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43010/console" [puppet] - https://gerrit.wikimedia.org/r/952200 (https://phabricator.wikimedia.org/T325132) (owner: Btullis)
[12:22:46] <wikibugs>	 (PS1) Ladsgroup: Stop writing to old extlinks columns in s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/952202 (https://phabricator.wikimedia.org/T342683)
[12:23:46] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43011/console" [puppet] - https://gerrit.wikimedia.org/r/952200 (https://phabricator.wikimedia.org/T325132) (owner: Btullis)
[12:24:20] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] benthos: add instance for calculating MW latencies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[12:24:39] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Disable gobblin and refine jobs temporaily on an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/952200 (https://phabricator.wikimedia.org/T325132) (owner: Btullis)
[12:25:23] <fabfur>	 !log disabling puppet and pybal on lvs1020 for reboot (T344587) 
[12:25:26] <wikibugs>	 (PS1) Clément Goubert: mw-debug: Use global mw egress [deployment-charts] - https://gerrit.wikimedia.org/r/952203 (https://phabricator.wikimedia.org/T344904)
[12:25:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P51312 and previous config saved to /var/cache/conftool/dbconfig/20230824-122530-ladsgroup.json
[12:25:45] <fabfur>	 !log errata corrige: not lvs1020 but lvs1018
[12:25:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:51] <wikibugs>	 (CR) Kamila Součková: [V: +1 C: +2] benthos: add instance for calculating MW latencies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[12:26:51] <wikibugs>	 (CR) Muehlenhoff: [C: +2] statsite: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/952162 (owner: Muehlenhoff)
[12:26:53] <vgutierrez>	 fabfur: s/lvs1020/lvs1018/ is probably better understood here than Latin :)
[12:27:14] <wikibugs>	 (CR) Muehlenhoff: [C: +2] os-reports: Remove Stretch, add stub entry for Bullseye [puppet] - https://gerrit.wikimedia.org/r/952195 (owner: Muehlenhoff)
[12:28:33] <wikibugs>	 (PS2) Clément Goubert: mw-debug: Copy global mw egress [deployment-charts] - https://gerrit.wikimedia.org/r/952203 (https://phabricator.wikimedia.org/T344904)
[12:28:47] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[12:28:53] <jinxer-wm>	 (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[12:29:59] <jinxer-wm>	 (RedisMemoryFull) resolved: (2) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[12:30:32] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: gobblin-webrequest.service,refine_event.service,refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[12:31:36] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal
[12:32:08] <icinga-wm>	 PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[12:32:24] <wikibugs>	 (CR) Clément Goubert: [C: +2] mw-debug: Copy global mw egress [deployment-charts] - https://gerrit.wikimedia.org/r/952203 (https://phabricator.wikimedia.org/T344904) (owner: Clément Goubert)
[12:32:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P51313 and previous config saved to /var/cache/conftool/dbconfig/20230824-123231-ladsgroup.json
[12:33:20] <wikibugs>	 (Merged) jenkins-bot: mw-debug: Copy global mw egress [deployment-charts] - https://gerrit.wikimedia.org/r/952203 (https://phabricator.wikimedia.org/T344904) (owner: Clément Goubert)
[12:34:23] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[12:34:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P51314 and previous config saved to /var/cache/conftool/dbconfig/20230824-123436-ladsgroup.json
[12:34:43] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[12:34:48] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[12:34:56] <wikibugs>	 (CR) EoghanGaffney: [C: +2] doc: rename user for rsyncing docs [puppet] - https://gerrit.wikimedia.org/r/952189 (owner: Jaime Nuche)
[12:35:45] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[12:39:55] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-coord1001.eqiad.wmnet
[12:40:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P51315 and previous config saved to /var/cache/conftool/dbconfig/20230824-124036-ladsgroup.json
[12:43:41] <wikibugs>	 (CR) JMeybohm: [C: +1] mesh: Add concurrency control for envoy workers [deployment-charts] - https://gerrit.wikimedia.org/r/952158 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[12:43:47] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) resolved: (2) Device cr1-eqiad.wikimedia.org recovered from Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[12:44:07] <wikibugs>	 (CR) JMeybohm: [C: +1] mediawiki: Remove limits for tls-proxy container [deployment-charts] - https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[12:45:48] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1001.eqiad.wmnet
[12:45:55] <wikibugs>	 (PS1) Marostegui: wmnet: Failover m1-master [dns] - https://gerrit.wikimedia.org/r/952204
[12:47:13] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:47:37] <wikibugs>	 (PS1) JMeybohm: jaeger: Fix label selector for es-index-cleaner job [deployment-charts] - https://gerrit.wikimedia.org/r/952205 (https://phabricator.wikimedia.org/T344253)
[12:47:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T343718)', diff saved to https://phabricator.wikimedia.org/P51316 and previous config saved to /var/cache/conftool/dbconfig/20230824-124737-ladsgroup.json
[12:47:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[12:47:41] <wikibugs>	 (PS1) Btullis: Re-enable gobblin, refine, and other jobs on an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/952206
[12:47:47] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[12:47:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[12:47:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51317 and previous config saved to /var/cache/conftool/dbconfig/20230824-124758-ladsgroup.json
[12:48:46] <effie>	 !log depool kartotherian in eqiad
[12:48:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:31] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43012/console" [puppet] - https://gerrit.wikimedia.org/r/952206 (owner: Btullis)
[12:49:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P51318 and previous config saved to /var/cache/conftool/dbconfig/20230824-124942-ladsgroup.json
[12:49:46] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Re-enable gobblin, refine, and other jobs on an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/952206 (owner: Btullis)
[12:49:48] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad
[12:54:28] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1018.eqiad.wmnet
[12:55:02] <wikibugs>	 (CR) Ayounsi: [C: +1] devices: add doh300[34] to asw1-b*27-esams [homer/public] - https://gerrit.wikimedia.org/r/951581 (https://phabricator.wikimedia.org/T344355) (owner: Ssingh)
[12:55:27] <wikibugs>	 (CR) JMeybohm: [C: +2] jaeger: Fix label selector for es-index-cleaner job [deployment-charts] - https://gerrit.wikimedia.org/r/952205 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[12:55:30] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:55:40] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:55:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T344589)', diff saved to https://phabricator.wikimedia.org/P51319 and previous config saved to /var/cache/conftool/dbconfig/20230824-125542-ladsgroup.json
[12:55:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance
[12:56:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance
[12:56:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T344589)', diff saved to https://phabricator.wikimedia.org/P51320 and previous config saved to /var/cache/conftool/dbconfig/20230824-125607-ladsgroup.json
[12:56:08] <wikibugs>	 (Merged) jenkins-bot: jaeger: Fix label selector for es-index-cleaner job [deployment-charts] - https://gerrit.wikimedia.org/r/952205 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[12:56:16] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:56:40] <icinga-wm>	 PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:56:58] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:57:02] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:57:09] <wikibugs>	 (CR) Ssingh: [C: +2] devices: add doh300[34] to asw1-b*27-esams [homer/public] - https://gerrit.wikimedia.org/r/951581 (https://phabricator.wikimedia.org/T344355) (owner: Ssingh)
[12:57:14] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:57:44] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1018.eqiad.wmnet
[12:57:46] <icinga-wm>	 PROBLEM - Host lvs1018 is DOWN: PING CRITICAL - Packet loss = 100%
[12:57:54] <icinga-wm>	 RECOVERY - Host lvs1018 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[12:58:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[12:58:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[12:58:47] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[12:59:04] <sukhe>	 !log running homer "asw1-b*27-esams*" commit "add doh300[34]"
[12:59:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:14] <icinga-wm>	 PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[12:59:38] <jynus>	 5XX increased a lot, are we ok?
[12:59:48] <effie>	 yes, it is maps testing 
[13:00:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:00:02] <effie>	 I will pool back eqiad in a tiny pit
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:05] <effie>	 bit*
[13:00:25] <jynus>	 ok, sorry, I didn't have context for where those came from
[13:00:40] <icinga-wm>	 RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[13:02:09] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad
[13:02:20] <wikibugs>	 (PS1) BBlack: esams frontend memory: upload floor at least 120 [puppet] - https://gerrit.wikimedia.org/r/952207
[13:02:35] <jynus>	 I confirm it is kartotherian only: https://grafana.wikimedia.org/goto/02ZTbEgSz?orgId=1
[13:02:52] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 7.000 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:03:00] <wikibugs>	 (PS2) Muehlenhoff: Fail over URL downloaders for reboot [dns] - https://gerrit.wikimedia.org/r/952117
[13:03:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[13:03:31] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Failover m1-master [dns] - https://gerrit.wikimedia.org/r/952204 (owner: Marostegui)
[13:03:45] <marostegui>	 !log failover m1-master to dbproxy1022
[13:03:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:01] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw
[13:04:08] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 3.913 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:04:10] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 34 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal
[13:04:21] <logmsgbot>	 !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@c579111] (releasing): (no justification provided)
[13:04:42] <fabfur>	 !log puppet and pybal reenabled on lvs1018 (T344587) 
[13:04:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T344589)', diff saved to https://phabricator.wikimedia.org/P51321 and previous config saved to /var/cache/conftool/dbconfig/20230824-130446-ladsgroup.json
[13:04:55] <wikibugs>	 (CR) Ssingh: [C: +1] esams frontend memory: upload floor at least 120 [puppet] - https://gerrit.wikimedia.org/r/952207 (owner: BBlack)
[13:04:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T344589)', diff saved to https://phabricator.wikimedia.org/P51322 and previous config saved to /var/cache/conftool/dbconfig/20230824-130455-ladsgroup.json
[13:05:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[13:05:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[13:05:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T344589)', diff saved to https://phabricator.wikimedia.org/P51323 and previous config saved to /var/cache/conftool/dbconfig/20230824-130519-ladsgroup.json
[13:05:41] <wikibugs>	 (CR) BBlack: [C: +2] esams frontend memory: upload floor at least 120 [puppet] - https://gerrit.wikimedia.org/r/952207 (owner: BBlack)
[13:05:49] <logmsgbot>	 !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@c579111] (releasing): (no justification provided) (duration: 01m 27s)
[13:07:16] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:07:20] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 7.899 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:08:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[13:08:06] <icinga-wm>	 RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 0.363 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:08:20] <bblack>	 !log cp3074: restart varnish frontend (changing malloc storage from https://gerrit.wikimedia.org/r/c/operations/puppet/+/952207/ )
[13:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:24] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:08:27] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[13:08:28] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:08:32] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:09:08] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:10:38] <wikibugs>	 (PS1) Stevemunene: datahub: set preferred oidc jwt algotithm [deployment-charts] - https://gerrit.wikimedia.org/r/952208 (https://phabricator.wikimedia.org/T305874)
[13:11:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T344589)', diff saved to https://phabricator.wikimedia.org/P51324 and previous config saved to /var/cache/conftool/dbconfig/20230824-131117-ladsgroup.json
[13:11:26] <logmsgbot>	 !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@c579111] (releasing): (no justification provided)
[13:11:42] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqsin%20prometheus/ops&var-cluster=upload&var-origin=kartotherian.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHi
[13:11:47] <logmsgbot>	 !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@c579111] (releasing): (no justification provided) (duration: 00m 21s)
[13:12:07] <eoghan>	 acked. Looking
[13:13:28] <wikibugs>	 (PS1) Andrew Bogott: Disable user creation on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/952209
[13:13:30] <akosiaris>	 effie ^
[13:13:51] <wikibugs>	 (PS3) Jbond: puppetdb-api-microservice: redact one the puppetdb side [puppet] - https://gerrit.wikimedia.org/r/951965 (https://phabricator.wikimedia.org/T342458)
[13:14:03] <akosiaris>	 eoghan: I think that's the result of depooling eqiad to make sure that codfw could sustain the entire load (it apparently didn't)
[13:14:05] <effie>	 akosiaris:  it will recover 
[13:14:06] <wikibugs>	 (CR) CI reject: [V: -1] Disable user creation on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/952209 (owner: Andrew Bogott)
[13:14:19] <eoghan>	 akosiaris: Good to know!
[13:14:21] <eoghan>	 It's recovered anyway.
[13:14:26] <wikibugs>	 (CR) Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP (3 comments) [puppet] - https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney)
[13:14:28] <effie>	 eoghan: give it some time, sadly we are digging into fixing some things
[13:14:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:14:39] <wikibugs>	 (PS3) Muehlenhoff: Fail over URL downloaders for reboot [dns] - https://gerrit.wikimedia.org/r/952117
[13:14:44] <eoghan>	 effie: No problem! Good luck. Let us know if we can help
[13:15:23] <wikibugs>	 (PS1) Jbond: update color scheme [puppet] - https://gerrit.wikimedia.org/r/952210
[13:15:46] <wikibugs>	 (PS2) Jbond: update color scheme [puppet] - https://gerrit.wikimedia.org/r/952210
[13:15:50] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] update color scheme [puppet] - https://gerrit.wikimedia.org/r/952210 (owner: Jbond)
[13:15:52] <wikibugs>	 (CR) Muehlenhoff: Disable user creation on wikitech (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/952209 (owner: Andrew Bogott)
[13:16:42] <jinxer-wm>	 (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqsin%20prometheus/ops&var-cluster=upload&var-origin=kartotherian.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrors
[13:18:08] <wikibugs>	 (CR) Ssingh: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/952161 (owner: Muehlenhoff)
[13:19:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:19:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P51325 and previous config saved to /var/cache/conftool/dbconfig/20230824-131952-ladsgroup.json
[13:20:12] <wikibugs>	 (CR) Vgutierrez: [C: +1] "looks good, please take into account that data persistence and cloud services are also big users of HAProxy here." [puppet] - https://gerrit.wikimedia.org/r/952161 (owner: Muehlenhoff)
[13:22:03] <wikibugs>	 (PS2) Andrew Bogott: Disable user creation on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/952209
[13:22:10] <wikibugs>	 (PS2) Btullis: Fail back hive to the primary coordinator [dns] - https://gerrit.wikimedia.org/r/952194 (https://phabricator.wikimedia.org/T344671)
[13:23:05] <fabfur>	 !log disabling puppet and pybal on lvs1017 for reboot (T344587) 
[13:23:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:20] <wikibugs>	 (PS1) JMeybohm: jaeger: Run index cleaner daily not hourly [deployment-charts] - https://gerrit.wikimedia.org/r/952211 (https://phabricator.wikimedia.org/T344253)
[13:24:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[13:25:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51326 and previous config saved to /var/cache/conftool/dbconfig/20230824-132504-ladsgroup.json
[13:25:09] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[13:26:10] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:26:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P51327 and previous config saved to /var/cache/conftool/dbconfig/20230824-132623-ladsgroup.json
[13:26:32] <wikibugs>	 (PS1) JMeybohm: jeager: Rename default release from jaeger to main [deployment-charts] - https://gerrit.wikimedia.org/r/952212 (https://phabricator.wikimedia.org/T344253)
[13:26:42] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Agreed! Nicely spotted" [deployment-charts] - https://gerrit.wikimedia.org/r/952211 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[13:26:49] <wikibugs>	 (PS4) Jbond: puppetdb-api-microservice: redact one the puppetdb side [puppet] - https://gerrit.wikimedia.org/r/951965 (https://phabricator.wikimedia.org/T342458)
[13:27:14] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[13:27:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[13:27:52] <icinga-wm>	 PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[13:28:27] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] jeager: Rename default release from jaeger to main [deployment-charts] - https://gerrit.wikimedia.org/r/952212 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[13:28:40] <wikibugs>	 (CR) JMeybohm: [C: +2] jaeger: Run index cleaner daily not hourly [deployment-charts] - https://gerrit.wikimedia.org/r/952211 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[13:28:44] <wikibugs>	 (CR) JMeybohm: [C: +2] jeager: Rename default release from jaeger to main [deployment-charts] - https://gerrit.wikimedia.org/r/952212 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[13:29:08] <wikibugs>	 (CR) CI reject: [V: -1] puppetdb-api-microservice: redact one the puppetdb side [puppet] - https://gerrit.wikimedia.org/r/951965 (https://phabricator.wikimedia.org/T342458) (owner: Jbond)
[13:29:10] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fail over URL downloaders for reboot [dns] - https://gerrit.wikimedia.org/r/952117 (owner: Muehlenhoff)
[13:29:27] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[13:29:37] <wikibugs>	 (Merged) jenkins-bot: jaeger: Run index cleaner daily not hourly [deployment-charts] - https://gerrit.wikimedia.org/r/952211 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[13:29:44] <wikibugs>	 (Merged) jenkins-bot: jeager: Rename default release from jaeger to main [deployment-charts] - https://gerrit.wikimedia.org/r/952212 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[13:31:15] <wikibugs>	 (Abandoned) Jbond: (WIP) puppetdb-microservice: update puppetdb micro service so it streams data [puppet] - https://gerrit.wikimedia.org/r/940403 (https://phabricator.wikimedia.org/T342458) (owner: Jbond)
[13:32:46] <wikibugs>	 (PS5) Jbond: puppetdb-api-microservice: redact one the puppetdb side [puppet] - https://gerrit.wikimedia.org/r/951965 (https://phabricator.wikimedia.org/T342458)
[13:33:10] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[13:33:44] <wikibugs>	 (CR) Ssingh: wmf-config: remove public subnets from reverse-proxy.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) (owner: Ssingh)
[13:34:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P51328 and previous config saved to /var/cache/conftool/dbconfig/20230824-133458-ladsgroup.json
[13:35:07] <wikibugs>	 (CR) Jbond: [C: +2] puppetdb-api-microservice: redact one the puppetdb side [puppet] - https://gerrit.wikimedia.org/r/951965 (https://phabricator.wikimedia.org/T342458) (owner: Jbond)
[13:35:54] <wikibugs>	 (PS2) Ssingh: wmf-config: remove public subnets from reverse-proxy.php [mediawiki-config] - https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704)
[13:36:40] <wikibugs>	 (PS1) Marostegui: wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/952213
[13:37:46] <marostegui>	 !log failover m2-master to dbproxy1023
[13:37:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:50] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/952213 (owner: Marostegui)
[13:39:01] <wikibugs>	 (CR) Muehlenhoff: [C: +2] firewall::service: Create an nftables::service when using the nft provider [puppet] - https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[13:40:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P51329 and previous config saved to /var/cache/conftool/dbconfig/20230824-134010-ladsgroup.json
[13:41:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P51330 and previous config saved to /var/cache/conftool/dbconfig/20230824-134129-ladsgroup.json
[13:41:31] <wikibugs>	 (CR) Muehlenhoff: [C: +2] profile::environment: Simplify environment variable export [puppet] - https://gerrit.wikimedia.org/r/952150 (owner: Muehlenhoff)
[13:42:16] <wikibugs>	 (PS1) JMeybohm: jeager: Fix secret name (generated by Certificate objects) [deployment-charts] - https://gerrit.wikimedia.org/r/952214 (https://phabricator.wikimedia.org/T344253)
[13:42:31] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[13:42:33] <wikibugs>	 (Abandoned) Muehlenhoff: Adapt monitoring/metrics rules for nft and ferm providers [puppet] - https://gerrit.wikimedia.org/r/951512 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[13:42:42] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[13:42:53] <_joe_>	 jouncebot: nowandnext
[13:42:53] <jouncebot>	 For the next 0 hour(s) and 17 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300)
[13:42:54] <jouncebot>	 For the next 0 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300)
[13:42:54] <jouncebot>	 In 2 hour(s) and 17 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1600)
[13:43:17] <wikibugs>	 (PS2) Muehlenhoff: firewall::service: Replace whitespace in resource title with underscores [puppet] - https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497)
[13:43:49] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[13:43:50] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[13:43:55] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1017.eqiad.wmnet
[13:43:56] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[13:44:18] <wikibugs>	 (PS1) Btullis: query_service: let puppet manage whitelist [puppet] - https://gerrit.wikimedia.org/r/952215 (https://phabricator.wikimedia.org/T343856)
[13:44:39] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952215 (https://phabricator.wikimedia.org/T343856) (owner: Btullis)
[13:44:48] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] jeager: Fix secret name (generated by Certificate objects) [deployment-charts] - https://gerrit.wikimedia.org/r/952214 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[13:45:51] <wikibugs>	 (PS9) Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485)
[13:45:53] <wikibugs>	 (CR) JMeybohm: [C: +2] jeager: Fix secret name (generated by Certificate objects) [deployment-charts] - https://gerrit.wikimedia.org/r/952214 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[13:46:02] <wikibugs>	 (PS3) Muehlenhoff: firewall::service: Replace whitespace in resource title with underscores [puppet] - https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497)
[13:46:10] <wikibugs>	 (CR) Muehlenhoff: firewall::service: Replace whitespace in resource title with underscores (1 comment) [puppet] - https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[13:46:25] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[13:46:27] <wikibugs>	 (CR) CI reject: [V: -1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney)
[13:46:28] <bblack>	 !log cp3075: restart varnish frontend (changing malloc storage from https://gerrit.wikimedia.org/r/c/operations/puppet/+/952207/ )
[13:46:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:37] <wikibugs>	 (Merged) jenkins-bot: jeager: Fix secret name (generated by Certificate objects) [deployment-charts] - https://gerrit.wikimedia.org/r/952214 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[13:47:01] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1017.eqiad.wmnet
[13:47:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[13:47:34] <wikibugs>	 (PS1) Jbond: puppetdb-api-microservice: need to convert current query to json [puppet] - https://gerrit.wikimedia.org/r/952216 (https://phabricator.wikimedia.org/T342458)
[13:47:39] <wikibugs>	 (PS2) Btullis: query_service: let puppet manage whitelist [puppet] - https://gerrit.wikimedia.org/r/952215 (https://phabricator.wikimedia.org/T343856)
[13:47:52] <icinga-wm>	 RECOVERY - pybal on lvs1017 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[13:47:54] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[13:48:09] <fabfur>	 !log enabled puppet and pybal on lvs1017  (T344587) 
[13:48:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:13] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952215 (https://phabricator.wikimedia.org/T343856) (owner: Btullis)
[13:48:24] <wikibugs>	 (CR) Jbond: [C: +2] puppetdb-api-microservice: need to convert current query to json [puppet] - https://gerrit.wikimedia.org/r/952216 (https://phabricator.wikimedia.org/T342458) (owner: Jbond)
[13:48:48] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:48:54] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[13:50:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T344589)', diff saved to https://phabricator.wikimedia.org/P51331 and previous config saved to /var/cache/conftool/dbconfig/20230824-135004-ladsgroup.json
[13:50:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[13:50:19] <wikibugs>	 SRE, ops-eqiad, DBA: db1178 didn't come back online after reboot - https://phabricator.wikimedia.org/T344880 (Jclark-ctr) a:Jclark-ctr
[13:50:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[13:50:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[13:50:40] <wikibugs>	 (CR) Jbond: [C: +1] firewall::service: Replace whitespace in resource title with underscores [puppet] - https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[13:51:20] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[13:52:04] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952215 (https://phabricator.wikimedia.org/T343856) (owner: Btullis)
[13:53:31] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[13:53:41] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[13:53:58] <wikibugs>	 SRE, ops-eqiad, DBA: db1178 didn't come back online after reboot - https://phabricator.wikimedia.org/T344880 (Jclark-ctr) Server is in a boot loop; troubleshooting now
[13:54:04] <claime>	 jouncebot: nowandnext
[13:54:04] <jouncebot>	 For the next 0 hour(s) and 5 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300)
[13:54:04] <jouncebot>	 For the next 0 hour(s) and 5 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300)
[13:54:04] <jouncebot>	 In 2 hour(s) and 5 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1600)
[13:54:19] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[13:54:22] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[13:54:23] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[13:54:26] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[13:54:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance
[13:54:27] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[13:54:31] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[13:54:32] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[13:54:34] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952003 (https://phabricator.wikimedia.org/T344868) (owner: JHathaway)
[13:54:34] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[13:54:35] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[13:54:39] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[13:54:40] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[13:54:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance
[13:54:56] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[13:54:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T344589)', diff saved to https://phabricator.wikimedia.org/P51332 and previous config saved to /var/cache/conftool/dbconfig/20230824-135456-ladsgroup.json
[13:54:57] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[13:55:04] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (Rmaung) Resolved→Open
[13:55:10] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[13:55:10] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[13:55:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P51333 and previous config saved to /var/cache/conftool/dbconfig/20230824-135516-ladsgroup.json
[13:55:20] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[13:55:21] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply
[13:55:32] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply
[13:55:33] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply
[13:55:40] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply
[13:56:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T344589)', diff saved to https://phabricator.wikimedia.org/P51334 and previous config saved to /var/cache/conftool/dbconfig/20230824-135636-ladsgroup.json
[13:56:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[13:56:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[13:57:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T344589)', diff saved to https://phabricator.wikimedia.org/P51335 and previous config saved to /var/cache/conftool/dbconfig/20230824-135659-ladsgroup.json
[13:58:47] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (ayounsi) Cool, thanks for the details, makes sense to use `prefix-limit` with `teardown` then, maybe some timeout so it automatically recovers and double check ou...
[13:59:54] <wikibugs>	 (CR) Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: Alexandros Kosiaris)
[14:00:01] <wikibugs>	 (PS11) Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843)
[14:00:03] <wikibugs>	 (PS12) Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117)
[14:00:05] <wikibugs>	 (PS12) Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117)
[14:00:21] <wikibugs>	 (CR) Clément Goubert: [C: +1] mediawiki::php: Remove check [puppet] - https://gerrit.wikimedia.org/r/952197 (owner: Muehlenhoff)
[14:00:53] <wikibugs>	 (CR) CI reject: [V: -1] modules: Add a new networkpolicy for base modules [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: Alexandros Kosiaris)
[14:00:55] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[14:00:57] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[14:02:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T344589)', diff saved to https://phabricator.wikimedia.org/P51336 and previous config saved to /var/cache/conftool/dbconfig/20230824-140218-ladsgroup.json
[14:02:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T344589)', diff saved to https://phabricator.wikimedia.org/P51337 and previous config saved to /var/cache/conftool/dbconfig/20230824-140226-ladsgroup.json
[14:02:41] <wikibugs>	 (PS1) JMeybohm: jeager: Temporarily lower the lifetime of TLS certs to 2 days [deployment-charts] - https://gerrit.wikimedia.org/r/952220 (https://phabricator.wikimedia.org/T344253)
[14:02:56] <wikibugs>	 (PS1) Bking: wdqs: Add allowlist.txt [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856)
[14:02:59] <wikibugs>	 (PS12) Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843)
[14:03:01] <wikibugs>	 (PS13) Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117)
[14:03:03] <wikibugs>	 (PS13) Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117)
[14:03:14] <wikibugs>	 (PS2) Ssingh: dnsrecursor: use validate_cmd for pdns-recursor config [puppet] - https://gerrit.wikimedia.org/r/937139
[14:03:35] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[14:03:46] <wikibugs>	 (CR) CI reject: [V: -1] modules: Add a new networkpolicy for base modules [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: Alexandros Kosiaris)
[14:03:48] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[14:03:52] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[14:05:11] <wikibugs>	 (CR) Ssingh: "A bit split about this because while I think this is important, we also ran into a bunch of issues with the durum hosts. In any case, my v" [puppet] - https://gerrit.wikimedia.org/r/937139 (owner: Ssingh)
[14:06:14] <wikibugs>	 (CR) Jbond: "recheck" [puppet] - https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney)
[14:06:50] <wikibugs>	 SRE, Observability-Logging, Wikimedia-Logstash, observability: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (colewhite)
[14:07:10] <wikibugs>	 SRE, Observability-Logging, Wikimedia-Logstash, observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (colewhite) Open→Resolved a:colewhite
[14:08:53] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:09:00] <wikibugs>	 (PS1) Muehlenhoff: debian: Remove support for Stretch and update spec tests [puppet] - https://gerrit.wikimedia.org/r/952222
[14:09:41] <wikibugs>	 (PS2) Bking: wdqs: Add allowlist.txt [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856)
[14:10:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51338 and previous config saved to /var/cache/conftool/dbconfig/20230824-141022-ladsgroup.json
[14:10:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[14:10:30] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[14:10:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[14:10:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51339 and previous config saved to /var/cache/conftool/dbconfig/20230824-141043-ladsgroup.json
[14:10:47] <wikibugs>	 (CR) Muehlenhoff: [C: +2] mediawiki::php: Remove check [puppet] - https://gerrit.wikimedia.org/r/952197 (owner: Muehlenhoff)
[14:11:19] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[14:11:41] <wikibugs>	 (CR) CI reject: [V: -1] debian: Remove support for Stretch and update spec tests [puppet] - https://gerrit.wikimedia.org/r/952222 (owner: Muehlenhoff)
[14:12:53] <wikibugs>	 (PS1) Effie Mouzeli: tegola-vector-tiles: bump CPU [deployment-charts] - https://gerrit.wikimedia.org/r/952223
[14:17:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] jeager: Temporarily lower the lifetime of TLS certs to 2 days [deployment-charts] - https://gerrit.wikimedia.org/r/952220 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[14:17:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P51340 and previous config saved to /var/cache/conftool/dbconfig/20230824-141725-ladsgroup.json
[14:17:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P51341 and previous config saved to /var/cache/conftool/dbconfig/20230824-141733-ladsgroup.json
[14:18:52] <wikibugs>	 (PS1) Ssingh: test_dns: add new DNS hosts in esams doh300[34] [software/knead-wikidough] - https://gerrit.wikimedia.org/r/952225
[14:18:53] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:20:21] <wikibugs>	 (CR) Ssingh: [C: +2] test_dns: add new DNS hosts in esams doh300[34] [software/knead-wikidough] - https://gerrit.wikimedia.org/r/952225 (owner: Ssingh)
[14:21:46] <moritzm>	 !log installing openssl security updates on buster
[14:21:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:39] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:esams and A:wikidough
[14:25:07] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T344872 (Jhancock.wm) a:Jhancock.wm
[14:25:25] <wikibugs>	 (PS13) Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843)
[14:25:27] <wikibugs>	 (PS14) Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117)
[14:25:29] <wikibugs>	 (PS14) Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117)
[14:26:09] <wikibugs>	 (CR) CI reject: [V: -1] modules: Add a new networkpolicy for base modules [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: Alexandros Kosiaris)
[14:26:14] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] tegola-vector-tiles: bump CPU [deployment-charts] - https://gerrit.wikimedia.org/r/952223 (owner: Effie Mouzeli)
[14:26:17] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[14:26:26] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[14:27:03] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] tegola-vector-tiles: bump CPU [deployment-charts] - https://gerrit.wikimedia.org/r/952223 (owner: Effie Mouzeli)
[14:28:24] <wikibugs>	 (CR) Jforrester: mathoid: pipeline bot promote (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/906694 (owner: PipelineBot)
[14:28:40] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-categories- on wdqs1009 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:28:42] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-categories on wdqs1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:29:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P51342 and previous config saved to /var/cache/conftool/dbconfig/20230824-142900-ladsgroup.json
[14:30:40] <wikibugs>	 (PS2) Jforrester: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/890357 (owner: PipelineBot)
[14:30:42] <wikibugs>	 (PS2) Jforrester: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/906694 (owner: PipelineBot)
[14:30:44] <wikibugs>	 (PS2) Jforrester: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/919375 (owner: PipelineBot)
[14:30:45] <wikibugs>	 SRE, ops-eqiad, DBA: db1178 didn't come back online after reboot - https://phabricator.wikimedia.org/T344880 (Jclark-ctr) Server is out of warranty. Pulled DIMM from recently decom server and replaced. A7.
[14:31:20] <wikibugs>	 SRE, ops-eqiad, DBA: db1178 didn't come back online after reboot - https://phabricator.wikimedia.org/T344880 (Jclark-ctr) Open→Resolved Server is back up and running
[14:31:28] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:31:39] <wikibugs>	 (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/930833 (owner: PipelineBot)
[14:31:46] <sukhe>	 ^ expected
[14:31:50] <moritzm>	 !log restarting FPM on mw canaries
[14:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:11] <wikibugs>	 SRE, ops-eqiad, DBA: db1178 didn't come back online after reboot - https://phabricator.wikimedia.org/T344880 (Ladsgroup) Thanks for fast fix. I really appreciate it.
[14:32:19] <wikibugs>	 (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/935141 (owner: PipelineBot)
[14:32:30] <wikibugs>	 (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/935142 (owner: PipelineBot)
[14:32:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P51343 and previous config saved to /var/cache/conftool/dbconfig/20230824-143231-ladsgroup.json
[14:32:35] <wikibugs>	 (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/935132 (owner: PipelineBot)
[14:32:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P51344 and previous config saved to /var/cache/conftool/dbconfig/20230824-143239-ladsgroup.json
[14:35:48] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:36:50] <wikibugs>	 ops-codfw, Content-Transform-Team, serviceops-radar, Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (Jhancock.wm) Turns out there is one more thing I need to do too. I missed a firmware update. Is it safe for me to reboot at this time?
[14:38:03] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] tegola-vector-tiles: bump CPU [deployment-charts] - https://gerrit.wikimedia.org/r/952223 (owner: Effie Mouzeli)
[14:38:47] <wikibugs>	 (Merged) jenkins-bot: tegola-vector-tiles: bump CPU [deployment-charts] - https://gerrit.wikimedia.org/r/952223 (owner: Effie Mouzeli)
[14:38:52] <wikibugs>	 (PS1) Jelto: miscweb: migrate bugzilla image to GitLab [deployment-charts] - https://gerrit.wikimedia.org/r/952228 (https://phabricator.wikimedia.org/T343914)
[14:39:36] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:40:16] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:41:07] <wikibugs>	 (CR) Muehlenhoff: [C: +2] firewall::service: Replace whitespace in resource title with underscores [puppet] - https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[14:43:08] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:43:47] <wikibugs>	 (PS10) Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485)
[14:43:54] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:44:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P51345 and previous config saved to /var/cache/conftool/dbconfig/20230824-144404-ladsgroup.json
[14:44:19] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1016.eqiad.wmnet
[14:44:32] <wikibugs>	 (CR) CI reject: [V: -1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney)
[14:47:28] <wikibugs>	 (Abandoned) Jdrewniak: Launch content separation Zebra AB Test [mediawiki-config] - https://gerrit.wikimedia.org/r/918568 (https://phabricator.wikimedia.org/T335972) (owner: Kimberly Sarabia)
[14:47:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T344589)', diff saved to https://phabricator.wikimedia.org/P51346 and previous config saved to /var/cache/conftool/dbconfig/20230824-144737-ladsgroup.json
[14:47:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[14:47:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T344589)', diff saved to https://phabricator.wikimedia.org/P51347 and previous config saved to /var/cache/conftool/dbconfig/20230824-144745-ladsgroup.json
[14:47:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance
[14:47:55] <wikibugs>	 (CR) Jelto: [C: +2] P:gitlab::runner: Do not schedule untagged jobs on WMCS [puppet] - https://gerrit.wikimedia.org/r/952017 (https://phabricator.wikimedia.org/T344874) (owner: Dduvall)
[14:47:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[14:48:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T344589)', diff saved to https://phabricator.wikimedia.org/P51348 and previous config saved to /var/cache/conftool/dbconfig/20230824-144801-ladsgroup.json
[14:48:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance
[14:48:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T344589)', diff saved to https://phabricator.wikimedia.org/P51349 and previous config saved to /var/cache/conftool/dbconfig/20230824-144810-ladsgroup.json
[14:49:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51350 and previous config saved to /var/cache/conftool/dbconfig/20230824-144903-ladsgroup.json
[14:49:09] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[14:49:13] <wikibugs>	 (PS11) Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485)
[14:49:34] <wikibugs>	 (Abandoned) MSantos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/951877 (owner: PipelineBot)
[14:49:56] <wikibugs>	 (CR) Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP (3 comments) [puppet] - https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney)
[14:50:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:51:46] <wikibugs>	 (CR) CI reject: [V: -1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney)
[14:52:04] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[14:52:40] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[14:52:56] <wikibugs>	 (Abandoned) MSantos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/934464 (owner: PipelineBot)
[14:53:06] <wikibugs>	 (Abandoned) MSantos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/935868 (owner: PipelineBot)
[14:53:14] <moritzm>	 !log installing poppler security updates
[14:53:16] <wikibugs>	 (Abandoned) MSantos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/935882 (owner: PipelineBot)
[14:53:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T344589)', diff saved to https://phabricator.wikimedia.org/P51351 and previous config saved to /var/cache/conftool/dbconfig/20230824-145317-ladsgroup.json
[14:53:22] <wikibugs>	 (CR) JMeybohm: [C: +1] miscweb: migrate bugzilla image to GitLab [deployment-charts] - https://gerrit.wikimedia.org/r/952228 (https://phabricator.wikimedia.org/T343914) (owner: Jelto)
[14:53:26] <wikibugs>	 (Abandoned) MSantos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/939283 (owner: PipelineBot)
[14:53:35] <wikibugs>	 (Abandoned) MSantos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/940220 (owner: PipelineBot)
[14:53:41] <wikibugs>	 (CR) JMeybohm: [C: +2] jeager: Temporarily lower the lifetime of TLS certs to 2 days [deployment-charts] - https://gerrit.wikimedia.org/r/952220 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[14:54:06] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1016.eqiad.wmnet
[14:54:09] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1019.eqiad.wmnet
[14:54:25] <wikibugs>	 (Merged) jenkins-bot: jeager: Temporarily lower the lifetime of TLS certs to 2 days [deployment-charts] - https://gerrit.wikimedia.org/r/952220 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[14:54:39] <wikibugs>	 (Abandoned) MSantos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/941912 (owner: PipelineBot)
[14:54:46] <wikibugs>	 (Abandoned) MSantos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/942791 (owner: PipelineBot)
[14:54:55] <wikibugs>	 (Abandoned) MSantos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/945010 (owner: PipelineBot)
[14:55:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:55:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T344589)', diff saved to https://phabricator.wikimedia.org/P51352 and previous config saved to /var/cache/conftool/dbconfig/20230824-145519-ladsgroup.json
[14:55:26] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[14:55:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[14:56:02] <wikibugs>	 (PS14) Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843)
[14:56:05] <wikibugs>	 (PS15) Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117)
[14:56:07] <wikibugs>	 (PS15) Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117)
[14:56:16] <wikibugs>	 (PS2) Giuseppe Lavagetto: httpd: fix ecs logging format [docker-images/production-images] - https://gerrit.wikimedia.org/r/952199
[14:56:58] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:59:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P51353 and previous config saved to /var/cache/conftool/dbconfig/20230824-145909-ladsgroup.json
[15:00:03] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:00:38] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubestagemaster1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:00:53] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+1] httpd: fix ecs logging format [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952199 (owner: 10Giuseppe Lavagetto)
[15:01:15] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[15:02:23] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:esams and A:wikidough
[15:02:23] <effie>	 !log pool kartotherian on codfw 
[15:02:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:59] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw
[15:03:04] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpd: fix ecs logging format [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952199 (owner: 10Giuseppe Lavagetto)
[15:03:34] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1019.eqiad.wmnet
[15:03:38] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1020.eqiad.wmnet
[15:04:09] <wikibugs>	 (03CR) 10Btullis: wdqs: Add allowlist.txt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking)
[15:04:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P51354 and previous config saved to /var/cache/conftool/dbconfig/20230824-150410-ladsgroup.json
[15:05:12] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: update enwiki articlequality model [deployment-charts] - 10https://gerrit.wikimedia.org/r/952230 (https://phabricator.wikimedia.org/T344895)
[15:05:28] <wikibugs>	 (03PS1) 10JMeybohm: jaeger: Fix typo in secretName [deployment-charts] - 10https://gerrit.wikimedia.org/r/952231 (https://phabricator.wikimedia.org/T344253)
[15:07:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:07:22] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:07:22] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:07:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:07:22] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:07:23] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:07:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:07:24] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:07:24] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:07:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:07:25] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:07:26] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:07:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:07:27] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:07:27] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:07:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:07:28] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:07:29] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:07:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:07:30] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:07:30] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:07:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:07:40] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:07:40] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:07:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:07:40] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:07:41] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:07:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:07:42] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:07:42] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:07:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:07:50] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:07:50] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:07:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:07:52] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:07:52] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:07:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:07:52] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:07:53] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:08:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:08:08] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:08:08] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:08:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:08:14] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:08:14] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:08:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi
[15:08:18] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med
[15:08:18] <icinga-wm>	 from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:08:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P51355 and previous config saved to /var/cache/conftool/dbconfig/20230824-150823-ladsgroup.json
[15:09:43] <jinxer-wm>	 (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[15:09:44] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[15:10:03] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] jaeger: Fix typo in secretName [deployment-charts] - 10https://gerrit.wikimedia.org/r/952231 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[15:10:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P51356 and previous config saved to /var/cache/conftool/dbconfig/20230824-151025-ladsgroup.json
[15:10:27] <wikibugs>	 (03PS1) 10Kamila Součková: benthos/mw_accesslog_metrics: fix parsing [puppet] - 10https://gerrit.wikimedia.org/r/952239 (https://phabricator.wikimedia.org/T276095)
[15:11:16] <wikibugs>	 (03Merged) 10jenkins-bot: jaeger: Fix typo in secretName [deployment-charts] - 10https://gerrit.wikimedia.org/r/952231 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[15:11:30] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] benthos/mw_accesslog_metrics: fix parsing [puppet] - 10https://gerrit.wikimedia.org/r/952239 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková)
[15:11:30] <vgutierrez>	 urandom, herron ^^ around?
[15:11:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:11:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:11:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:11:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:11:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:11:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:11:49] <herron>	 vgutierrez: yep
[15:11:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:12:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:12:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:12:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:12:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:12:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:12:11] <urandom>	 vgutierrez: aye
[15:12:25] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10Rmaung) Here is the ssh public key generated from the new machine:  ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIASWBpeq1Ju1EHwv5Jd7aupwy787kls1Az2ffAPWIPfJ reb...
[15:12:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:12:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:12:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:13:01] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wcqs,name=eqiad
[15:13:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:13:41] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[15:13:53] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:13:59] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[15:14:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P51357 and previous config saved to /var/cache/conftool/dbconfig/20230824-151414-ladsgroup.json
[15:14:43] <wikibugs>	 (03CR) 10Kamila Součková: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43014/console" [puppet] - 10https://gerrit.wikimedia.org/r/952239 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková)
[15:14:43] <jinxer-wm>	 (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[15:14:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[15:14:56] <_joe_>	 jouncebot: nowandnext
[15:14:56] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 45 minute(s)
[15:14:56] <jouncebot>	 In 0 hour(s) and 45 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1600)
[15:15:18] <effie>	 ok we have more traffic? 
[15:15:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:15:47] <_joe_>	 what is going on in production right now?
[15:15:49] <vgutierrez>	 effie: not really.. at least not in terms of restbase
[15:16:01] <effie>	 _joe_: I see more rps on apps 
[15:16:06] <wikibugs>	 (03CR) 10Kamila Součková: [V: 03+1 C: 03+2] benthos/mw_accesslog_metrics: fix parsing [puppet] - 10https://gerrit.wikimedia.org/r/952239 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková)
[15:16:08] <vgutierrez>	 but restbase started to return 500s by thousands
[15:16:10] <effie>	 and I see many http.error on NEL
[15:16:18] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host restbase1020.eqiad.wmnet
[15:16:22] <_joe_>	 effie: are you oncall right now?
[15:16:27] <effie>	 _joe_: no 
[15:16:50] <vgutierrez>	 that's https://grafana.wikimedia.org/goto/6gp7_PgIz?orgId=1
[15:16:54] <_joe_>	 there's clearly something wrong
[15:17:17] <_joe_>	 herron, urandom ^^ can you please check what is the pattern of requests?
[15:17:55] <_joe_>	 the restbase thing seems resolved
[15:18:07] <effie>	 we are recovering 
[15:18:11] <wikibugs>	 (03CR) 10AikoChou: [C: 03+1] ml-services: update enwiki articlequality model [deployment-charts] - 10https://gerrit.wikimedia.org/r/952230 (https://phabricator.wikimedia.org/T344895) (owner: 10Ilias Sarantopoulos)
[15:18:16] <urandom>	 I was doing a rolling reboot of restbase servers
[15:18:18] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:24] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED
[15:18:40] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] puppetserver: ensure correct ordering when using an intermediate cert [puppet] - 10https://gerrit.wikimedia.org/r/952003 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway)
[15:18:40] <vgutierrez>	 urandom: uh :)
[15:18:46] <urandom>	 why that would have caused the high rate of 500s is puzzling though
[15:18:53] <_joe_>	 so yeah, our baseline of requests is now 5k/s vs 3k/s before we repooled esams
[15:18:55] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED
[15:19:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:19:06] <_joe_>	 vgutierrez: did we move de and uk back?
[15:19:11] <bblack>	 we didn't yet
[15:19:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P51358 and previous config saved to /var/cache/conftool/dbconfig/20230824-151916-ladsgroup.json
[15:19:19] <_joe_>	 I would suggest we don't 
[15:19:22] <bblack>	 but the rates don't look as bad as yesterday
[15:19:40] <_joe_>	 bblack: yesterday we had a scraper, today we're at a baseline of 5k rps on appservers right now
[15:19:49] <bblack>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&viewPanel=65&from=now-7d&to=now&refresh=1m
[15:19:57] <_joe_>	 ah wait sorry, wrong graph
[15:19:59] <bblack>	 I'm just comparing to how things looked pre-esams.
[15:20:02] <_joe_>	 yeah it's less severe
[15:20:05] <bblack>	 other than this spike just now
[15:20:20] <bblack>	 a few days back, eqiad was peaking ~3k ish, now like 3.3k?
[15:20:23] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:20:30] <_joe_>	 yeah it's not that bad
[15:20:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:20:39] <logmsgbot>	 !log oblivian@deploy1002 Started scap: (no justification provided)
[15:20:57] <_joe_>	 this is not a true deployment, I'm just rebuilding the docker images
[15:21:48] <effie>	 !log depool kartotherian on eqiad 
[15:21:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED
[15:22:17] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad
[15:22:44] <wikibugs>	 (03PS1) 10Jdrewniak: watchlist: Don't assume only named users have watchlist access [skins/MinervaNeue] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/952130 (https://phabricator.wikimedia.org/T344870)
[15:23:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P51359 and previous config saved to /var/cache/conftool/dbconfig/20230824-152329-ladsgroup.json
[15:23:46] <hnowlan>	 looks like a complete drop off in requests to mobileapps from restbase in that last burst of errors: https://grafana-rw.wikimedia.org/d/5CmeRcnMz/mobileapps?forceLogin&from=now-30m&orgId=1&to=now&var-container_name=All&var-dc=thanos&var-prometheus=k8s&var-service=mobileapps&var-site=eqiad 
[15:24:02] <hnowlan>	 well no, sorry, big spike in errors 
[15:25:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P51360 and previous config saved to /var/cache/conftool/dbconfig/20230824-152531-ladsgroup.json
[15:25:41] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[15:26:13] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[15:26:38] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[15:27:12] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[15:27:41] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: API request failed (backend-fail-internal): An unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T337991 (10DAlangi_WMF)
[15:30:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[15:30:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: "To be merged once the ingress work is completed" [puppet] - 10https://gerrit.wikimedia.org/r/952151 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi)
[15:33:38] <wikibugs>	 (03PS1) 10JMeybohm: PKI: Rename aux key to match the naming scheme of everything else [labs/private] - 10https://gerrit.wikimedia.org/r/952242 (https://phabricator.wikimedia.org/T344253)
[15:34:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51361 and previous config saved to /var/cache/conftool/dbconfig/20230824-153422-ladsgroup.json
[15:34:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[15:34:28] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[15:34:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[15:34:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T343718)', diff saved to https://phabricator.wikimedia.org/P51362 and previous config saved to /var/cache/conftool/dbconfig/20230824-153443-ladsgroup.json
[15:37:38] <wikibugs>	 (03PS1) 10JMeybohm: PKI: Rename the aux profile to match the naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253)
[15:38:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] PKI: Rename the aux profile to match the naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[15:38:22] <vgutierrez>	 urandom: dunno if expected or not but per https://grafana.wikimedia.org/goto/NNe4XPRIk?orgId=1 metrics_edited-pages_aggregate_-project-_-editor-type-_-page-type-_-activity-level-_-granularity-_-start-_-end got super slow when you started rebooting servers
[15:38:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T344589)', diff saved to https://phabricator.wikimedia.org/P51363 and previous config saved to /var/cache/conftool/dbconfig/20230824-153835-ladsgroup.json
[15:38:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[15:38:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[15:39:38] <wikibugs>	 (03CR) 10Bking: rdf-streaming-updater-dse-k8s: Add Zookeeper HA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951551 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking)
[15:40:19] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43015/console" [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[15:40:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T344589)', diff saved to https://phabricator.wikimedia.org/P51364 and previous config saved to /var/cache/conftool/dbconfig/20230824-154037-ladsgroup.json
[15:40:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance
[15:40:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance
[15:41:03] <wikibugs>	 (03PS1) 10JMeybohm: Revert "jeager: Temporarily lower the lifetime of TLS certs to 2 days" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952131 (https://phabricator.wikimedia.org/T344253)
[15:41:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T344589)', diff saved to https://phabricator.wikimedia.org/P51365 and previous config saved to /var/cache/conftool/dbconfig/20230824-154102-ladsgroup.json
[15:41:21] <wikibugs>	 (03PS2) 10JMeybohm: Revert "jeager: Temporarily lower the lifetime of TLS certs to 2 days" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952131 (https://phabricator.wikimedia.org/T344253)
[15:42:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[15:42:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[15:42:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T344589)', diff saved to https://phabricator.wikimedia.org/P51366 and previous config saved to /var/cache/conftool/dbconfig/20230824-154238-ladsgroup.json
[15:43:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "jeager: Temporarily lower the lifetime of TLS certs to 2 days" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952131 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[15:44:10] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:45:27] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[15:45:31] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[15:45:34] <wikibugs>	 (03PS2) 10JMeybohm: PKI: Rename the aux profile to match the naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253)
[15:48:26] <wikibugs>	 (03PS1) 10JHathaway: dev env: disable cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/952245 (https://phabricator.wikimedia.org/T337970)
[15:48:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T344589)', diff saved to https://phabricator.wikimedia.org/P51367 and previous config saved to /var/cache/conftool/dbconfig/20230824-154829-ladsgroup.json
[15:49:13] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] dev env: disable cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/952245 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway)
[15:49:45] <wikibugs>	 (03CR) 10JMeybohm: "labs/private change is at https://gerrit.wikimedia.org/r/c/labs/private/+/952242" [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[15:49:47] <wikibugs>	 (03CR) 10Gmodena: [C: 03+2] data-engineering: flink: alert when TM is missing for 5m. [alerts] - 10https://gerrit.wikimedia.org/r/951959 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena)
[15:49:52] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] PKI: Rename aux key to match the naming scheme of everything else [labs/private] - 10https://gerrit.wikimedia.org/r/952242 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[15:49:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T344589)', diff saved to https://phabricator.wikimedia.org/P51368 and previous config saved to /var/cache/conftool/dbconfig/20230824-154956-ladsgroup.json
[15:50:40] <wikibugs>	 (03PS1) 10JMeybohm: aux: Rename the aux profile to match the naming scheme [deployment-charts] - 10https://gerrit.wikimedia.org/r/952246 (https://phabricator.wikimedia.org/T344253)
[15:50:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] PKI: Rename the aux profile to match the naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[15:51:01] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED
[15:51:03] <wikibugs>	 (03Merged) 10jenkins-bot: data-engineering: flink: alert when TM is missing for 5m. [alerts] - 10https://gerrit.wikimedia.org/r/951959 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena)
[15:51:05] <wikibugs>	 (03CR) 10JMeybohm: "This depends on I1b8896cfce4f8f07d979635beacdfd7fe90bd7ed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952246 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[15:51:26] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED
[15:54:09] <wikibugs>	 (03PS1) 10Ssingh: lvs/esams: unify LVS hiera overrides for esams [puppet] - 10https://gerrit.wikimedia.org/r/952247
[15:55:08] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43016/console" [puppet] - 10https://gerrit.wikimedia.org/r/952247 (owner: 10Ssingh)
[15:56:24] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Revert "jeager: Temporarily lower the lifetime of TLS certs to 2 days" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952131 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[15:56:58] <wikibugs>	 (03PS2) 10Ssingh: lvs/esams: unify LVS hiera overrides for esams [puppet] - 10https://gerrit.wikimedia.org/r/952247
[15:57:05] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "jeager: Temporarily lower the lifetime of TLS certs to 2 days" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952131 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[15:57:30] <wikibugs>	 (03PS3) 10Btullis: Fail back hive to the primary coordinator [dns] - 10https://gerrit.wikimedia.org/r/952194 (https://phabricator.wikimedia.org/T344671)
[15:57:32] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] lvs/esams: unify LVS hiera overrides for esams [puppet] - 10https://gerrit.wikimedia.org/r/952247 (owner: 10Ssingh)
[15:58:06] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43017/console" [puppet] - 10https://gerrit.wikimedia.org/r/952247 (owner: 10Ssingh)
[16:00:04] <jouncebot>	 jbond: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1600). Please do the needful.
[16:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:09] <wikibugs>	 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for Kizule (aka Zoranzoki21) - https://phabricator.wikimedia.org/T344887 (10Kizule) > What do you plan to use deployment access for? For deploying config patches from https://wikitech.wikimedia.org/wiki/Deployments.
[16:00:18] <sukhe>	 !log disable puppet on A:lvs and A:esams to merge 952247
[16:00:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:21] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] lvs/esams: unify LVS hiera overrides for esams [puppet] - 10https://gerrit.wikimedia.org/r/952247 (owner: 10Ssingh)
[16:01:11] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Fail back hive to the primary coordinator [dns] - 10https://gerrit.wikimedia.org/r/952194 (https://phabricator.wikimedia.org/T344671) (owner: 10Btullis)
[16:01:33] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, small documentation comment inline" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[16:02:36] <wikibugs>	 (03PS15) 10Giuseppe Lavagetto: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[16:03:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P51369 and previous config saved to /var/cache/conftool/dbconfig/20230824-160335-ladsgroup.json
[16:03:54] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED
[16:04:25] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Actually, your change is missing some changes that were added to networkpolicy 1.0.1 I think, you should backport it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[16:04:33] <sukhe>	 !log enable puppet on A:lvs and A:esams and force run agent to merge 952247
[16:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20230824-160502-ladsgroup.json
[16:05:11] <wikibugs>	 (03PS1) 10Jbond: pupetdb: add netbox::standalone to allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/952251
[16:05:29] <wikibugs>	 (03Abandoned) 10Btullis: query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/952215 (https://phabricator.wikimedia.org/T343856) (owner: 10Btullis)
[16:06:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pupetdb: add netbox::standalone to allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/952251 (owner: 10Jbond)
[16:08:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED
[16:09:40] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED
[16:10:17] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wcqs,name=eqiad
[16:10:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T343718)', diff saved to https://phabricator.wikimedia.org/P51371 and previous config saved to /var/cache/conftool/dbconfig/20230824-161050-ladsgroup.json
[16:10:56] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[16:11:28] <wikibugs>	 10SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10BCornwall) After some discussion in our last ONFIRE meeting it appears that our most basic needs comprise of:  1. A real-time editor for in-the-moment information...
[16:12:21] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-master1001.eqiad.wmnet
[16:13:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[16:15:50] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Let's try it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/952208 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene)
[16:17:01] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43018/console" [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[16:18:27] <jinxer-wm>	 (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[16:18:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P51372 and previous config saved to /var/cache/conftool/dbconfig/20230824-161841-ladsgroup.json
[16:20:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P51373 and previous config saved to /var/cache/conftool/dbconfig/20230824-162013-ladsgroup.json
[16:24:41] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] datahub: set preferred oidc jwt algotithm [deployment-charts] - 10https://gerrit.wikimedia.org/r/952208 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene)
[16:25:33] <wikibugs>	 (03Merged) 10jenkins-bot: datahub: set preferred oidc jwt algotithm [deployment-charts] - 10https://gerrit.wikimedia.org/r/952208 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene)
[16:25:44] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[16:25:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P51374 and previous config saved to /var/cache/conftool/dbconfig/20230824-162556-ladsgroup.json
[16:27:53] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[16:28:27] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[16:28:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED
[16:29:00] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED
[16:30:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED
[16:30:58] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[16:33:01] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED
[16:33:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T344589)', diff saved to https://phabricator.wikimedia.org/P51375 and previous config saved to /var/cache/conftool/dbconfig/20230824-163347-ladsgroup.json
[16:33:55] <wikibugs>	 10SRE, 10MW-on-K8s, 10Observability-Logging, 10serviceops: Keep calculating latencies for MediaWiki requests in the WikiKube environment - https://phabricator.wikimedia.org/T276095 (10kamila) Benthos is deployed and producing metrics, but I am not closing this yet, because the logs contain quite a lot of e...
[16:33:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance
[16:34:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance
[16:34:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[16:34:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[16:34:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T344589)', diff saved to https://phabricator.wikimedia.org/P51376 and previous config saved to /var/cache/conftool/dbconfig/20230824-163419-ladsgroup.json
[16:35:16] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-master1001.eqiad.wmnet
[16:35:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T344589)', diff saved to https://phabricator.wikimedia.org/P51377 and previous config saved to /var/cache/conftool/dbconfig/20230824-163519-ladsgroup.json
[16:35:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[16:35:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[16:35:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T344589)', diff saved to https://phabricator.wikimedia.org/P51378 and previous config saved to /var/cache/conftool/dbconfig/20230824-163543-ladsgroup.json
[16:38:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2025']
[16:39:02] <wikibugs>	 (03PS4) 10FNegri: New files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285)
[16:41:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P51380 and previous config saved to /var/cache/conftool/dbconfig/20230824-164103-ladsgroup.json
[16:41:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T344589)', diff saved to https://phabricator.wikimedia.org/P51381 and previous config saved to /var/cache/conftool/dbconfig/20230824-164140-ladsgroup.json
[16:43:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T344589)', diff saved to https://phabricator.wikimedia.org/P51382 and previous config saved to /var/cache/conftool/dbconfig/20230824-164301-ladsgroup.json
[16:48:12] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2025']
[16:48:40] <wikibugs>	 10SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10herron)
[16:49:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2025']
[16:49:29] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2025']
[16:49:30] <wikibugs>	 10SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10herron)
[16:50:34] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Jhancock.wm)
[16:52:17] <wikibugs>	 (03PS1) 10BryanDavis: toolhub: Bump container version to 2023-08-21-195715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952254
[16:56:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T343718)', diff saved to https://phabricator.wikimedia.org/P51383 and previous config saved to /var/cache/conftool/dbconfig/20230824-165609-ladsgroup.json
[16:56:15] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[16:56:41] <wikibugs>	 (03CR) 10Dduvall: "Thanks for the review, Jbond!" [puppet] - 10https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall)
[16:56:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P51384 and previous config saved to /var/cache/conftool/dbconfig/20230824-165646-ladsgroup.json
[16:57:25] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump container version to 2023-08-21-112124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952255
[16:58:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P51385 and previous config saved to /var/cache/conftool/dbconfig/20230824-165807-ladsgroup.json
[16:59:53] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2023-08-21-195715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952254 (owner: 10BryanDavis)
[17:00:06] <jouncebot>	 bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1700).
[17:00:06] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1700)
[17:00:26] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2023-08-21-112124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952255 (owner: 10BryanDavis)
[17:00:36] <wikibugs>	 (03Merged) 10jenkins-bot: toolhub: Bump container version to 2023-08-21-195715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952254 (owner: 10BryanDavis)
[17:01:13] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2023-08-21-112124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952255 (owner: 10BryanDavis)
[17:01:36] <bd808>	 I will be deploying both toolhub and developer-portal in today's window (which I probably should rename now that Tech Engagement is gone)
[17:05:18] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply
[17:06:43] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply
[17:07:03] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply
[17:08:15] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[17:08:21] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply
[17:08:57] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[17:10:03] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[17:10:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[17:10:59] <bd808>	 !log Toolhub updated to a59d37
[17:11:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:04] <wikibugs>	 (PS1) FNegri: [openstack] automatic file duplication for upgrade [puppet] - https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285)
[17:11:33] <wikibugs>	 (CR) CI reject: [V: -1] [openstack] automatic file duplication for upgrade [puppet] - https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285) (owner: FNegri)
[17:11:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P51386 and previous config saved to /var/cache/conftool/dbconfig/20230824-171152-ladsgroup.json
[17:11:53] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:12:16] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:12:23] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:12:43] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:12:49] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:13:09] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:13:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P51387 and previous config saved to /var/cache/conftool/dbconfig/20230824-171314-ladsgroup.json
[17:15:27] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[17:17:20] <logmsgbot>	 !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@15ed2de]: (no justification provided)
[17:17:40] <logmsgbot>	 !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@15ed2de]: (no justification provided) (duration: 00m 19s)
[17:21:45] <wikibugs>	 (PS2) FNegri: [openstack] automatic file duplication for upgrade [puppet] - https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285)
[17:22:44] <wikibugs>	 sre-alert-triage, Release-Engineering-Team: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342755 (thcipriani) >>! In T342755#9115945, @fgiunchedi wrote: >>>! In T342755#9114368, @thcipriani wrote: >> Hrm. We get an email from the systemd timer for this, so the alert is probabl...
[17:23:36] <ryankemper>	 !log [WCQS] T344882 `ryankemper@wcqs1003:~$ sudo depool`
[17:23:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:42] <stashbot>	 T344882: Some servers for the Commons query service (WCQS) are missing data - https://phabricator.wikimedia.org/T344882
[17:26:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T344589)', diff saved to https://phabricator.wikimedia.org/P51388 and previous config saved to /var/cache/conftool/dbconfig/20230824-172658-ladsgroup.json
[17:27:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[17:27:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[17:27:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T344589)', diff saved to https://phabricator.wikimedia.org/P51389 and previous config saved to /var/cache/conftool/dbconfig/20230824-172723-ladsgroup.json
[17:28:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T344589)', diff saved to https://phabricator.wikimedia.org/P51390 and previous config saved to /var/cache/conftool/dbconfig/20230824-172820-ladsgroup.json
[17:28:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[17:28:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[17:28:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[17:28:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[17:28:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T344589)', diff saved to https://phabricator.wikimedia.org/P51391 and previous config saved to /var/cache/conftool/dbconfig/20230824-172851-ladsgroup.json
[17:30:17] <wikibugs>	 (CR) Bking: [C: +2] spdx.rb: Skip SPDX enforcement of txt files [puppet] - https://gerrit.wikimedia.org/r/949112 (https://phabricator.wikimedia.org/T344291) (owner: Bking)
[17:34:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T344589)', diff saved to https://phabricator.wikimedia.org/P51393 and previous config saved to /var/cache/conftool/dbconfig/20230824-173448-ladsgroup.json
[17:34:54] <wikibugs>	 (PS1) Krinkle: Add option to just create the 'Global rename script' system user [extensions/CentralAuth] (wmf/1.41.0-wmf.23) - https://gerrit.wikimedia.org/r/952132 (https://phabricator.wikimedia.org/T344632)
[17:36:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T344589)', diff saved to https://phabricator.wikimedia.org/P51394 and previous config saved to /var/cache/conftool/dbconfig/20230824-173609-ladsgroup.json
[17:36:25] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[17:39:07] <wikibugs>	 (PS3) FNegri: [openstack] automatic file duplication for upgrade [puppet] - https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285)
[17:40:00] <wikibugs>	 (PS1) Cwhite: logstash: move error to error.message when it is a string [puppet] - https://gerrit.wikimedia.org/r/951881 (https://phabricator.wikimedia.org/T276468)
[17:46:27] <wikibugs>	 (PS4) FNegri: [openstack] automatic file duplication for upgrade [puppet] - https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285)
[17:48:44] <wikibugs>	 (PS3) Bking: wdqs: Add allowlist.txt [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856)
[17:48:58] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: Add allowlist.txt [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[17:48:58] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:49:53] <wikibugs>	 (PS4) Bking: wdqs: Add allowlist.txt [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856)
[17:49:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P51395 and previous config saved to /var/cache/conftool/dbconfig/20230824-174954-ladsgroup.json
[17:50:39] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[17:51:04] <wikibugs>	 (PS5) Bking: wdqs: Add allowlist.txt [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856)
[17:51:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P51396 and previous config saved to /var/cache/conftool/dbconfig/20230824-175115-ladsgroup.json
[17:53:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[17:55:53] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[18:00:05] <jouncebot>	 dduvall and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1800).
[18:05:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P51397 and previous config saved to /var/cache/conftool/dbconfig/20230824-180500-ladsgroup.json
[18:06:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P51398 and previous config saved to /var/cache/conftool/dbconfig/20230824-180621-ladsgroup.json
[18:08:27] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[18:18:53] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:20:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T344589)', diff saved to https://phabricator.wikimedia.org/P51399 and previous config saved to /var/cache/conftool/dbconfig/20230824-182006-ladsgroup.json
[18:20:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[18:20:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[18:20:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T344589)', diff saved to https://phabricator.wikimedia.org/P51400 and previous config saved to /var/cache/conftool/dbconfig/20230824-182032-ladsgroup.json
[18:20:48] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[18:21:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T344589)', diff saved to https://phabricator.wikimedia.org/P51401 and previous config saved to /var/cache/conftool/dbconfig/20230824-182128-ladsgroup.json
[18:21:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[18:21:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[18:21:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1206 (T344589)', diff saved to https://phabricator.wikimedia.org/P51402 and previous config saved to /var/cache/conftool/dbconfig/20230824-182151-ladsgroup.json
[18:26:56] <wikibugs>	 (PS6) Bking: wdqs: Add allowlist.txt [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856)
[18:28:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T344589)', diff saved to https://phabricator.wikimedia.org/P51403 and previous config saved to /var/cache/conftool/dbconfig/20230824-182802-ladsgroup.json
[18:28:52] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[18:29:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (Jclark-ctr) pc1015  A6 U33 pc1016. C6 U31
[18:29:44] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (Jclark-ctr) a:Jclark-ctr
[18:31:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[18:35:41] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:43:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P51404 and previous config saved to /var/cache/conftool/dbconfig/20230824-184308-ladsgroup.json
[18:46:27] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[18:48:35] <wikibugs>	 (PS7) Bking: wdqs: Add allowlist.txt [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856)
[18:49:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T344589)', diff saved to https://phabricator.wikimedia.org/P51405 and previous config saved to /var/cache/conftool/dbconfig/20230824-184915-ladsgroup.json
[18:49:26] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[18:50:01] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:51:39] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (Jclark-ctr) wdqs1017  D2. U38 wdqs1018 E2 U40 wdqs1019. F2. U39
[18:51:43] <wikibugs>	 (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43019/console" [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[18:52:08] <wikibugs>	 (CR) Ryan Kemper: [C: +1] wdqs: Add allowlist.txt [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[18:52:12] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (Jclark-ctr)
[18:52:57] <wikibugs>	 (CR) Bking: [C: +2] wdqs: Add allowlist.txt [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[18:53:14] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (Jclark-ctr) a:Jclark-ctr
[18:53:21] <wikibugs>	 (CR) Btullis: wdqs: Add allowlist.txt (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[18:53:23] <wikibugs>	 (CR) Bking: [C: +2] wdqs: Add allowlist.txt (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[18:54:52] <wikibugs>	 (PS1) TrainBranchBot: all wikis to 1.41.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/952261 (https://phabricator.wikimedia.org/T343725)
[18:54:54] <wikibugs>	 (CR) TrainBranchBot: [C: +2] all wikis to 1.41.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/952261 (https://phabricator.wikimedia.org/T343725) (owner: TrainBranchBot)
[18:55:41] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.41.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/952261 (https://phabricator.wikimedia.org/T343725) (owner: TrainBranchBot)
[18:58:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P51406 and previous config saved to /var/cache/conftool/dbconfig/20230824-185816-ladsgroup.json
[18:58:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[19:01:40] <wikibugs>	 (CR) Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T2000" [extensions/CentralAuth] (wmf/1.41.0-wmf.23) - https://gerrit.wikimedia.org/r/952132 (https://phabricator.wikimedia.org/T344632) (owner: Krinkle)
[19:03:18] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.23  refs T343725
[19:03:23] <stashbot>	 T343725: 1.41.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T343725
[19:03:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:04:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P51407 and previous config saved to /var/cache/conftool/dbconfig/20230824-190422-ladsgroup.json
[19:08:27] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[19:08:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:09:09] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:10:29] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:13:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T344589)', diff saved to https://phabricator.wikimedia.org/P51408 and previous config saved to /var/cache/conftool/dbconfig/20230824-191322-ladsgroup.json
[19:14:03] <wikibugs>	 (CR) Btullis: [C: +1] wdqs: Add allowlist.txt (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[19:19:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P51409 and previous config saved to /var/cache/conftool/dbconfig/20230824-191928-ladsgroup.json
[19:22:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[19:27:27] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[19:30:49] <effie>	 !log pool kartotherian  to eqiad and depool from codfw
[19:30:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:58] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad
[19:34:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T344589)', diff saved to https://phabricator.wikimedia.org/P51410 and previous config saved to /var/cache/conftool/dbconfig/20230824-193434-ladsgroup.json
[19:34:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[19:34:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[19:34:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1207 (T344589)', diff saved to https://phabricator.wikimedia.org/P51411 and previous config saved to /var/cache/conftool/dbconfig/20230824-193458-ladsgroup.json
[19:37:09] <wikibugs>	 SRE, Traffic, observability: HAProxy metrics go down on config reload - https://phabricator.wikimedia.org/T343000 (BCornwall) I'm not sure that a smaller period does fix things. Attached is a 5m and 2m. Switching to irate() is showing similar things, too.  {F37627399}  {F37627398}
[19:43:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T344589)', diff saved to https://phabricator.wikimedia.org/P51412 and previous config saved to /var/cache/conftool/dbconfig/20230824-194317-ladsgroup.json
[19:55:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[19:58:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P51414 and previous config saved to /var/cache/conftool/dbconfig/20230824-195823-ladsgroup.json
[20:00:04] <jouncebot>	 brennen and TheresNoTime: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T2000).
[20:00:04] <jouncebot>	 jan_drewniak and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:27] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[20:00:56] <MatmaRex>	 hi
[20:01:01] <jan_drewniak>	 o/
[20:03:05] <wikibugs>	 (PS1) Effie Mouzeli: Revert "thanos-fe: switch to cfssl" [puppet] - https://gerrit.wikimedia.org/r/952133
[20:03:35] <effie>	 !log enabling puppet on thanos-fe* hosts
[20:03:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:45] <brennen>	 o/
[20:04:06] <brennen>	 MatmaRex, jan_drewniak: i'm sort of pressed for time at the moment but let's see what we can do.
[20:04:14] <thcipriani>	 I can deploy here in a sec, too
[20:04:19] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] Revert "thanos-fe: switch to cfssl" [puppet] - https://gerrit.wikimedia.org/r/952133 (owner: Effie Mouzeli)
[20:06:19] <icinga-wm>	 PROBLEM - Host logstash1037 is DOWN: PING CRITICAL - Packet loss = 100%
[20:06:23] <icinga-wm>	 RECOVERY - Host logstash1037 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[20:07:05] <wikibugs>	 (CR) Thcipriani: [C: +2] watchlist: Don't assume only named users have watchlist access [skins/MinervaNeue] (wmf/1.41.0-wmf.23) - https://gerrit.wikimedia.org/r/952130 (https://phabricator.wikimedia.org/T344870) (owner: Jdrewniak)
[20:07:36] <thcipriani>	 I'll get jenkins going for the tests that take a while
[20:08:11] <wikibugs>	 (CR) Thcipriani: [C: +2] Add option to just create the 'Global rename script' system user [extensions/CentralAuth] (wmf/1.41.0-wmf.23) - https://gerrit.wikimedia.org/r/952132 (https://phabricator.wikimedia.org/T344632) (owner: Krinkle)
[20:08:22] <thcipriani>	 and let's do the config in the interim
[20:08:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[20:09:08] <wikibugs>	 (PS3) Thcipriani: Remove unused RESTBase-related VisualEditor config settings [mediawiki-config] - https://gerrit.wikimedia.org/r/949593 (https://phabricator.wikimedia.org/T341618) (owner: Bartosz Dziewoński)
[20:09:33] <thcipriani>	 MatmaRex: since you put this up for deploy (and it *is* next week), assuming your -1 is null  :)
[20:10:21] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/949593 (https://phabricator.wikimedia.org/T341618) (owner: Bartosz Dziewoński)
[20:10:56] <MatmaRex>	 thcipriani: yes, sorry :)
[20:11:03] <wikibugs>	 (Merged) jenkins-bot: Remove unused RESTBase-related VisualEditor config settings [mediawiki-config] - https://gerrit.wikimedia.org/r/949593 (https://phabricator.wikimedia.org/T341618) (owner: Bartosz Dziewoński)
[20:11:22] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:949593|Remove unused RESTBase-related VisualEditor config settings (T341618)]]
[20:11:30] <stashbot>	 T341618: Remove deprecated RESTBase-related VE config settings - https://phabricator.wikimedia.org/T341618
[20:12:52] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and matmarex: Backport for [[gerrit:949593|Remove unused RESTBase-related VisualEditor config settings (T341618)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:13:23] <thcipriani>	 ^ MatmaRex anything to test? not exploding the test since these are "unused"?
[20:13:27] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[20:13:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P51415 and previous config saved to /var/cache/conftool/dbconfig/20230824-201329-ladsgroup.json
[20:13:31] <kostajh>	 hi, I have a small security patch that would be nice to deploy, if there's time in this window.
[20:14:27] <MatmaRex>	 thcipriani: yeah, nothing specific to test
[20:14:27] <thcipriani>	 kostajh: there's probably room for it, do you need me to deploy or are you able to deploy (I forget)?
[20:14:54] <MatmaRex>	 the visual editor still loads
[20:15:10] <kostajh>	 I'm able to deploy, but I'm not as familiar with syncing security patches so would prefer if someone else with more experience could do it
[20:15:43] <icinga-wm>	 PROBLEM - puppet last run on thanos-fe1001 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:15:47] <thcipriani>	 MatmaRex: just ran the same test :D Thanks for confirming, going live
[20:15:55] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and matmarex: Continuing with sync
[20:16:24] <thcipriani>	 kostajh: happy to deploy, wanna DM me details?
[20:16:38] <kostajh>	 sure, thank you!
[20:17:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[20:19:47] <icinga-wm>	 PROBLEM - puppet last run on thanos-fe1002 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:21:07] <icinga-wm>	 RECOVERY - puppet last run on thanos-fe1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:21:20] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:949593|Remove unused RESTBase-related VisualEditor config settings (T341618)]] (duration: 09m 58s)
[20:21:25] <stashbot>	 T341618: Remove deprecated RESTBase-related VE config settings - https://phabricator.wikimedia.org/T341618
[20:21:28] <thcipriani>	 ^ MatmaRex live now
[20:21:35] <wikibugs>	 (03Merged) 10jenkins-bot: watchlist: Don't assume only named users have watchlist access [skins/MinervaNeue] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/952130 (https://phabricator.wikimedia.org/T344870) (owner: 10Jdrewniak)
[20:21:37] <wikibugs>	 (03Merged) 10jenkins-bot: Add option to just create the 'Global rename script' system user [extensions/CentralAuth] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/952132 (https://phabricator.wikimedia.org/T344632) (owner: 10Krinkle)
[20:21:53] <icinga-wm>	 PROBLEM - puppet last run on thanos-fe1003 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:22:10] <MatmaRex>	 thanks
[20:22:27] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[20:22:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:22:59] <icinga-wm>	 PROBLEM - puppet last run on thanos-fe1004 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:25:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[20:27:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:28:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T344589)', diff saved to https://phabricator.wikimedia.org/P51416 and previous config saved to /var/cache/conftool/dbconfig/20230824-202836-ladsgroup.json
[20:28:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[20:28:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[20:29:18] <thcipriani>	 kostajh: going to sling out your security patch, then MatmaRex: and jan_drewniak your sync is going out together since one is a maintenance script
[20:29:40] <MatmaRex>	 cool, thanks
[20:29:43] <kostajh>	 +1
[20:29:47] <jan_drewniak>	 thcipriani: thanks!
[20:30:27] <jinxer-wm>	 (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[20:33:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[20:33:13] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@2455ffd]: (no justification provided)
[20:33:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[20:33:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1219 (T344589)', diff saved to https://phabricator.wikimedia.org/P51417 and previous config saved to /var/cache/conftool/dbconfig/20230824-203322-ladsgroup.json
[20:34:26] <inflatador>	 !log bking@deploy1002 'scap deploy new wdqs T343856'
[20:34:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:33] <stashbot>	 T343856: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856
[20:35:27] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[20:37:55] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@2455ffd]: (no justification provided) (duration: 04m 41s)
[20:39:21] <icinga-wm>	 RECOVERY - Host wdqs1005 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[20:40:05] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 299 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:40:23] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1005 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[20:40:23] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:40:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T344589)', diff saved to https://phabricator.wikimedia.org/P51418 and previous config saved to /var/cache/conftool/dbconfig/20230824-204035-ladsgroup.json
[20:41:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:42:29] <icinga-wm>	 PROBLEM - puppet last run on wdqs1005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:43:58] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125
[20:44:04] <stashbot>	 T343856: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856
[20:44:11] <jinxer-wm>	 (SystemdUnitFailed) firing: nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:44:18] <thcipriani>	 alright. security patch slung. I'll move on to others.
[20:44:38] <kostajh>	 thanks!
[20:45:47] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:952132|Add option to just create the 'Global rename script' system user (T344632)]], [[gerrit:952130|watchlist: Don't assume only named users have watchlist access (T344870)]]
[20:45:53] <stashbot>	 T344870: MinervaNeue: Watchstar missing for anonymous users - https://phabricator.wikimedia.org/T344870
[20:45:54] <stashbot>	 T344632: Unable to inspect Global rename script log entries on enwiki - https://phabricator.wikimedia.org/T344632
[20:46:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:47:14] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and jdrewniak and krinkle: Backport for [[gerrit:952132|Add option to just create the 'Global rename script' system user (T344632)]], [[gerrit:952130|watchlist: Don't assume only named users have watchlist access (T344870)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:47:46] <thcipriani>	 ^ jan_drewniak your change is live on mwdebug boxen, check please
[20:48:42] <jan_drewniak>	 thcipriani: perfect, thanks!
[20:51:53] <thcipriani>	 jan_drewniak: does that mean you tested and it looks perfect?
[20:52:20] <jan_drewniak>	 thcipriani: yes it does :) 
[20:52:34] <thcipriani>	 ah, ok :D
[20:52:42] <thcipriani>	 going live now
[20:52:57] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and jdrewniak and krinkle: Continuing with sync
[20:53:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[20:53:53] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:54:35] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:54:45] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-blazegraph.service,wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:55:13] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:55:17] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[20:55:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P51419 and previous config saved to /var/cache/conftool/dbconfig/20230824-205541-ladsgroup.json
[20:58:19] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:952132|Add option to just create the 'Global rename script' system user (T344632)]], [[gerrit:952130|watchlist: Don't assume only named users have watchlist access (T344870)]] (duration: 12m 31s)
[20:58:25] <stashbot>	 T344870: MinervaNeue: Watchstar missing for anonymous users - https://phabricator.wikimedia.org/T344870
[20:58:25] <stashbot>	 T344632: Unable to inspect Global rename script log entries on enwiki - https://phabricator.wikimedia.org/T344632
[20:58:32] <thcipriani>	 ^ jan_drewniak MatmaRex all sync'd now
[20:58:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:58:46] <thcipriani>	 MatmaRex: do you need me to run this maintenance script?
[20:58:49] <MatmaRex>	 thcipriani: thanks. do we have time to run the script too? it should only take a few seconds
[20:58:58] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:59:03] <thcipriani>	 ah, cool, sure, lemme login to mwmaint
[20:59:27] <MatmaRex>	 on all wikis: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --create-system-user
[20:59:43] <MatmaRex>	 thank you
[20:59:52] <thcipriani>	 so foreachwiki is the right thing, correct?
[21:00:24] <MatmaRex>	 i think so
[21:01:28] <thcipriani>	 !log mwmaint1002:foreachwiki extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --create-system-user # ref. 952132 
[21:01:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[21:03:27] <jinxer-wm>	 (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[21:03:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:04:27] <thcipriani>	 MatmaRex: it's going, in the k's now. I'll let you know when it's complete. Got a few "CentralAuth must be enabled. try again" type messages, but nothing else really.
[21:05:12] <MatmaRex>	 ah. i was hoping it'd really be a couple seconds, but i guess just starting the scripts is slower than i thought
[21:06:02] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 (duration: 22m 03s)
[21:06:07] <stashbot>	 T343856: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856
[21:06:14] <thcipriani>	 MatmaRex: done now
[21:06:25] <icinga-wm>	 RECOVERY - puppet last run on thanos-fe1004 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:06:29] <MatmaRex>	 thanks thcipriani. sorry for running over
[21:06:44] <MatmaRex>	 it worked as expected, this shows up now: https://en.wikipedia.org/wiki/Special:Log/Global_rename_script
[21:06:47] <thcipriani>	 \o/
[21:06:52] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125
[21:07:15] <thcipriani>	 kudos, alright, calling window complete! Thanks all.
[21:08:59] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:09:48] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 (duration: 02m 56s)
[21:10:45] <icinga-wm>	 RECOVERY - puppet last run on thanos-fe1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:10:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P51421 and previous config saved to /var/cache/conftool/dbconfig/20230824-211048-ladsgroup.json
[21:11:44] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[21:13:27] <jinxer-wm>	 (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[21:13:53] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:14:03] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: (no justification provided)
[21:14:05] <icinga-wm>	 RECOVERY - puppet last run on thanos-fe1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:14:43] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: (no justification provided) (duration: 00m 40s)
[21:14:50] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: (no justification provided)
[21:15:45] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: (no justification provided) (duration: 00m 55s)
[21:16:02] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125
[21:16:06] <stashbot>	 T343856: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856
[21:17:30] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: disable alerts on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/952278 (https://phabricator.wikimedia.org/T344518)
[21:18:15] <wikibugs>	 (03CR) 10Bking: [C: 03+1] wdqs: disable alerts on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/952278 (https://phabricator.wikimedia.org/T344518) (owner: 10Ryan Kemper)
[21:18:19] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 (duration: 02m 17s)
[21:18:24] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] wdqs: disable alerts on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/952278 (https://phabricator.wikimedia.org/T344518) (owner: 10Ryan Kemper)
[21:18:27] <jinxer-wm>	 (RedisMemoryFull) resolved: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[21:18:53] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:19:55] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125
[21:19:59] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:21:27] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[21:21:44] <jinxer-wm>	 (SystemdUnitCrashLoop) resolved: wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[21:23:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2025.codfw.wmnet with OS bullseye
[21:23:17] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye
[21:25:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T344589)', diff saved to https://phabricator.wikimedia.org/P51422 and previous config saved to /var/cache/conftool/dbconfig/20230824-212554-ladsgroup.json
[21:26:27] <jinxer-wm>	 (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[21:28:13] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 (duration: 08m 18s)
[21:28:18] <stashbot>	 T343856: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856
[21:28:31] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:28:53] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:28:59] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2007 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:29:15] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:29:16] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125
[21:29:17] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[21:29:31] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 (duration: 00m 15s)
[21:29:59] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:31:41] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:33:53] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:38:32] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2025.codfw.wmnet with OS bullseye
[21:38:39] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[21:38:53] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:38:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:39:59] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:41:27] <jinxer-wm>	 (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[21:43:07] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2025.codfw.wmnet with OS bullseye
[21:43:13] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:43:16] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye
[21:43:24] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2025.codfw.wmnet with OS bullseye
[21:43:30] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[21:43:33] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[21:43:35] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:44:59] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:46:27] <jinxer-wm>	 (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[21:47:48] <wikibugs>	 (03PS1) 10Cwhite: grafana: ensure prometheus/global instances removed [puppet] - 10https://gerrit.wikimedia.org/r/951882 (https://phabricator.wikimedia.org/T288196)
[21:48:29] <wikibugs>	 (03PS2) 10Cwhite: grafana: ensure prometheus/global datasources removed [puppet] - 10https://gerrit.wikimedia.org/r/951882 (https://phabricator.wikimedia.org/T288196)
[21:59:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2025.codfw.wmnet with OS bullseye
[21:59:20] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye
[22:11:27] <jinxer-wm>	 (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[22:15:43] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2025.codfw.wmnet with OS bullseye
[22:15:51] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[22:16:27] <jinxer-wm>	 (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[22:18:53] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:21:27] <jinxer-wm>	 (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[22:41:27] <jinxer-wm>	 (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[22:42:15] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:46:27] <jinxer-wm>	 (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[22:49:31] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:27] <jinxer-wm>	 (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[23:04:58] <wikibugs>	 (03PS2) 10BBlack: Revert "Send germany and UK to drmrs" [dns] - 10https://gerrit.wikimedia.org/r/951486 (owner: 10Ayounsi)
[23:06:19] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Revert "Send germany and UK to drmrs" [dns] - 10https://gerrit.wikimedia.org/r/951486 (owner: 10Ayounsi)
[23:08:24] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Revert "Send germany and UK to drmrs" [dns] - 10https://gerrit.wikimedia.org/r/951486 (owner: 10Ayounsi)
[23:09:47] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:10:06] <bblack>	 !log geodns: DE+GB mapped back to esams (were temporarily on drmrs)
[23:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:27] <jinxer-wm>	 (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[23:19:57] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:21:27] <jinxer-wm>	 (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[23:23:15] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:23:53] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) ipmiseld.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:26:27] <jinxer-wm>	 (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[23:31:27] <jinxer-wm>	 (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[23:46:27] <jinxer-wm>	 (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull