[00:00:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:10:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:11:10] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:15:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:20:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:20:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:21:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2040.codfw.wmnet with OS bullseye [00:21:07] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2040.codfw.wmnet with OS bullseye completed: - kubernetes2040 (**PASS*... 
[00:21:15] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Papaul) [00:25:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:35:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:38:21] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951869 [00:38:27] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951869 (owner: TrainBranchBot) [00:40:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:42:22] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:45:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:50:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [01:24:50] (PS4) TTO: Enable PageNotice on enwiktionary beta [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) [01:25:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - 
https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [01:26:02] hi all! [01:26:02] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951869 (owner: TrainBranchBot) [01:26:04] Long time no see [01:26:15] Any chance of a look at https://gerrit.wikimedia.org/r/668156/ ? [01:26:23] This affects beta only - does it need to be added to a deployment window? [01:26:28] Or can be merged ad hoc? [01:30:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [01:35:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [01:50:57] (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [01:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [01:55:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [01:56:03] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:58] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - 
https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:07:27] (CR) Ladsgroup: [C: +1] ores-extension: replace thresholds with numeric values [mediawiki-config] - https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) (owner: Ilias Sarantopoulos) [02:10:58] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:11:03] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:17:14] PROBLEM - Check systemd state on backup2003 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:20:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:30:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:31:03] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [02:34:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [02:34:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T344589)', diff saved to https://phabricator.wikimedia.org/P51111 and previous config saved to /var/cache/conftool/dbconfig/20230824-023407-ladsgroup.json [02:35:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1178.eqiad.wmnet with reason: Host needs maint [02:35:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1178.eqiad.wmnet with reason: Host needs maint [02:35:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:39:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T344589)', diff saved to https://phabricator.wikimedia.org/P51112 and previous config saved to /var/cache/conftool/dbconfig/20230824-023924-ladsgroup.json [02:40:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:42:54] RECOVERY - Check systemd state on backup2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:45:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - 
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:48:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [02:48:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [02:54:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P51113 and previous config saved to /var/cache/conftool/dbconfig/20230824-025431-ladsgroup.json [02:55:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:05:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:09:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P51114 and previous config saved to /var/cache/conftool/dbconfig/20230824-030937-ladsgroup.json [03:09:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:09:42] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:10:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:12:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 6.847 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:13:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.266 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:15:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:20:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:21:05] (CR) Krinkle: "Feel free to schedule for Backport deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/927279 (https://phabricator.wikimedia.org/T338183) (owner: TheDJ) [03:24:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T344589)', diff saved to https://phabricator.wikimedia.org/P51115 and previous config saved to /var/cache/conftool/dbconfig/20230824-032443-ladsgroup.json [03:24:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [03:25:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [03:25:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51116 and previous config saved to /var/cache/conftool/dbconfig/20230824-032508-ladsgroup.json [03:25:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [03:25:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [03:25:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 
(T343718)', diff saved to https://phabricator.wikimedia.org/P51117 and previous config saved to /var/cache/conftool/dbconfig/20230824-032545-ladsgroup.json [03:25:50] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [03:25:58] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:26:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P51118 and previous config saved to /var/cache/conftool/dbconfig/20230824-032633-ladsgroup.json [03:30:58] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:32:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [03:32:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [03:32:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51119 and previous config saved to /var/cache/conftool/dbconfig/20230824-033240-ladsgroup.json [03:39:11] (PS1) Gerrit maintenance bot: mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/951870 (https://phabricator.wikimedia.org/T344881) [03:39:16] (PS1) Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/951871 (https://phabricator.wikimedia.org/T344881) [03:40:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [03:40:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [03:40:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T344589)', diff saved to https://phabricator.wikimedia.org/P51120 and previous config saved to /var/cache/conftool/dbconfig/20230824-034056-ladsgroup.json [03:45:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:47:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P51121 and previous config saved to /var/cache/conftool/dbconfig/20230824-034747-ladsgroup.json [03:48:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T344589)', diff saved to https://phabricator.wikimedia.org/P51122 and previous config saved to /var/cache/conftool/dbconfig/20230824-034815-ladsgroup.json [03:55:58] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:57:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [03:57:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [04:00:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:01:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [04:01:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet 
with reason: Maintenance [04:01:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T344589)', diff saved to https://phabricator.wikimedia.org/P51123 and previous config saved to /var/cache/conftool/dbconfig/20230824-040139-ladsgroup.json [04:02:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P51124 and previous config saved to /var/cache/conftool/dbconfig/20230824-040253-ladsgroup.json [04:03:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P51125 and previous config saved to /var/cache/conftool/dbconfig/20230824-040321-ladsgroup.json [04:05:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:06:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T343718)', diff saved to https://phabricator.wikimedia.org/P51126 and previous config saved to /var/cache/conftool/dbconfig/20230824-040656-ladsgroup.json [04:07:02] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [04:08:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T344589)', diff saved to https://phabricator.wikimedia.org/P51127 and previous config saved to /var/cache/conftool/dbconfig/20230824-040808-ladsgroup.json [04:10:58] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:14:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [04:14:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db2108.codfw.wmnet with reason: Maintenance [04:14:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T343718)', diff saved to https://phabricator.wikimedia.org/P51128 and previous config saved to /var/cache/conftool/dbconfig/20230824-041421-ladsgroup.json [04:14:26] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [04:15:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2022.codfw.wmnet with reason: Maintenance [04:15:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2022.codfw.wmnet with reason: Maintenance [04:15:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2022 (T344589)', diff saved to https://phabricator.wikimedia.org/P51129 and previous config saved to /var/cache/conftool/dbconfig/20230824-041537-ladsgroup.json [04:15:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:18:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51130 and previous config saved to /var/cache/conftool/dbconfig/20230824-041759-ladsgroup.json [04:18:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P51131 and previous config saved to /var/cache/conftool/dbconfig/20230824-041827-ladsgroup.json [04:21:22] (PS1) Ladsgroup: db1178: Disable notification [puppet] - https://gerrit.wikimedia.org/r/952036 (https://phabricator.wikimedia.org/T344880) [04:22:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P51132 and previous config saved to /var/cache/conftool/dbconfig/20230824-042202-ladsgroup.json 
[04:23:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P51133 and previous config saved to /var/cache/conftool/dbconfig/20230824-042314-ladsgroup.json [04:27:40] ops-eqiad, DBA, Patch-For-Review: db1178 didn't come back online after reboot - https://phabricator.wikimedia.org/T344880 (Marostegui) [04:27:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022 (T344589)', diff saved to https://phabricator.wikimedia.org/P51134 and previous config saved to /var/cache/conftool/dbconfig/20230824-042740-ladsgroup.json [04:28:23] (CR) Ladsgroup: [C: +2] db1178: Disable notification [puppet] - https://gerrit.wikimedia.org/r/952036 (https://phabricator.wikimedia.org/T344880) (owner: Ladsgroup) [04:33:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T344589)', diff saved to https://phabricator.wikimedia.org/P51135 and previous config saved to /var/cache/conftool/dbconfig/20230824-043334-ladsgroup.json [04:36:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P51136 and previous config saved to /var/cache/conftool/dbconfig/20230824-043619-ladsgroup.json [04:37:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P51137 and previous config saved to /var/cache/conftool/dbconfig/20230824-043709-ladsgroup.json [04:38:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P51138 and previous config saved to /var/cache/conftool/dbconfig/20230824-043820-ladsgroup.json [04:39:09] (CR) Ladsgroup: [C: +1] "IMO, it's good to go." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) (owner: Ilias Sarantopoulos) [04:42:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022', diff saved to https://phabricator.wikimedia.org/P51139 and previous config saved to /var/cache/conftool/dbconfig/20230824-044247-ladsgroup.json [04:51:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P51140 and previous config saved to /var/cache/conftool/dbconfig/20230824-045125-ladsgroup.json [04:51:57] (PS2) Ladsgroup: Stop writing to the old columns of extlinks in enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/951436 (https://phabricator.wikimedia.org/T342683) [04:52:12] (CR) Ladsgroup: [C: +2] Stop writing to the old columns of extlinks in enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/951436 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup) [04:52:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T343718)', diff saved to https://phabricator.wikimedia.org/P51141 and previous config saved to /var/cache/conftool/dbconfig/20230824-045215-ladsgroup.json [04:52:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [04:52:20] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [04:52:26] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/951436 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup) [04:52:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [04:52:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T343718)', diff 
saved to https://phabricator.wikimedia.org/P51142 and previous config saved to /var/cache/conftool/dbconfig/20230824-045236-ladsgroup.json [04:52:54] (Merged) jenkins-bot: Stop writing to the old columns of extlinks in enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/951436 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup) [04:53:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T344589)', diff saved to https://phabricator.wikimedia.org/P51143 and previous config saved to /var/cache/conftool/dbconfig/20230824-045326-ladsgroup.json [04:53:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [04:53:38] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:951436|Stop writing to the old columns of extlinks in enwiki (T342683)]] [04:53:42] T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683 [04:53:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [04:53:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T344589)', diff saved to https://phabricator.wikimedia.org/P51144 and previous config saved to /var/cache/conftool/dbconfig/20230824-045352-ladsgroup.json [04:54:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T343718)', diff saved to https://phabricator.wikimedia.org/P51145 and previous config saved to /var/cache/conftool/dbconfig/20230824-045447-ladsgroup.json [04:55:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T343718)', diff saved to https://phabricator.wikimedia.org/P51146 and previous config saved to /var/cache/conftool/dbconfig/20230824-045504-ladsgroup.json [04:55:16] !log ladsgroup@deploy1002 ladsgroup: Backport for 
[[gerrit:951436|Stop writing to the old columns of extlinks in enwiki (T342683)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [04:56:13] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [04:57:22] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:57:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022', diff saved to https://phabricator.wikimedia.org/P51147 and previous config saved to /var/cache/conftool/dbconfig/20230824-045753-ladsgroup.json [04:58:48] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:01:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T344589)', diff saved to https://phabricator.wikimedia.org/P51148 and previous config saved to /var/cache/conftool/dbconfig/20230824-050137-ladsgroup.json [05:01:54] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:951436|Stop writing to the old columns of extlinks in enwiki (T342683)]] (duration: 08m 16s) [05:01:59] T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683 [05:06:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P51149 and previous config saved to /var/cache/conftool/dbconfig/20230824-050632-ladsgroup.json [05:09:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P51150 and previous config saved to 
/var/cache/conftool/dbconfig/20230824-050953-ladsgroup.json [05:10:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P51151 and previous config saved to /var/cache/conftool/dbconfig/20230824-051010-ladsgroup.json [05:13:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022 (T344589)', diff saved to https://phabricator.wikimedia.org/P51152 and previous config saved to /var/cache/conftool/dbconfig/20230824-051259-ladsgroup.json [05:16:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P51153 and previous config saved to /var/cache/conftool/dbconfig/20230824-051644-ladsgroup.json [05:19:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344881 [05:19:22] T344881: Switchover s4 master (db1160 -> db1138) - https://phabricator.wikimedia.org/T344881 [05:19:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344881 [05:19:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1138 with weight 0 T344881', diff saved to https://phabricator.wikimedia.org/P51154 and previous config saved to /var/cache/conftool/dbconfig/20230824-051951-ladsgroup.json [05:21:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P51155 and previous config saved to /var/cache/conftool/dbconfig/20230824-052138-ladsgroup.json [05:21:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [05:21:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [05:21:58] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:22:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:22:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T344589)', diff saved to https://phabricator.wikimedia.org/P51156 and previous config saved to /var/cache/conftool/dbconfig/20230824-052208-ladsgroup.json [05:25:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P51157 and previous config saved to /var/cache/conftool/dbconfig/20230824-052459-ladsgroup.json [05:25:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P51158 and previous config saved to /var/cache/conftool/dbconfig/20230824-052517-ladsgroup.json [05:25:46] SRE, SRE-swift-storage, Beta-Cluster-Infrastructure: Captchas are broken in the beta cluster - https://phabricator.wikimedia.org/T344834 (Tgr) According to the [[https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/ad097dd6ef45fad3612ca33371f5c478870fbaa6/modules/swift/templates/proxy-s... 
[05:28:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T344589)', diff saved to https://phabricator.wikimedia.org/P51159 and previous config saved to /var/cache/conftool/dbconfig/20230824-052829-ladsgroup.json [05:30:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:31:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P51160 and previous config saved to /var/cache/conftool/dbconfig/20230824-053150-ladsgroup.json [05:35:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:40:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T343718)', diff saved to https://phabricator.wikimedia.org/P51161 and previous config saved to /var/cache/conftool/dbconfig/20230824-054005-ladsgroup.json [05:40:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [05:40:12] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [05:40:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [05:40:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:40:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2108 (T343718)', diff saved to https://phabricator.wikimedia.org/P51162 and previous config saved to /var/cache/conftool/dbconfig/20230824-054023-ladsgroup.json [05:40:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [05:40:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:40:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T343718)', diff saved to https://phabricator.wikimedia.org/P51163 and previous config saved to /var/cache/conftool/dbconfig/20230824-054033-ladsgroup.json [05:40:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [05:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T343718)', diff saved to https://phabricator.wikimedia.org/P51164 and previous config saved to /var/cache/conftool/dbconfig/20230824-054044-ladsgroup.json [05:40:58] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:42:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T343718)', diff saved to https://phabricator.wikimedia.org/P51165 and previous config saved to /var/cache/conftool/dbconfig/20230824-054244-ladsgroup.json [05:43:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P51166 and previous config saved to /var/cache/conftool/dbconfig/20230824-054335-ladsgroup.json [05:46:23] (03PS2) 10Ladsgroup: mariadb: Promote db1138 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/951870 
(https://phabricator.wikimedia.org/T344881) (owner: 10Gerrit maintenance bot) [05:46:28] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db1138 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/951870 (https://phabricator.wikimedia.org/T344881) (owner: 10Gerrit maintenance bot) [05:46:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T344589)', diff saved to https://phabricator.wikimedia.org/P51167 and previous config saved to /var/cache/conftool/dbconfig/20230824-054656-ladsgroup.json [05:47:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [05:47:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [05:47:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [05:47:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [05:47:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T344589)', diff saved to https://phabricator.wikimedia.org/P51168 and previous config saved to /var/cache/conftool/dbconfig/20230824-054726-ladsgroup.json [05:48:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:50:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:50:58] (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:53:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:54:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:55:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T344589)', diff saved to https://phabricator.wikimedia.org/P51169 and previous config saved to /var/cache/conftool/dbconfig/20230824-055511-ladsgroup.json [05:57:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P51170 and previous config saved to /var/cache/conftool/dbconfig/20230824-055750-ladsgroup.json [05:58:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
es2023.codfw.wmnet with reason: Maintenance [05:58:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2023.codfw.wmnet with reason: Maintenance [05:58:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P51171 and previous config saved to /var/cache/conftool/dbconfig/20230824-055842-ladsgroup.json [05:58:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2023 (T344589)', diff saved to https://phabricator.wikimedia.org/P51172 and previous config saved to /var/cache/conftool/dbconfig/20230824-055846-ladsgroup.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T0600) [06:00:05] kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T0600). 
[06:00:11] o/ [06:00:15] about to switchover s4 [06:00:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:01:47] !log Starting s4 eqiad failover from db1160 to db1138 - T344881 [06:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:52] T344881: Switchover s4 master (db1160 -> db1138) - https://phabricator.wikimedia.org/T344881 [06:01:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T344881', diff saved to https://phabricator.wikimedia.org/P51173 and previous config saved to /var/cache/conftool/dbconfig/20230824-060157-ladsgroup.json [06:02:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1138 to s4 primary and set section read-write T344881', diff saved to https://phabricator.wikimedia.org/P51174 and previous config saved to /var/cache/conftool/dbconfig/20230824-060245-ladsgroup.json [06:04:16] (03CR) 10Ladsgroup: [C: 03+2] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/951871 (https://phabricator.wikimedia.org/T344881) (owner: 10Gerrit maintenance bot) [06:04:44] 10SRE, 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: Captchas are broken in the beta cluster - https://phabricator.wikimedia.org/T344834 (10Urbanecm_WMF) Thanks for the info, @Tgr! I [fixed](https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/bf3e99977c63c4b65bfd211d3fd960e7700f5d5f%... 
[06:05:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:06:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1160 T344881', diff saved to https://phabricator.wikimedia.org/P51175 and previous config saved to /var/cache/conftool/dbconfig/20230824-060647-ladsgroup.json [06:06:54] T344881: Switchover s4 master (db1160 -> db1138) - https://phabricator.wikimedia.org/T344881 [06:08:16] 10SRE, 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: Captchas are broken in the beta cluster - https://phabricator.wikimedia.org/T344834 (10Urbanecm_WMF) 05Open→03Resolved p:05Triage→03High a:03Urbanecm_WMF Boldly resolving. [06:09:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance [06:09:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance [06:09:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1024 (T344589)', diff saved to https://phabricator.wikimedia.org/P51176 and previous config saved to /var/cache/conftool/dbconfig/20230824-060924-ladsgroup.json [06:09:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [06:09:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [06:10:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P51177 and previous config saved to /var/cache/conftool/dbconfig/20230824-061017-ladsgroup.json [06:12:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P51178 and 
previous config saved to /var/cache/conftool/dbconfig/20230824-061256-ladsgroup.json [06:13:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T344589)', diff saved to https://phabricator.wikimedia.org/P51179 and previous config saved to /var/cache/conftool/dbconfig/20230824-061348-ladsgroup.json [06:13:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [06:14:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [06:14:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T344589)', diff saved to https://phabricator.wikimedia.org/P51180 and previous config saved to /var/cache/conftool/dbconfig/20230824-061413-ladsgroup.json [06:14:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:15:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:17:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023 (T344589)', diff saved to https://phabricator.wikimedia.org/P51181 and previous config saved to /var/cache/conftool/dbconfig/20230824-061748-ladsgroup.json [06:18:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [06:18:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db2179.codfw.wmnet with reason: Maintenance [06:18:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T344589)', diff saved to https://phabricator.wikimedia.org/P51182 and previous config saved to /var/cache/conftool/dbconfig/20230824-061813-ladsgroup.json [06:20:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:21:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T344589)', diff saved to https://phabricator.wikimedia.org/P51183 and previous config saved to /var/cache/conftool/dbconfig/20230824-062127-ladsgroup.json [06:21:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T343718)', diff saved to https://phabricator.wikimedia.org/P51184 and previous config saved to /var/cache/conftool/dbconfig/20230824-062143-ladsgroup.json [06:21:48] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:24:31] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:25:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P51185 and previous config saved to /var/cache/conftool/dbconfig/20230824-062523-ladsgroup.json [06:26:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T344589)', diff saved to https://phabricator.wikimedia.org/P51186 and previous config saved to /var/cache/conftool/dbconfig/20230824-062645-ladsgroup.json 
[06:27:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [06:27:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [06:28:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T343718)', diff saved to https://phabricator.wikimedia.org/P51187 and previous config saved to /var/cache/conftool/dbconfig/20230824-062802-ladsgroup.json [06:28:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [06:28:07] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:28:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [06:28:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51188 and previous config saved to /var/cache/conftool/dbconfig/20230824-062824-ladsgroup.json [06:29:31] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:30:51] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/951873 (https://phabricator.wikimedia.org/T344883) [06:31:03] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:31:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344883 [06:31:24] T344883: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T344883 [06:31:36] (03PS1) 10Gergő Tisza: multi-dc: Fix central autologin URL pattern [puppet] - 10https://gerrit.wikimedia.org/r/952045 [06:31:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344883 [06:32:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2179 with weight 0 T344883', diff saved to https://phabricator.wikimedia.org/P51189 and previous config saved to /var/cache/conftool/dbconfig/20230824-063240-ladsgroup.json [06:32:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023', diff saved to https://phabricator.wikimedia.org/P51190 and previous config saved to /var/cache/conftool/dbconfig/20230824-063255-ladsgroup.json [06:36:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P51191 and previous config saved to /var/cache/conftool/dbconfig/20230824-063633-ladsgroup.json [06:36:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P51192 and previous config saved to /var/cache/conftool/dbconfig/20230824-063649-ladsgroup.json [06:40:22] !log killed mwscript updateSpecialPages.php metawiki --override --only=Mostlinked blocking db depool [06:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T344589)', diff saved to https://phabricator.wikimedia.org/P51193 and previous config saved 
to /var/cache/conftool/dbconfig/20230824-064030-ladsgroup.json [06:40:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [06:40:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [06:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51194 and previous config saved to /var/cache/conftool/dbconfig/20230824-064044-ladsgroup.json [06:40:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:41:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P51195 and previous config saved to /var/cache/conftool/dbconfig/20230824-064152-ladsgroup.json [06:42:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P51196 and previous config saved to /var/cache/conftool/dbconfig/20230824-064205-ladsgroup.json [06:42:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:42:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast1003.wikimedia.org [06:48:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023', diff saved to https://phabricator.wikimedia.org/P51197 and previous config saved to 
/var/cache/conftool/dbconfig/20230824-064801-ladsgroup.json [06:48:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast1003.wikimedia.org [06:48:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51198 and previous config saved to /var/cache/conftool/dbconfig/20230824-064830-ladsgroup.json [06:51:04] (03CR) 10Muehlenhoff: Make nftables::service types more compatible (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [06:51:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P51199 and previous config saved to /var/cache/conftool/dbconfig/20230824-065140-ladsgroup.json [06:51:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P51200 and previous config saved to /var/cache/conftool/dbconfig/20230824-065155-ladsgroup.json [06:52:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:55:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install1004.wikimedia.org [06:56:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P51201 and previous config saved to /var/cache/conftool/dbconfig/20230824-065658-ladsgroup.json [06:57:43] (03CR) 10JMeybohm: [C: 03+2] wikifunctions: Fix networkpolicies 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/950188 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [06:57:46] (03CR) 10JMeybohm: [C: 03+2] modules/base: networkpolicy_1.0.1 Add support for extraRules [deployment-charts] - 10https://gerrit.wikimedia.org/r/950187 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [06:57:48] (03CR) 10JMeybohm: [C: 03+2] modules/base: Copy networkpolicy_1.0.0 to networkpolicy_1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/950186 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [06:59:06] (03Merged) 10jenkins-bot: modules/base: Copy networkpolicy_1.0.0 to networkpolicy_1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/950186 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [06:59:08] (03Merged) 10jenkins-bot: modules/base: networkpolicy_1.0.1 Add support for extraRules [deployment-charts] - 10https://gerrit.wikimedia.org/r/950187 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [06:59:10] (03Merged) 10jenkins-bot: wikifunctions: Fix networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/950188 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [06:59:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:59:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1004.wikimedia.org [07:00:04] Amir1, apergos, and jnuche: (Dis)respected human, time to deploy UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T0700). 
Please do the needful. [07:00:04] tto and kizule: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:17] morning! [07:00:19] Kizule hi there! [07:00:24] Hi! [07:00:24] apergos g'day! [07:00:44] My patch is 2.5 years old, please treat it gently [07:00:49] we have a trainee signed up for today to learn how to deploy. I'll wait for them to show up in google meet (I don't have their irc nick to ping them here). [07:01:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:01:15] are either of you, tto and Kizule, self-deployers or will you need our assistance today? [07:01:23] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [07:01:25] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [07:01:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install2004.wikimedia.org [07:01:34] No, I'll be needing your assistance [07:01:54] apergos: What do you mean by assistance? I don't have access to any of the servers. ;) [07:02:19] then we'll be doing the deployment, and asking you to test at a couple of points during the process. all good! 
[07:02:35] Just fyi, I'm on a slightly unstable connection, so if I disappear I'll reconnect asap [07:02:39] okay :) [07:03:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023 (T344589)', diff saved to https://phabricator.wikimedia.org/P51202 and previous config saved to /var/cache/conftool/dbconfig/20230824-070307-ladsgroup.json [07:03:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: Maintenance [07:03:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: Maintenance [07:03:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2025 (T344589)', diff saved to https://phabricator.wikimedia.org/P51203 and previous config saved to /var/cache/conftool/dbconfig/20230824-070332-ladsgroup.json [07:03:33] tto: I notice that you have a cr -1 about an issue which I assume was addressed in the latest patchset. however if you could get a cr on that before I deploy, that would be good. [07:03:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P51204 and previous config saved to /var/cache/conftool/dbconfig/20230824-070343-ladsgroup.json [07:04:01] Yes, I did address that, the CR was asking me to add the extension to wmf-config/extension-list, which I did [07:04:09] Kizule: your patch looks good to go, as soon as our trainee arrives, or after 5 more minutes, whichever comes first :-) [07:04:17] Who would one get CR from at this hour? I'm out of the loop on these things [07:04:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Pool es2025', diff saved to https://phabricator.wikimedia.org/P51205 and previous config saved to /var/cache/conftool/dbconfig/20230824-070417-ladsgroup.json [07:04:53] I can give +1. 
;) [07:05:02] tto: deployers running the window aren't really supposed to be doing cr, we would expect patches to come to us with +1 on them already, though there is some discussion as to whether that should apply to config patches, see here: https://phabricator.wikimedia.org/T344409 [07:05:25] Reedy: you awake yet? [07:05:36] it's pretty early for him I think [07:05:47] A live discussion on that task, I see [07:05:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2004.wikimedia.org [07:05:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:06:30] The deploy is not urgent, I'm happy to wait for another time if you'd prefer. I note that this is a low-risk patch, as it only touches beta cluster, but your call in the end [07:06:41] Reedy is in UK right? He'd likely be asleep [07:06:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T344589)', diff saved to https://phabricator.wikimedia.org/P51206 and previous config saved to /var/cache/conftool/dbconfig/20230824-070646-ladsgroup.json [07:06:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [07:07:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T343718)', diff saved to https://phabricator.wikimedia.org/P51207 and previous config saved to /var/cache/conftool/dbconfig/20230824-070702-ladsgroup.json [07:07:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [07:07:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [07:07:07] T343718: Drop old columns of externallinks - 
https://phabricator.wikimedia.org/T343718 [07:07:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T344589)', diff saved to https://phabricator.wikimedia.org/P51208 and previous config saved to /var/cache/conftool/dbconfig/20230824-070710-ladsgroup.json [07:07:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [07:07:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T343718)', diff saved to https://phabricator.wikimedia.org/P51209 and previous config saved to /var/cache/conftool/dbconfig/20230824-070723-ladsgroup.json [07:08:18] tto: if no one comes along who can give a meaningful +1 (I couldn't, for example) in time for the morning window, then yes if you don't mind, I'd ask you to wait. and thanks for being understanding about it. [07:08:23] It's 8am. He's UK like me I think. [07:08:36] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [07:08:45] As a general comment (maybe I should add it to the task), the experience of getting things done as a volunteer (in my case, getting a new extension deployed) is already incredibly difficult if you "don't know the right people", so I'd not support anything that would add hurdles to that experience [07:08:59] in two minutes if our trainee has not shown up, I'll proceed with your patch, Kizule [07:09:01] Anyway if anyone is able to CR, great, otherwise, let's leave it for now [07:09:02] Sadly I have to go straight into busy at work so can't help [07:09:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:09:28] tto: I totally understand, and yes, you should comment right on the task where other people will see it [07:09:36] apergos: Sounds good [07:09:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T344589)', diff saved to https://phabricator.wikimedia.org/P51210 and previous config saved to /var/cache/conftool/dbconfig/20230824-070946-ladsgroup.json [07:09:54] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [07:10:49] *ding* our trainee is late or has not got the date right, so I will proceed [07:11:06] (03PS2) 10JMeybohm: admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/950189 (https://phabricator.wikimedia.org/T344177) [07:11:33] (03CR) 10ArielGlenn: [C: 03+2] [enwiktionary] Remove the Index and Index_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951914 (https://phabricator.wikimedia.org/T344816) (owner: 10Zoranzoki21) [07:12:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T344589)', diff saved to https://phabricator.wikimedia.org/P51211 and previous config saved to /var/cache/conftool/dbconfig/20230824-071204-ladsgroup.json [07:12:15] (03Merged) 10jenkins-bot: [enwiktionary] Remove the Index and Index_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951914 (https://phabricator.wikimedia.org/T344816) (owner: 10Zoranzoki21) [07:12:44] (03PS1) 10Ayounsi: Puppet: remove all mentions of knams [puppet] - 10https://gerrit.wikimedia.org/r/952046 (https://phabricator.wikimedia.org/T344579) [07:13:05] !log ariel@deploy1002 Started scap: Backport for [[gerrit:951914|[enwiktionary] Remove the Index and Index_talk namespaces (T344816)]] [07:13:10] T344816: Delete the Index namespace at English Wiktionary - https://phabricator.wikimedia.org/T344816 [07:13:24] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T344589)', diff saved to https://phabricator.wikimedia.org/P51212 and previous config saved to /var/cache/conftool/dbconfig/20230824-071323-ladsgroup.json [07:13:29] (03PS1) 10Aklapper: phabricator: Stop logging Bugzilla redirector misses [puppet] - 10https://gerrit.wikimedia.org/r/952047 (https://phabricator.wikimedia.org/T344884) [07:14:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3003.wikimedia.org [07:14:40] !log ariel@deploy1002 zoranzoki21 and ariel: Backport for [[gerrit:951914|[enwiktionary] Remove the Index and Index_talk namespaces (T344816)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:14:57] Kizule: your change is live on mwdebug1002, please test it there [07:14:57] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/950189 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [07:15:15] apergos: Okay, I'll keep you updated. [07:15:22] great! [07:15:38] (03PS1) 10Ayounsi: Homer-public: remove mentions of knams [homer/public] - 10https://gerrit.wikimedia.org/r/952048 (https://phabricator.wikimedia.org/T344579) [07:16:57] (03CR) 10JMeybohm: [C: 03+1] dragonfly::dfdaemon: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/951079 (owner: 10Muehlenhoff) [07:16:59] apergos: Good to go [07:17:19] okay, proceeding. 
[07:17:24] !log ariel@deploy1002 zoranzoki21 and ariel: Continuing with sync [07:17:28] (03Merged) 10jenkins-bot: admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/950189 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [07:17:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51213 and previous config saved to /var/cache/conftool/dbconfig/20230824-071757-ladsgroup.json [07:18:04] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:18:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P51214 and previous config saved to /var/cache/conftool/dbconfig/20230824-071849-ladsgroup.json [07:19:25] (03PS1) 10Ayounsi: netbox reports: remove mentions of knams [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/952049 (https://phabricator.wikimedia.org/T344579) [07:21:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3003.wikimedia.org [07:21:05] (03CR) 10Ayounsi: [C: 03+2] Homer-public: remove mentions of knams [homer/public] - 10https://gerrit.wikimedia.org/r/952048 (https://phabricator.wikimedia.org/T344579) (owner: 10Ayounsi) [07:21:27] (03CR) 10Ayounsi: [C: 03+2] netbox reports: remove mentions of knams [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/952049 (https://phabricator.wikimedia.org/T344579) (owner: 10Ayounsi) [07:21:57] (03Merged) 10jenkins-bot: netbox reports: remove mentions of knams [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/952049 (https://phabricator.wikimedia.org/T344579) (owner: 10Ayounsi) [07:22:43] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[07:22:56] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [07:23:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Deool es2025', diff saved to https://phabricator.wikimedia.org/P51215 and previous config saved to /var/cache/conftool/dbconfig/20230824-072301-ladsgroup.json [07:23:07] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:951914|[enwiktionary] Remove the Index and Index_talk namespaces (T344816)]] (duration: 10m 01s) [07:23:12] Kizule: your change is now live in production, please test it there :-) [07:23:14] T344816: Delete the Index namespace at English Wiktionary - https://phabricator.wikimedia.org/T344816 [07:24:12] Looks good, thank you! [07:24:18] great! [07:24:30] thanks Kizule and apergos! (this was actually a task I filed :) ) [07:24:39] sweet! [07:24:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P51216 and previous config saved to /var/cache/conftool/dbconfig/20230824-072453-ladsgroup.json [07:25:11] You're welcome. apergos: Can you give me link to page where is training mentioned? I can't find it. [07:25:21] sure! [07:25:35] https://wikitech.wikimedia.org/wiki/Deployments/Training [07:25:39] this talks about it [07:26:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install4002.wikimedia.org [07:26:15] apergos: Thanks, I wanted that one. [07:26:24] if you want to get trained (which I recommend, whether I'm doing it or someone else, we always want more deployers!), you can sign up by making a phab task here: https://phabricator.wikimedia.org/project/board/5265/ [07:26:49] I mean, tag it with that and it will go right into the backlog for someone to set it up with you. [07:27:21] Yes, that's why I asked. I'm already working on creating a task per instructions from page on Wikitech. [07:27:24] Thanks! [07:27:38] excellent! maybe I'll see you at one of these sessions as a trainee. 
[07:28:04] tto: I'm happy to keep the window open for awhile yet, in case Reedy or someone else shows up who would do that +1 [07:28:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P51217 and previous config saved to /var/cache/conftool/dbconfig/20230824-072829-ladsgroup.json [07:29:08] Thanks for offering apergos, but all good. Rather than waiting around, let's both go and enjoy our days! [07:29:23] ok! see everyone next time, have a great rest of your day! [07:30:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4002.wikimedia.org [07:30:23] !log UTC morning backport and config deployment window complete [07:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install5002.wikimedia.org [07:33:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P51218 and previous config saved to /var/cache/conftool/dbconfig/20230824-073304-ladsgroup.json [07:33:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51219 and previous config saved to /var/cache/conftool/dbconfig/20230824-073355-ladsgroup.json [07:35:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install5002.wikimedia.org [07:36:56] 10SRE-Access-Requests: MediaWiki deployment shell access request for Kizule (aka Zoranzoki21) - https://phabricator.wikimedia.org/T344887 (10Kizule) [07:37:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install6002.wikimedia.org [07:38:46] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[07:39:05] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [07:39:14] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [07:39:36] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [07:39:50] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [07:39:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P51220 and previous config saved to /var/cache/conftool/dbconfig/20230824-073959-ladsgroup.json [07:41:11] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [07:41:17] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [07:41:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install6002.wikimedia.org [07:42:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P51221 and previous config saved to /var/cache/conftool/dbconfig/20230824-074216-ladsgroup.json [07:42:28] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [07:42:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344883 [07:42:40] T344883: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T344883 [07:43:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344883 [07:43:04] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server: Add jaeger user to aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/951533 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [07:43:32] (03PS1) 10Muehlenhoff: firewall::service: Create an nftables::service when using the nft provider [puppet] - 
10https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) [07:43:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P51222 and previous config saved to /var/cache/conftool/dbconfig/20230824-074336-ladsgroup.json [07:43:57] (03CR) 10CI reject: [V: 04-1] firewall::service: Create an nftables::service when using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [07:44:06] (03PS1) 10JMeybohm: Move jaeger from admin_ng to aux services [deployment-charts] - 10https://gerrit.wikimedia.org/r/952052 (https://phabricator.wikimedia.org/T344253) [07:45:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:46:36] (03CR) 10Giuseppe Lavagetto: ClusterConfig: also allow to return hostname (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951047 (owner: 10Giuseppe Lavagetto) [07:46:45] (03PS2) 10Giuseppe Lavagetto: Use ClusterConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951046 [07:46:47] (03PS2) 10Giuseppe Lavagetto: ClusterConfig: also allow to return hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951047 [07:46:49] (03PS2) 10Giuseppe Lavagetto: Replace calls to wfHostname with clusterconfig ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951048 [07:46:52] (03PS2) 10Giuseppe Lavagetto: Disable things that don't work on k8s when on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951049 [07:47:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T343718)', diff saved to https://phabricator.wikimedia.org/P51223 and previous config saved to /var/cache/conftool/dbconfig/20230824-074708-ladsgroup.json [07:47:14] T343718: Drop old columns of 
externallinks - https://phabricator.wikimedia.org/T343718 [07:48:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P51224 and previous config saved to /var/cache/conftool/dbconfig/20230824-074810-ladsgroup.json [07:48:28] (03CR) 10JMeybohm: [C: 03+2] Move jaeger from admin_ng to aux services [deployment-charts] - 10https://gerrit.wikimedia.org/r/952052 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [07:49:08] (03Abandoned) 10JMeybohm: Revert "aux: add grpc/http ports for jaeger collector" [deployment-charts] - 10https://gerrit.wikimedia.org/r/950822 (owner: 10Filippo Giunchedi) [07:49:17] (03PS2) 10Muehlenhoff: firewall::service: Create an nftables::service when using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) [07:50:08] (03CR) 10Ladsgroup: ClusterConfig: also allow to return hostname (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951047 (owner: 10Giuseppe Lavagetto) [07:50:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025 (T344589)', diff saved to https://phabricator.wikimedia.org/P51225 and previous config saved to /var/cache/conftool/dbconfig/20230824-075028-ladsgroup.json [07:50:55] (03Merged) 10jenkins-bot: Move jaeger from admin_ng to aux services [deployment-charts] - 10https://gerrit.wikimedia.org/r/952052 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [07:50:59] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:51:05] (03CR) 10Slyngshede: [C: 03+2] C:bigtop::hadoop move net-topology.py to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [07:53:17] 10sre-alert-triage, 10Release-Engineering-Team: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342755 (10fgiunchedi) >>! In T342755#9114368, @thcipriani wrote: > Hrm. We get an email from the systemd timer for this, so the alert is probably not necessary. > > We're not very familiar... [07:54:41] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10fgiunchedi) +1 on my end FWIW [07:54:51] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:55:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T344589)', diff saved to https://phabricator.wikimedia.org/P51226 and previous config saved to /var/cache/conftool/dbconfig/20230824-075505-ladsgroup.json [07:55:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1025.eqiad.wmnet with reason: Maintenance [07:55:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1025.eqiad.wmnet with reason: Maintenance [07:55:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1025 (T344589)', diff saved to https://phabricator.wikimedia.org/P51227 and previous config saved to /var/cache/conftool/dbconfig/20230824-075529-ladsgroup.json [07:56:36] (03PS3) 10Giuseppe Lavagetto: ClusterConfig: also allow to return hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951047 [07:56:38] (03PS3) 10Giuseppe Lavagetto: Replace calls to wfHostname with clusterconfig ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951048 [07:56:40] (03PS3) 10Giuseppe Lavagetto: Disable things that don't work on k8s 
when on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951049 [07:57:06] (03CR) 10Ladsgroup: [C: 03+1] ClusterConfig: also allow to return hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951047 (owner: 10Giuseppe Lavagetto) [07:57:10] (03CR) 10Giuseppe Lavagetto: ClusterConfig: also allow to return hostname (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951047 (owner: 10Giuseppe Lavagetto) [07:57:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P51228 and previous config saved to /var/cache/conftool/dbconfig/20230824-075722-ladsgroup.json [07:57:44] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [07:57:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [07:58:06] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:58:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T344589)', diff saved to https://phabricator.wikimedia.org/P51229 and previous config saved to /var/cache/conftool/dbconfig/20230824-075842-ladsgroup.json [07:58:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [07:59:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [07:59:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T344589)', diff saved to https://phabricator.wikimedia.org/P51230 and previous config saved to /var/cache/conftool/dbconfig/20230824-075906-ladsgroup.json [08:00:38] (03CR) 10Filippo Giunchedi: [C: 03+1] data-engineering: flink: alert when TM is missing for 5m. 
[alerts] - 10https://gerrit.wikimedia.org/r/951959 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [08:00:38] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Patch-For-Review: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10SLyngshede-WMF) The new version of the script have been deployed, but not ye... [08:01:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet [08:01:43] (03CR) 10Filippo Giunchedi: [C: 03+1] Convert the monitoring/prometheus ferm rules to a firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:02:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P51231 and previous config saved to /var/cache/conftool/dbconfig/20230824-080214-ladsgroup.json [08:02:34] (03PS1) 10JMeybohm: jaeger: Fix path to helmfile-defaults secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/952111 (https://phabricator.wikimedia.org/T344253) [08:02:54] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] jaeger: Fix path to helmfile-defaults secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/952111 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [08:03:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51232 and previous config saved to /var/cache/conftool/dbconfig/20230824-080316-ladsgroup.json [08:03:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [08:03:22] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:03:23] (03PS2) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh 
[puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) [08:03:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [08:03:57] (03CR) 10CI reject: [V: 04-1] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [08:05:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T344589)', diff saved to https://phabricator.wikimedia.org/P51233 and previous config saved to /var/cache/conftool/dbconfig/20230824-080522-ladsgroup.json [08:05:23] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [08:05:35] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [08:05:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025', diff saved to https://phabricator.wikimedia.org/P51234 and previous config saved to /var/cache/conftool/dbconfig/20230824-080534-ladsgroup.json [08:06:11] (03CR) 10Filippo Giunchedi: [C: 03+1] Puppet: remove all mentions of knams [puppet] - 10https://gerrit.wikimedia.org/r/952046 (https://phabricator.wikimedia.org/T344579) (owner: 10Ayounsi) [08:06:23] (03PS3) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) [08:07:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet [08:07:11] PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [08:07:13] (03CR) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/950136 
(https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [08:07:27] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers thanos-fe1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:07:41] PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [08:07:51] PROBLEM - SSH on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:07:57] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers thanos-fe1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:08:11] sigh, that's me causing a thanos timeout by running a query :( [08:08:20] should recover by itself [08:08:28] Write better queries :-) [08:09:21] slyngs: haha! 
[08:09:26] * godog frantically hits refresh [08:09:27] PROBLEM - thanos.wikimedia.org tls expiry on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [08:09:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2004.wikimedia.org [08:10:05] RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 9.669 second response time https://wikitech.wikimedia.org/wiki/Thanos [08:10:11] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:21] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:10:23] Seems funky that a query can cause a tls expiry alert :-) [08:10:25] RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe1004 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [08:10:35] RECOVERY - SSH on thanos-fe1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:10:39] (03CR) 10Ladsgroup: [C: 03+1] Replace calls to wfHostname with clusterconfig ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951048 (owner: 10Giuseppe Lavagetto) [08:10:45] RECOVERY - thanos.wikimedia.org tls expiry on thanos-fe1004 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Mon 21 Jul 2025 03:04:56 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [08:10:49] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:11:26] more of a case of the silly check that fires on timeouts [08:11:30] (03PS2) 10Ladsgroup: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/951873 (https://phabricator.wikimedia.org/T344883) (owner: 10Gerrit maintenance bot) [08:11:53] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/951873 (https://phabricator.wikimedia.org/T344883) (owner: 10Gerrit maintenance bot) [08:12:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P51235 and previous config saved to /var/cache/conftool/dbconfig/20230824-081229-ladsgroup.json [08:13:22] (03CR) 10Giuseppe Lavagetto: [C: 03+2] termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/935397 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [08:14:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2004.wikimedia.org [08:14:14] !log Starting s4 codfw failover from db2140 to db2179 - T344883 [08:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:19] T344883: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T344883 [08:14:20] (03Merged) 10jenkins-bot: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/935397 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [08:14:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2179 to s4 primary T344883', diff saved to https://phabricator.wikimedia.org/P51236 and previous config saved to 
/var/cache/conftool/dbconfig/20230824-081442-ladsgroup.json [08:15:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1003.wikimedia.org [08:16:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2140 T344883', diff saved to https://phabricator.wikimedia.org/P51237 and previous config saved to /var/cache/conftool/dbconfig/20230824-081654-ladsgroup.json [08:17:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P51238 and previous config saved to /var/cache/conftool/dbconfig/20230824-081720-ladsgroup.json [08:17:44] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply [08:17:50] (03PS1) 10Slyngshede: C:bigtop::hadoop ensure net-topology script is installed. [puppet] - 10https://gerrit.wikimedia.org/r/952112 (https://phabricator.wikimedia.org/T254480) [08:18:13] (03CR) 10CI reject: [V: 04-1] C:bigtop::hadoop ensure net-topology script is installed. 
[puppet] - 10https://gerrit.wikimedia.org/r/952112 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:19:13] (03PS2) 10Filippo Giunchedi: sre: add bandaid alert for prometheus not reloading its k8s certs [alerts] - 10https://gerrit.wikimedia.org/r/951526 (https://phabricator.wikimedia.org/T343529) [08:19:28] (WidespreadPuppetFailure) firing: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:19:33] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply [08:19:39] (03CR) 10Filippo Giunchedi: sre: add bandaid alert for prometheus not reloading its k8s certs (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/951526 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [08:20:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1003.wikimedia.org [08:20:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [08:20:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [08:21:37] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:33] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for Kizule (aka Zoranzoki21) - https://phabricator.wikimedia.org/T344887 (10taavi) Obligatory reading: {T282786} (2021) Backports: https://gerrit.wikimedia.org/r/q/owner:Zoranzoki21+-branch:master Config changes: https://gerrit.wikimedia.... 
[08:22:47] jouncebot: nowandnext [08:22:48] No deployments scheduled for the next 1 hour(s) and 37 minute(s) [08:22:48] In 1 hour(s) and 37 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1000) [08:22:48] In 1 hour(s) and 37 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1000) [08:23:20] (03PS2) 10Slyngshede: C:bigtop::hadoop ensure net-topology script is installed. [puppet] - 10https://gerrit.wikimedia.org/r/952112 (https://phabricator.wikimedia.org/T254480) [08:23:23] (03PS3) 10Majavah: Set OATHAuth multiple devices WRITE_BOTH for all fishbowls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951367 (https://phabricator.wikimedia.org/T242031) [08:24:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951367 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [08:24:11] (03PS1) 10Muehlenhoff: Fail over URL downloaders for reboot [dns] - 10https://gerrit.wikimedia.org/r/952117 [08:25:03] (03Merged) 10jenkins-bot: Set OATHAuth multiple devices WRITE_BOTH for all fishbowls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951367 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [08:25:31] !log taavi@deploy1002 Started scap: Backport for [[gerrit:951367|Set OATHAuth multiple devices WRITE_BOTH for all fishbowls (T242031)]] [08:25:39] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43000/console" [puppet] - 10https://gerrit.wikimedia.org/r/952112 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:25:40] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [08:26:40] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:bigtop::hadoop ensure net-topology script is installed. 
[puppet] - 10https://gerrit.wikimedia.org/r/952112 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:27:05] !log taavi@deploy1002 taavi: Backport for [[gerrit:951367|Set OATHAuth multiple devices WRITE_BOTH for all fishbowls (T242031)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [08:27:36] !log taavi@deploy1002 taavi: Continuing with sync [08:27:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Change db2179 groups', diff saved to https://phabricator.wikimedia.org/P51239 and previous config saved to /var/cache/conftool/dbconfig/20230824-082742-ladsgroup.json [08:27:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P51240 and previous config saved to /var/cache/conftool/dbconfig/20230824-082748-ladsgroup.json [08:27:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [08:27:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025', diff saved to https://phabricator.wikimedia.org/P51241 and previous config saved to /var/cache/conftool/dbconfig/20230824-082757-ladsgroup.json [08:28:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [08:28:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T344589)', diff saved to https://phabricator.wikimedia.org/P51242 and previous config saved to /var/cache/conftool/dbconfig/20230824-082814-ladsgroup.json [08:28:27] (03CR) 10Clément Goubert: [C: 03+1] sre: add bandaid alert for prometheus not reloading its k8s certs [alerts] - 10https://gerrit.wikimedia.org/r/951526 (https://phabricator.wikimedia.org/T343529) (owner: 
10Filippo Giunchedi) [08:28:57] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/termbox: apply [08:29:49] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [08:30:07] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_drmrs and A:cp [08:30:18] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/termbox: apply [08:30:33] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_drmrs and A:cp [08:30:53] (03CR) 10Filippo Giunchedi: prometheus: Add recording rules for istio traffic on k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [08:30:56] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [08:32:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T343718)', diff saved to https://phabricator.wikimedia.org/P51243 and previous config saved to /var/cache/conftool/dbconfig/20230824-083226-ladsgroup.json [08:32:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [08:32:32] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:32:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [08:32:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T343718)', diff saved to https://phabricator.wikimedia.org/P51244 and previous config saved to /var/cache/conftool/dbconfig/20230824-083248-ladsgroup.json [08:33:16] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:951367|Set OATHAuth multiple devices WRITE_BOTH for all fishbowls (T242031)]] (duration: 07m 45s) [08:33:21] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 
[08:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:35:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T344589)', diff saved to https://phabricator.wikimedia.org/P51245 and previous config saved to /var/cache/conftool/dbconfig/20230824-083537-ladsgroup.json [08:35:43] (03CR) 10Klausman: prometheus: Add recording rules for istio traffic on k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [08:35:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [08:36:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [08:36:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [08:36:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T343718)', diff saved to https://phabricator.wikimedia.org/P51246 and previous config saved to /var/cache/conftool/dbconfig/20230824-083644-ladsgroup.json [08:37:22] (03PS4) 10Klausman: prometheus: Add recording rules for istio traffic on k8s [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) [08:37:52] (03CR) 10Muehlenhoff: [C: 03+2] Convert the monitoring/prometheus ferm rules to a firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:38:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:40:41] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43001/console" [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [08:40:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T343718)', diff saved to https://phabricator.wikimedia.org/P51247 and previous config saved to /var/cache/conftool/dbconfig/20230824-084055-ladsgroup.json [08:40:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [08:41:01] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:41:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [08:41:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [08:42:01] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Add recording rules for istio traffic on k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [08:42:30] PROBLEM - Check systemd state on kubernetes1026 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:43:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025 (T344589)', diff saved to https://phabricator.wikimedia.org/P51248 and previous config saved to /var/cache/conftool/dbconfig/20230824-084303-ladsgroup.json [08:43:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1197', diff saved to https://phabricator.wikimedia.org/P51249 and previous config saved to /var/cache/conftool/dbconfig/20230824-084304-ladsgroup.json [08:50:37] (03CR) 10Ladsgroup: [C: 03+1] Disable things that don't work on k8s when on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951049 (owner: 10Giuseppe Lavagetto) [08:50:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P51250 and previous config saved to /var/cache/conftool/dbconfig/20230824-085044-ladsgroup.json [08:51:37] (03CR) 10Btullis: [V: 03+1] Start Blazegraph from systemd unit, without runBlazegraph.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [08:52:55] (03PS1) 10Kamila Součková: benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) [08:53:19] (03CR) 10CI reject: [V: 04-1] benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [08:55:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025 (T344589)', diff saved to https://phabricator.wikimedia.org/P51251 and previous config saved to /var/cache/conftool/dbconfig/20230824-085551-ladsgroup.json [08:56:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P51252 and previous config saved to /var/cache/conftool/dbconfig/20230824-085602-ladsgroup.json [08:56:08] (03PS2) 10Kamila Součková: benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) [08:56:14] (03CR) 10Btullis: [V: 03+1] Start Blazegraph from systemd unit, without runBlazegraph.sh (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [08:56:26] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1026 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:56:31] (03CR) 10CI reject: [V: 04-1] benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [08:58:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:58:08] (03PS3) 10Kamila Součková: benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) [08:58:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T344589)', diff saved to https://phabricator.wikimedia.org/P51253 and previous config saved to /var/cache/conftool/dbconfig/20230824-085810-ladsgroup.json [08:58:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance [08:58:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance [08:58:30] (03CR) 10jenkins-bot: benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [08:58:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1222 (T344589)', diff saved to 
https://phabricator.wikimedia.org/P51254 and previous config saved to /var/cache/conftool/dbconfig/20230824-085834-ladsgroup.json [09:00:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [09:00:18] (03PS4) 10Kamila Součková: benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) [09:00:50] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1130 days) https://wikitech.wikimedia.org/wiki/Logs [09:03:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:03:42] (03PS1) 10Slyngshede: C:bigtop::hadoop Fix script path [puppet] - 10https://gerrit.wikimedia.org/r/952125 (https://phabricator.wikimedia.org/T254480) [09:04:44] RECOVERY - Check systemd state on kubernetes1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:04] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43003/console" [puppet] - 10https://gerrit.wikimedia.org/r/952125 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:05:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P51255 and previous config saved to /var/cache/conftool/dbconfig/20230824-090550-ladsgroup.json [09:05:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - 
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [09:05:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T344589)', diff saved to https://phabricator.wikimedia.org/P51256 and previous config saved to /var/cache/conftool/dbconfig/20230824-090559-ladsgroup.json [09:06:56] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:bigtop::hadoop Fix script path [puppet] - 10https://gerrit.wikimedia.org/r/952125 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:08:59] (03CR) 10Btullis: [V: 03+1] "Looks good. I'm happy to +1 after we sort out the duplication (or decide to ignore it)." [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [09:10:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025', diff saved to https://phabricator.wikimedia.org/P51257 and previous config saved to /var/cache/conftool/dbconfig/20230824-091057-ladsgroup.json [09:10:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [09:11:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P51258 and previous config saved to /var/cache/conftool/dbconfig/20230824-091108-ladsgroup.json [09:13:17] (03PS1) 10Jbond: Revert "C:bigtop::hadoop move net-topology.py to files." [puppet] - 10https://gerrit.wikimedia.org/r/952129 [09:13:33] (03CR) 10CI reject: [V: 04-1] Revert "C:bigtop::hadoop move net-topology.py to files." 
[puppet] - 10https://gerrit.wikimedia.org/r/952129 (owner: 10Jbond) [09:13:35] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43004/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [09:15:21] (03Abandoned) 10Jbond: Revert "C:bigtop::hadoop move net-topology.py to files." [puppet] - 10https://gerrit.wikimedia.org/r/952129 (owner: 10Jbond) [09:15:37] slyngs: fyi your patch is causing https://puppetboard.wikimedia.org/nodes?status=failed, working on fix now (cc btullis) [09:16:04] jbond: Thank you. [09:17:39] I already fixed it [09:17:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T343718)', diff saved to https://phabricator.wikimedia.org/P51259 and previous config saved to /var/cache/conftool/dbconfig/20230824-091741-ladsgroup.json [09:17:47] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:17:50] jbond: https://gerrit.wikimedia.org/r/c/operations/puppet/+/952125/1/modules/bigtop/manifests/hadoop.pp [09:18:20] that explains why I can't see the problem [09:18:43] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10Bengali-Sites, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SCherukuwada) We're meeting with them in the next couple of weeks to troubleshoot our scraping problems. Will report back once w... 
[09:19:20] To be fair neither could I, so I had to compare it to the beeline patch I did earlier [09:19:22] slyngs: good to run the following once you send a fix (running now) [09:19:23] sudo cumin -p0 -b 40 '*' 'run-puppet-agent --failed-only -q' [09:19:28] (WidespreadPuppetFailure) resolved: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:20:06] Oh that's a lot of hosts [09:20:15] just needed to wait 5 more mins for the recovery :) [09:20:32] yes but it is a no-op unless puppet has failed [09:20:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T344589)', diff saved to https://phabricator.wikimedia.org/P51260 and previous config saved to /var/cache/conftool/dbconfig/20230824-092056-ladsgroup.json [09:21:00] in this case you could have used C:bigtop::hadoop to limit things [09:21:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [09:21:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P51261 and previous config saved to /var/cache/conftool/dbconfig/20230824-092105-ladsgroup.json [09:21:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [09:21:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T344589)', diff saved to https://phabricator.wikimedia.org/P51262 and previous config saved to /var/cache/conftool/dbconfig/20230824-092122-ladsgroup.json [09:21:30] Oh, yeah, that would have been faster... 
I'll just let it run [09:21:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51263 and previous config saved to /var/cache/conftool/dbconfig/20230824-092147-ladsgroup.json [09:21:49] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe) [09:23:40] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Add recording rules for istio traffic on k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [09:25:51] (03PS5) 10Klausman: prometheus: Add recording rules for istio traffic on k8s [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) [09:25:57] (03CR) 10Klausman: prometheus: Add recording rules for istio traffic on k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [09:26:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025', diff saved to https://phabricator.wikimedia.org/P51264 and previous config saved to /var/cache/conftool/dbconfig/20230824-092603-ladsgroup.json [09:26:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T343718)', diff saved to https://phabricator.wikimedia.org/P51265 and previous config saved to /var/cache/conftool/dbconfig/20230824-092614-ladsgroup.json [09:26:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [09:26:19] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:26:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [09:26:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Depooling db1191 (T343718)', diff saved to https://phabricator.wikimedia.org/P51266 and previous config saved to /var/cache/conftool/dbconfig/20230824-092636-ladsgroup.json [09:26:50] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1026 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:27:42] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Add recording rules for istio traffic on k8s [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [09:27:47] btullis: Sorry about that, should be all good now [09:28:12] All good, thanks <3 [09:28:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T343718)', diff saved to https://phabricator.wikimedia.org/P51267 and previous config saved to /var/cache/conftool/dbconfig/20230824-092846-ladsgroup.json [09:28:48] (03CR) 10Muehlenhoff: [C: 03+2] aptrepo: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/951155 (owner: 10Muehlenhoff) [09:30:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T344589)', diff saved to https://phabricator.wikimedia.org/P51268 and previous config saved to /var/cache/conftool/dbconfig/20230824-093008-ladsgroup.json [09:32:09] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host karapace1001.eqiad.wmnet [09:32:20] (03PS6) 10Klausman: prometheus: Add recording rules for istio traffic on k8s [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) [09:32:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P51269 and previous config saved to /var/cache/conftool/dbconfig/20230824-093247-ladsgroup.json [09:33:23] (03PS1) 10Clément Goubert: envoy: Add concurrency control to envoy cmdline 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) [09:35:13] (03PS2) 10Clément Goubert: envoy: Add concurrency control to envoy cmdline [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) [09:35:33] (03CR) 10Filippo Giunchedi: "Nicely done! LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [09:36:10] PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:36:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P51270 and previous config saved to /var/cache/conftool/dbconfig/20230824-093611-ladsgroup.json [09:36:22] (03PS3) 10Clément Goubert: envoy: Add concurrency control to envoy cmdline [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) [09:36:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host karapace1001.eqiad.wmnet [09:36:32] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet [09:37:39] (03PS1) 10Mvolz: Update Zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/952149 (https://phabricator.wikimedia.org/T118773) [09:40:23] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet [09:41:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025 (T344589)', diff saved to https://phabricator.wikimedia.org/P51271 and previous config saved to /var/cache/conftool/dbconfig/20230824-094109-ladsgroup.json [09:41:55] !log 
vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet [09:42:43] (03CR) 10Ayounsi: [C: 03+2] Puppet: remove all mentions of knams [puppet] - 10https://gerrit.wikimedia.org/r/952046 (https://phabricator.wikimedia.org/T344579) (owner: 10Ayounsi) [09:42:43] !log removed stretch-wikimedia from apt.wikimedia.org (obsolete) [09:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P51272 and previous config saved to /var/cache/conftool/dbconfig/20230824-094352-ladsgroup.json [09:44:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] envoy: Add concurrency control to envoy cmdline (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [09:45:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P51273 and previous config saved to /var/cache/conftool/dbconfig/20230824-094515-ladsgroup.json [09:45:27] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host karapace1002.eqiad.wmnet [09:45:41] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test2001.codfw.wmnet [09:45:43] (03CR) 10Alexandros Kosiaris: envoy: Add concurrency control to envoy cmdline (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [09:45:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] envoy: Add concurrency control to envoy cmdline [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [09:45:50] (03CR) 10Klausman: [C: 03+2] prometheus: Add 
recording rules for istio traffic on k8s [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [09:46:32] (03PS1) 10Muehlenhoff: profile::environment: Simplify environment variable export [puppet] - 10https://gerrit.wikimedia.org/r/952150 [09:47:07] (03CR) 10Clément Goubert: envoy: Add concurrency control to envoy cmdline (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [09:47:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P51274 and previous config saved to /var/cache/conftool/dbconfig/20230824-094753-ladsgroup.json [09:47:57] (03PS1) 10Filippo Giunchedi: hieradata: add jaeger collector to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/952151 (https://phabricator.wikimedia.org/T344253) [09:49:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host karapace1002.eqiad.wmnet [09:51:15] (03PS1) 10JMeybohm: jeager: Add networkpolicy support to es-index-cleaner [deployment-charts] - 10https://gerrit.wikimedia.org/r/952152 (https://phabricator.wikimedia.org/T344253) [09:51:17] (03PS1) 10JMeybohm: jeager: Disable creation of service accounts, add networkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/952153 (https://phabricator.wikimedia.org/T344253) [09:51:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T344589)', diff saved to https://phabricator.wikimedia.org/P51275 and previous config saved to /var/cache/conftool/dbconfig/20230824-095117-ladsgroup.json [09:51:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [09:51:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db1225.eqiad.wmnet with reason: Maintenance [09:52:01] !log reboot lvs1020 to apply patch (T344587) [09:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:11] (03CR) 10CI reject: [V: 04-1] jeager: Disable creation of service accounts, add networkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/952153 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [09:52:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952150 (owner: 10Muehlenhoff) [09:53:04] RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:41] (03PS1) 10Muehlenhoff: Remove stretch Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/952154 [09:53:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet [09:54:15] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1020.eqiad.wmnet [09:56:31] (03CR) 10Muehlenhoff: [C: 03+2] Remove stretch Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/952154 (owner: 10Muehlenhoff) [09:56:41] (03CR) 10Filippo Giunchedi: [C: 03+1] jeager: Add networkpolicy support to es-index-cleaner [deployment-charts] - 10https://gerrit.wikimedia.org/r/952152 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [09:57:21] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1020.eqiad.wmnet [09:57:25] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1001.eqiad.wmnet [09:57:39] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:58:10] (03PS1) 10Effie Mouzeli: thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/952156 
(https://phabricator.wikimedia.org/T343987) [09:58:14] (03PS5) 10Kamila Součková: benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) [09:58:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P51276 and previous config saved to /var/cache/conftool/dbconfig/20230824-095858-ladsgroup.json [09:59:07] (03PS2) 10JMeybohm: jeager: Add networkpolicy support to es-index-cleaner [deployment-charts] - 10https://gerrit.wikimedia.org/r/952152 (https://phabricator.wikimedia.org/T344253) [09:59:09] (03PS2) 10JMeybohm: jeager: Disable creation of service accounts, add networkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/952153 (https://phabricator.wikimedia.org/T344253) [09:59:45] (03CR) 10Kamila Součková: "Thank you Filippo!" [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [10:00:05] mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1000). 
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1000) [10:00:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P51277 and previous config saved to /var/cache/conftool/dbconfig/20230824-100021-ladsgroup.json [10:00:58] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host pybal-test2003.codfw.wmnet [10:01:51] (03CR) 10Mvolz: [C: 03+2] Update Zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/952149 (https://phabricator.wikimedia.org/T118773) (owner: 10Mvolz) [10:02:02] !log end reboot of lvs1020 (pybal service enabled) (T344587) [10:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:32] (03Merged) 10jenkins-bot: Update Zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/952149 (https://phabricator.wikimedia.org/T118773) (owner: 10Mvolz) [10:02:39] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on acmechief1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:03:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T343718)', diff saved to https://phabricator.wikimedia.org/P51278 and previous config saved to /var/cache/conftool/dbconfig/20230824-100259-ladsgroup.json [10:03:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [10:03:04] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [10:03:13] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:03:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [10:03:22] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Depooling db2150 (T343718)', diff saved to https://phabricator.wikimedia.org/P51279 and previous config saved to /var/cache/conftool/dbconfig/20230824-100321-ladsgroup.json [10:03:33] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [10:03:53] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [10:04:29] (03CR) 10Filippo Giunchedi: [C: 03+1] jeager: Disable creation of service accounts, add networkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/952153 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [10:04:42] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2003.codfw.wmnet [10:05:40] (03PS2) 10Effie Mouzeli: thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/952156 (https://phabricator.wikimedia.org/T343987) [10:05:55] (03PS1) 10Muehlenhoff: Update various comments [puppet] - 10https://gerrit.wikimedia.org/r/952157 [10:06:13] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [10:06:43] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [10:07:32] (03CR) 10JMeybohm: [C: 03+2] jeager: Add networkpolicy support to es-index-cleaner [deployment-charts] - 10https://gerrit.wikimedia.org/r/952152 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [10:07:42] (03CR) 10JMeybohm: [C: 03+2] jeager: Disable creation of service accounts, add networkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/952153 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [10:07:49] (03CR) 10Filippo Giunchedi: [C: 03+1] thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/952156 (https://phabricator.wikimedia.org/T343987) (owner: 10Effie Mouzeli) [10:08:15] (03Merged) 10jenkins-bot: jeager: Add 
networkpolicy support to es-index-cleaner [deployment-charts] - 10https://gerrit.wikimedia.org/r/952152 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [10:08:21] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [10:08:22] (03Merged) 10jenkins-bot: jeager: Disable creation of service accounts, add networkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/952153 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [10:08:53] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [10:08:56] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [10:09:34] (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC says yes https://puppet-compiler.wmflabs.org/output/952121/43005/centrallog1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [10:11:35] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/951867 (owner: 10PipelineBot) [10:12:18] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/951867 (owner: 10PipelineBot) [10:12:50] (03CR) 10Filippo Giunchedi: [C: 03+1] Update various comments [puppet] - 10https://gerrit.wikimedia.org/r/952157 (owner: 10Muehlenhoff) [10:13:29] (03CR) 10Kamila Součková: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43007/console" [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [10:14:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T343718)', diff saved to https://phabricator.wikimedia.org/P51280 and previous config saved to 
/var/cache/conftool/dbconfig/20230824-101405-ladsgroup.json [10:14:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [10:14:11] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:14:31] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [10:14:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [10:14:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T343718)', diff saved to https://phabricator.wikimedia.org/P51281 and previous config saved to /var/cache/conftool/dbconfig/20230824-101437-ladsgroup.json [10:14:50] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add bandaid alert for prometheus not reloading its k8s certs [alerts] - 10https://gerrit.wikimedia.org/r/951526 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [10:14:52] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [10:15:16] !log Disable puppet on thanos-fe (eqiad), rollout cfssl on thanos-fe in codfw [10:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T344589)', diff saved to https://phabricator.wikimedia.org/P51282 and previous config saved to /var/cache/conftool/dbconfig/20230824-101527-ladsgroup.json [10:15:44] (03CR) 10Effie Mouzeli: [C: 03+2] thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/952156 (https://phabricator.wikimedia.org/T343987) (owner: 10Effie Mouzeli) [10:16:06] (03PS1) 10Clément Goubert: mesh: Add concurrency control for envoy workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/952158 (https://phabricator.wikimedia.org/T344814) [10:16:09] (03PS1) 10Clément Goubert: mediawiki: Remove limits for tls-proxy 
container [deployment-charts] - 10https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814) [10:16:15] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [10:16:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T343718)', diff saved to https://phabricator.wikimedia.org/P51283 and previous config saved to /var/cache/conftool/dbconfig/20230824-101647-ladsgroup.json [10:16:50] (03CR) 10CI reject: [V: 04-1] mediawiki: Remove limits for tls-proxy container [deployment-charts] - 10https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:17:21] (03CR) 10Muehlenhoff: [C: 03+2] Update various comments [puppet] - 10https://gerrit.wikimedia.org/r/952157 (owner: 10Muehlenhoff) [10:17:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall) [10:17:53] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [10:18:36] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [10:18:40] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:20] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [10:20:06] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:09] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [10:21:04] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [10:21:48] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [10:22:11] !log pool 
kartotherian on codfw [10:22:13] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [10:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:39] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [10:23:00] (03PS1) 10Btullis: Increase the kafka-jumbo maximum message size to 10 MB [puppet] - 10https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) [10:24:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952003 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [10:24:43] (03PS1) 10Muehlenhoff: haproxy: Simplify systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/952161 [10:25:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:25:31] (03CR) 10Jbond: [C: 03+1] Make nftables::service types more compatible (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:26:17] (03PS1) 10Muehlenhoff: statsite: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/952162 [10:26:22] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: 10Btullis) [10:28:07] (03PS2) 10Clément Goubert: mesh: Add concurrency control for envoy workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/952158 (https://phabricator.wikimedia.org/T344814) [10:28:09] (03PS2) 10Clément Goubert: mediawiki: Remove limits for tls-proxy container [deployment-charts] - 10https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814) [10:28:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51284 and previous config 
saved to /var/cache/conftool/dbconfig/20230824-102848-ladsgroup.json [10:29:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:31:13] (03CR) 10Btullis: "I wonder about whether we need to notify any kafka-jumbo clients about the increase in maximum message size." [puppet] - 10https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: 10Btullis) [10:31:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P51285 and previous config saved to /var/cache/conftool/dbconfig/20230824-103153-ladsgroup.json [10:32:46] !log stopping pybal and rebooting lvs1019 (T344587) [10:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:53] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:19] (03CR) 10Btullis: "Do we need to apply this change in deployment-prep as well?" 
[puppet] - 10https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: 10Btullis) [10:34:59] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [10:39:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T343718)', diff saved to https://phabricator.wikimedia.org/P51286 and previous config saved to /var/cache/conftool/dbconfig/20230824-103948-ladsgroup.json [10:39:54] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:42:41] (03CR) 10Kamila Součková: [V: 03+1 C: 03+2] benthos: add instance for calculating MW latencies [puppet] - 10https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [10:43:53] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [10:43:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P51287 and previous config saved to /var/cache/conftool/dbconfig/20230824-104354-ladsgroup.json [10:47:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P51288 and previous config saved to /var/cache/conftool/dbconfig/20230824-104659-ladsgroup.json [10:48:44] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [10:49:16] PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:49:48] ^^ expected [10:49:54] this is 
me [10:51:41] (03PS1) 10Kamila Součková: benthos: fix missing quotes in config file [puppet] - 10https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095) [10:51:46] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal [10:53:53] (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:54:00] 10SRE, 10Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) >>! In T344547#9108360, @ayounsi wrote: > Some downsides I can think off: additional config, more complex to troubleshot (more prefixes in the routing t... [10:54:31] (03PS2) 10Kamila Součková: benthos: fix missing quotes in config file [puppet] - 10https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095) [10:54:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I can't count the number of times this has bit me..." 
[puppet] - 10https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [10:54:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P51289 and previous config saved to /var/cache/conftool/dbconfig/20230824-105454-ladsgroup.json [10:55:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] benthos: fix missing quotes in config file [puppet] - 10https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [10:55:48] (03CR) 10Kamila Součková: [C: 03+2] benthos: fix missing quotes in config file [puppet] - 10https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [10:56:23] (03CR) 10Kamila Součková: [V: 03+2 C: 03+2] benthos: fix missing quotes in config file [puppet] - 10https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [10:58:00] (03CR) 10Kamila Součková: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43008/console" [puppet] - 10https://gerrit.wikimedia.org/r/952164 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [10:59:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P51290 and previous config saved to /var/cache/conftool/dbconfig/20230824-105900-ladsgroup.json [11:02:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T343718)', diff saved to https://phabricator.wikimedia.org/P51291 and previous config saved to /var/cache/conftool/dbconfig/20230824-110206-ladsgroup.json [11:02:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [11:02:12] T343718: Drop old columns of externallinks - 
https://phabricator.wikimedia.org/T343718 [11:02:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [11:02:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T343718)', diff saved to https://phabricator.wikimedia.org/P51292 and previous config saved to /var/cache/conftool/dbconfig/20230824-110226-ladsgroup.json [11:02:29] kamila_ _joe_ I was convinced CI would validate yaml in /files/ by itself, clearly I was misremembering [11:02:41] lunch, bbl [11:02:57] apparently not :D [11:03:30] (that's a config file that happens to be yaml, not a puppet yaml file though... should it validate in that case?) [11:03:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:03:53] (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:04:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:05:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T343718)', diff saved to https://phabricator.wikimedia.org/P51293 and previous config saved to /var/cache/conftool/dbconfig/20230824-110537-ladsgroup.json [11:09:44] !log fabfur@cumin1001 START - Cookbook 
sre.hosts.reboot-single for host lvs1019.eqiad.wmnet [11:10:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P51294 and previous config saved to /var/cache/conftool/dbconfig/20230824-111001-ladsgroup.json [11:10:03] (RedisMemoryFull) firing: (6) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [11:12:49] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1019.eqiad.wmnet [11:12:52] PROBLEM - Host lvs1019 is DOWN: PING CRITICAL - Packet loss = 100% [11:13:00] PROBLEM - Webrequests Varnishkafka log producer on cp3074 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [11:13:02] RECOVERY - Host lvs1019 is UP: PING OK - Packet loss = 0%, RTA = 1.47 ms [11:13:04] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [11:13:34] PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [11:14:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T344589)', diff saved to https://phabricator.wikimedia.org/P51295 and previous config saved to /var/cache/conftool/dbconfig/20230824-111407-ladsgroup.json [11:14:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [11:14:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [11:14:28] RECOVERY - PyBal backends health 
check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:14:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T344589)', diff saved to https://phabricator.wikimedia.org/P51296 and previous config saved to /var/cache/conftool/dbconfig/20230824-111432-ladsgroup.json [11:15:00] RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [11:15:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [11:16:02] !log lvs1019 up and running (T344587) [11:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:31] (03PS3) 10Clément Goubert: mediawiki: Remove limits for tls-proxy container [deployment-charts] - 10https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814) [11:17:52] (03PS1) 10Clément Goubert: mediawiki: Generalize tls-proxy limits removal [deployment-charts] - 10https://gerrit.wikimedia.org/r/952171 (https://phabricator.wikimedia.org/T344814) [11:18:54] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 81 connections established with conf1007.eqiad.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal [11:20:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952150 (owner: 10Muehlenhoff) [11:20:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P51297 and previous config saved to /var/cache/conftool/dbconfig/20230824-112043-ladsgroup.json [11:20:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T344589)', diff saved to https://phabricator.wikimedia.org/P51298 and previous config saved to /var/cache/conftool/dbconfig/20230824-112052-ladsgroup.json [11:23:22] 
varnishkafka-webrequest service is stopped on cp3074, is it something someone is working on? [11:23:58] (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:25:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T343718)', diff saved to https://phabricator.wikimedia.org/P51299 and previous config saved to /var/cache/conftool/dbconfig/20230824-112507-ladsgroup.json [11:25:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [11:25:13] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:25:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [11:25:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:25:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:25:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T343718)', diff saved to https://phabricator.wikimedia.org/P51300 and previous config saved to /var/cache/conftool/dbconfig/20230824-112532-ladsgroup.json [11:26:02] RECOVERY - Webrequests Varnishkafka log producer on cp3074 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [11:26:17] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/951877 [11:28:42] fabfur: Thanks. 
as per #wikimedia-sre I went ahead and started the varnishkafka-webrequest service on cp3074 [11:28:53] (RedisMemoryFull) firing: (6) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [11:29:00] btullis: thank you!! [11:31:35] !log foreachwikiindblist fishbowl extensions/OATHAuth/maintenance/UpdateForMultipleDevicesSupport.php | tee oathauth-multiple-fishbowl.log # T242031 [11:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:40] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [11:32:46] (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [11:35:04] (03PS1) 10Majavah: Set OATHAuth multiple devices WRITE_BOTH for all privates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952184 (https://phabricator.wikimedia.org/T242031) [11:35:08] (03PS1) 10Majavah: Set OATHAuth multiple devices READ_NEW for checkuser, techconduct [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952185 (https://phabricator.wikimedia.org/T242031) [11:35:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P51301 and previous config saved to /var/cache/conftool/dbconfig/20230824-113550-ladsgroup.json [11:35:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P51302 and previous config saved to /var/cache/conftool/dbconfig/20230824-113559-ladsgroup.json [11:42:09] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952161 (owner: 10Muehlenhoff) [11:43:08] (03CR) 10Muehlenhoff: firewall::service: Create an nftables::service when using the nft provider (031 comment) [puppet] 
- 10https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:48:30] (03PS3) 10Muehlenhoff: firewall::service: Create an nftables::service when using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) [11:48:32] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [11:49:11] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [11:50:14] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1614 days) https://wikitech.wikimedia.org/wiki/Logs [11:50:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T343718)', diff saved to https://phabricator.wikimedia.org/P51303 and previous config saved to /var/cache/conftool/dbconfig/20230824-115056-ladsgroup.json [11:50:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:51:01] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:51:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P51304 and previous config saved to /var/cache/conftool/dbconfig/20230824-115105-ladsgroup.json [11:51:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:52:46] (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [11:54:37] (03CR) 10Muehlenhoff: [C: 03+2] Make nftables::service types more compatible [puppet] - 10https://gerrit.wikimedia.org/r/951889 
(https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:56:40] (03PS1) 10Jaime Nuche: doc: rename user for rsyncing docs [puppet] - 10https://gerrit.wikimedia.org/r/952189 [12:00:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [12:00:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [12:02:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T343718)', diff saved to https://phabricator.wikimedia.org/P51305 and previous config saved to /var/cache/conftool/dbconfig/20230824-120218-ladsgroup.json [12:02:24] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:02:50] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_drmrs and A:cp [12:03:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [12:03:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [12:03:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T344589)', diff saved to https://phabricator.wikimedia.org/P51306 and previous config saved to /var/cache/conftool/dbconfig/20230824-120352-ladsgroup.json [12:04:28] (03PS1) 10Clément Goubert: mediawiki: Add egress rules for termbox-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/952191 (https://phabricator.wikimedia.org/T344904) [12:06:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T344589)', diff saved to https://phabricator.wikimedia.org/P51307 and previous config saved to /var/cache/conftool/dbconfig/20230824-120611-ladsgroup.json [12:06:31] !log fabfur@cumin1001 END (PASS) - Cookbook 
sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_drmrs and A:cp
[12:06:34] (PS1) Btullis: Fail over hive services to an-coord1002 [dns] - https://gerrit.wikimedia.org/r/952193 (https://phabricator.wikimedia.org/T344671)
[12:06:36] (PS1) Btullis: Fail back hive to the primary coordinator [dns] - https://gerrit.wikimedia.org/r/952194 (https://phabricator.wikimedia.org/T344671)
[12:06:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[12:06:58] (PS1) Muehlenhoff: os-reports: Remove Stretch, add stub entry for Bullseye (data updates still needed) [puppet] - https://gerrit.wikimedia.org/r/952195
[12:07:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[12:07:43] (CR) Btullis: [C: +2] Fail over hive services to an-coord1002 [dns] - https://gerrit.wikimedia.org/r/952193 (https://phabricator.wikimedia.org/T344671) (owner: Btullis)
[12:08:21] (PS2) Clément Goubert: mediawiki: Add egress rules for termbox-test [deployment-charts] - https://gerrit.wikimedia.org/r/952191 (https://phabricator.wikimedia.org/T344904)
[12:08:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:09:09] (PS1) Muehlenhoff: mediawiki::php: Remove check [puppet] - https://gerrit.wikimedia.org/r/952197
[12:09:17] (CR) CI reject: [V: -1] os-reports: Remove Stretch, add stub entry for Bullseye (data updates still needed) [puppet] - https://gerrit.wikimedia.org/r/952195 (owner: Muehlenhoff)
[12:09:59] (RedisMemoryFull) firing: Redis memory full on rdb1011:16378 -
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16378&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[12:10:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T344589)', diff saved to https://phabricator.wikimedia.org/P51308 and previous config saved to /var/cache/conftool/dbconfig/20230824-121024-ladsgroup.json
[12:10:30] (PS1) Muehlenhoff: etcd: Remove obsolete check [puppet] - https://gerrit.wikimedia.org/r/952198
[12:11:21] (CR) Joal: [C: +1] data-engineering: flink: alert when TM is missing for 5m. [alerts] - https://gerrit.wikimedia.org/r/951959 (https://phabricator.wikimedia.org/T340666) (owner: Gmodena)
[12:11:23] (PS2) Muehlenhoff: os-reports: Remove Stretch, add stub entry for Bullseye [puppet] - https://gerrit.wikimedia.org/r/952195
[12:11:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[12:11:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[12:11:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T344589)', diff saved to https://phabricator.wikimedia.org/P51309 and previous config saved to /var/cache/conftool/dbconfig/20230824-121158-ladsgroup.json
[12:12:52] (CR) Giuseppe Lavagetto: [C: +1] mediawiki: Add egress rules for termbox-test [deployment-charts] - https://gerrit.wikimedia.org/r/952191 (https://phabricator.wikimedia.org/T344904) (owner: Clément Goubert)
[12:13:48] (PS1) Giuseppe Lavagetto: httpd: fix ecs logging event duration format [docker-images/production-images] - https://gerrit.wikimedia.org/r/952199
[12:13:53] (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16378 -
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[12:13:56] jouncebot: nowandnext
[12:13:56] No deployments scheduled for the next 0 hour(s) and 46 minute(s)
[12:13:57] In 0 hour(s) and 46 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300)
[12:13:57] In 0 hour(s) and 46 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300)
[12:14:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:14:21] (CR) Clément Goubert: [C: +2] mediawiki: Add egress rules for termbox-test [deployment-charts] - https://gerrit.wikimedia.org/r/952191 (https://phabricator.wikimedia.org/T344904) (owner: Clément Goubert)
[12:15:21] (Merged) jenkins-bot: mediawiki: Add egress rules for termbox-test [deployment-charts] - https://gerrit.wikimedia.org/r/952191 (https://phabricator.wikimedia.org/T344904) (owner: Clément Goubert)
[12:15:52] (CR) Filippo Giunchedi: [C: +1] statsite: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/952162 (owner: Muehlenhoff)
[12:16:38] !log cgoubert@deploy1002 Started scap: Redeploying mw-on-k8s - T344904
[12:16:43] T344904: Termbox SSR broken on Test Wikidata (since k8s migration?
unclear) - https://phabricator.wikimedia.org/T344904
[12:17:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P51310 and previous config saved to /var/cache/conftool/dbconfig/20230824-121725-ladsgroup.json
[12:18:06] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:18:45] (PS1) Btullis: Disable gobblin and refine jobs temporaily on an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/952200 (https://phabricator.wikimedia.org/T325132)
[12:18:46] !log cgoubert@deploy1002 Finished scap: Redeploying mw-on-k8s - T344904 (duration: 02m 07s)
[12:19:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T344589)', diff saved to https://phabricator.wikimedia.org/P51311 and previous config saved to /var/cache/conftool/dbconfig/20230824-121930-ladsgroup.json
[12:20:42] (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43009/console" [puppet] - https://gerrit.wikimedia.org/r/952200 (https://phabricator.wikimedia.org/T325132) (owner: Btullis)
[12:21:57] (CR) Joal: [C: +1] Disable gobblin and refine jobs temporaily on an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/952200 (https://phabricator.wikimedia.org/T325132) (owner: Btullis)
[12:22:21] (CR) Btullis: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43010/console" [puppet] - https://gerrit.wikimedia.org/r/952200 (https://phabricator.wikimedia.org/T325132) (owner: Btullis)
[12:22:46] (PS1) Ladsgroup: Stop writing to old extlinks columns in s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/952202 (https://phabricator.wikimedia.org/T342683)
[12:23:46] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43011/console" [puppet] - https://gerrit.wikimedia.org/r/952200 (https://phabricator.wikimedia.org/T325132) (owner: Btullis)
[12:24:20] (CR) Filippo Giunchedi: [C: +1] benthos: add instance for calculating MW latencies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[12:24:39] (CR) Btullis: [V: +1 C: +2] Disable gobblin and refine jobs temporaily on an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/952200 (https://phabricator.wikimedia.org/T325132) (owner: Btullis)
[12:25:23] !log disabling puppet and pybal on lvs1020 for reboot (T344587)
[12:25:26] (PS1) Clément Goubert: mw-debug: Use global mw egress [deployment-charts] - https://gerrit.wikimedia.org/r/952203 (https://phabricator.wikimedia.org/T344904)
[12:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P51312 and previous config saved to /var/cache/conftool/dbconfig/20230824-122530-ladsgroup.json
[12:25:45] !log errata corrige: not lvs1020 but lvs1018
[12:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:51] (CR) Kamila Součková: [V: +1 C: +2] benthos: add instance for calculating MW latencies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952121 (https://phabricator.wikimedia.org/T276095) (owner: Kamila Součková)
[12:26:51] (CR) Muehlenhoff: [C: +2] statsite: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/952162 (owner: Muehlenhoff)
[12:26:53] fabfur: s/lvs1020/lvs1018/ is probably better understood here than latin :)
[12:27:14] (CR) Muehlenhoff: [C: +2] os-reports: Remove Stretch, add stub entry for Bullseye [puppet] -
https://gerrit.wikimedia.org/r/952195 (owner: Muehlenhoff)
[12:28:33] (PS2) Clément Goubert: mw-debug: Copy global mw egress [deployment-charts] - https://gerrit.wikimedia.org/r/952203 (https://phabricator.wikimedia.org/T344904)
[12:28:47] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[12:28:53] (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[12:29:59] (RedisMemoryFull) resolved: (2) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[12:30:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: gobblin-webrequest.service,refine_event.service,refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:24] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[12:31:36] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal
[12:32:08] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[12:32:24] (CR) Clément Goubert: [C: +2] mw-debug: Copy global mw egress [deployment-charts] - https://gerrit.wikimedia.org/r/952203 (https://phabricator.wikimedia.org/T344904) (owner: Clément Goubert)
[12:32:32] !log
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P51313 and previous config saved to /var/cache/conftool/dbconfig/20230824-123231-ladsgroup.json
[12:33:20] (Merged) jenkins-bot: mw-debug: Copy global mw egress [deployment-charts] - https://gerrit.wikimedia.org/r/952203 (https://phabricator.wikimedia.org/T344904) (owner: Clément Goubert)
[12:34:23] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[12:34:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P51314 and previous config saved to /var/cache/conftool/dbconfig/20230824-123436-ladsgroup.json
[12:34:43] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[12:34:48] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[12:34:56] (CR) EoghanGaffney: [C: +2] doc: rename user for rsyncing docs [puppet] - https://gerrit.wikimedia.org/r/952189 (owner: Jaime Nuche)
[12:35:45] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[12:39:55] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-coord1001.eqiad.wmnet
[12:40:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P51315 and previous config saved to /var/cache/conftool/dbconfig/20230824-124036-ladsgroup.json
[12:43:41] (CR) JMeybohm: [C: +1] mesh: Add concurrency control for envoy workers [deployment-charts] - https://gerrit.wikimedia.org/r/952158 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[12:43:47] (Not accepting/receiving prefixes from anycast BGP peer) resolved: (2) Device cr1-eqiad.wikimedia.org recovered from Not accepting/receiving prefixes from anycast BGP peer -
https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[12:44:07] (CR) JMeybohm: [C: +1] mediawiki: Remove limits for tls-proxy container [deployment-charts] - https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[12:45:48] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1001.eqiad.wmnet
[12:45:55] (PS1) Marostegui: wmnet: Failover m1-master [dns] - https://gerrit.wikimedia.org/r/952204
[12:47:13] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:47:37] (PS1) JMeybohm: jaeger: Fix label selector for es-index-cleaner job [deployment-charts] - https://gerrit.wikimedia.org/r/952205 (https://phabricator.wikimedia.org/T344253)
[12:47:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T343718)', diff saved to https://phabricator.wikimedia.org/P51316 and previous config saved to /var/cache/conftool/dbconfig/20230824-124737-ladsgroup.json
[12:47:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[12:47:41] (PS1) Btullis: Re-enable gobblin, refine, and other jobs on an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/952206
[12:47:47] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[12:47:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[12:47:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51317 and previous config saved to /var/cache/conftool/dbconfig/20230824-124758-ladsgroup.json
[12:48:46] !log depool kartotherian in eqiad
[12:48:49]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:31] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43012/console" [puppet] - https://gerrit.wikimedia.org/r/952206 (owner: Btullis)
[12:49:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P51318 and previous config saved to /var/cache/conftool/dbconfig/20230824-124942-ladsgroup.json
[12:49:46] (CR) Btullis: [V: +1 C: +2] Re-enable gobblin, refine, and other jobs on an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/952206 (owner: Btullis)
[12:49:48] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad
[12:54:28] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1018.eqiad.wmnet
[12:55:02] (CR) Ayounsi: [C: +1] devices: add doh300[34] to asw1-b*27-esams [homer/public] - https://gerrit.wikimedia.org/r/951581 (https://phabricator.wikimedia.org/T344355) (owner: Ssingh)
[12:55:27] (CR) JMeybohm: [C: +2] jaeger: Fix label selector for es-index-cleaner job [deployment-charts] - https://gerrit.wikimedia.org/r/952205 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[12:55:30] PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:55:40] PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:55:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T344589)', diff saved to https://phabricator.wikimedia.org/P51319 and previous config saved to /var/cache/conftool/dbconfig/20230824-125542-ladsgroup.json
[12:55:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on
db2130.codfw.wmnet with reason: Maintenance
[12:56:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance
[12:56:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T344589)', diff saved to https://phabricator.wikimedia.org/P51320 and previous config saved to /var/cache/conftool/dbconfig/20230824-125607-ladsgroup.json
[12:56:08] (Merged) jenkins-bot: jaeger: Fix label selector for es-index-cleaner job [deployment-charts] - https://gerrit.wikimedia.org/r/952205 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[12:56:16] PROBLEM - Maps HTTPS on maps2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:56:40] PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:56:58] PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:57:02] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:57:09] (CR) Ssingh: [C: +2] devices: add doh300[34] to asw1-b*27-esams [homer/public] - https://gerrit.wikimedia.org/r/951581 (https://phabricator.wikimedia.org/T344355) (owner: Ssingh)
[12:57:14] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:57:44] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1018.eqiad.wmnet
[12:57:46] PROBLEM - Host lvs1018 is DOWN: PING CRITICAL - Packet loss = 100%
[12:57:54] RECOVERY - Host lvs1018 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[12:58:32] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500
Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[12:58:39] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[12:58:47] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[12:59:04] !log running homer "asw1-b*27-esams*" commit "add doh300[34]"
[12:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:14] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[12:59:38] 5XX increased a lot, are we ok?
[12:59:48] yes, it is maps testing
[13:00:00] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:00:02] I will pool back eqiad in a tiny pit
[13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300)
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300).
[13:00:05] No Gerrit patches in the queue for this window AFAICS.
[13:00:05] bit*
[13:00:25] ok, sorry, I didn't have context for where those came from
[13:00:40] RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[13:02:09] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad
[13:02:20] (PS1) BBlack: esams frontend memory: upload floor at least 120 [puppet] - https://gerrit.wikimedia.org/r/952207
[13:02:35] I confirm it is kartotherian only: https://grafana.wikimedia.org/goto/02ZTbEgSz?orgId=1
[13:02:52] RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 7.000 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:03:00] (PS2) Muehlenhoff: Fail over URL downloaders for reboot [dns] - https://gerrit.wikimedia.org/r/952117
[13:03:27] (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[13:03:31] (CR) Marostegui: [C: +2] wmnet: Failover m1-master [dns] - https://gerrit.wikimedia.org/r/952204 (owner: Marostegui)
[13:03:45] !log failover m1-master to dbproxy1022
[13:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:01] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw
[13:04:08] RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 3.913 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:04:10] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 34 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal
[13:04:21] !log jnuche@deploy1002 Started deploy
[releng/jenkins-deploy@c579111] (releasing): (no justification provided)
[13:04:42] !log puppet and pybal reenabled on lvs1018 (T344587)
[13:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T344589)', diff saved to https://phabricator.wikimedia.org/P51321 and previous config saved to /var/cache/conftool/dbconfig/20230824-130446-ladsgroup.json
[13:04:55] (CR) Ssingh: [C: +1] esams frontend memory: upload floor at least 120 [puppet] - https://gerrit.wikimedia.org/r/952207 (owner: BBlack)
[13:04:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T344589)', diff saved to https://phabricator.wikimedia.org/P51322 and previous config saved to /var/cache/conftool/dbconfig/20230824-130455-ladsgroup.json
[13:05:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[13:05:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[13:05:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T344589)', diff saved to https://phabricator.wikimedia.org/P51323 and previous config saved to /var/cache/conftool/dbconfig/20230824-130519-ladsgroup.json
[13:05:41] (CR) BBlack: [C: +2] esams frontend memory: upload floor at least 120 [puppet] - https://gerrit.wikimedia.org/r/952207 (owner: BBlack)
[13:05:49] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@c579111] (releasing): (no justification provided) (duration: 01m 27s)
[13:07:16] PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:07:20] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 7.899 second response time
https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:08:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[13:08:06] RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 0.363 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:08:20] !log cp3074: restart varnish frontend (changing malloc storage from https://gerrit.wikimedia.org/r/c/operations/puppet/+/952207/ )
[13:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:24] RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:08:27] (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[13:08:28] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:08:32] RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:09:08] RECOVERY - Maps HTTPS on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:10:38] (PS1) Stevemunene: datahub: set preferred oidc jwt algotithm [deployment-charts] -
https://gerrit.wikimedia.org/r/952208 (https://phabricator.wikimedia.org/T305874)
[13:11:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T344589)', diff saved to https://phabricator.wikimedia.org/P51324 and previous config saved to /var/cache/conftool/dbconfig/20230824-131117-ladsgroup.json
[13:11:26] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@c579111] (releasing): (no justification provided)
[13:11:42] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqsin%20prometheus/ops&var-cluster=upload&var-origin=kartotherian.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHi
[13:11:47] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@c579111] (releasing): (no justification provided) (duration: 00m 21s)
[13:12:07] acked. Looking
[13:13:28] (PS1) Andrew Bogott: Disable user creation on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/952209
[13:13:30] effie ^
[13:13:51] (PS3) Jbond: puppetdb-api-microservice: redact one the puppetdb side [puppet] - https://gerrit.wikimedia.org/r/951965 (https://phabricator.wikimedia.org/T342458)
[13:14:03] eoghan: I think that's the result of depooling eqiad to make sure that codfw could sustain the entire load (it apparently didn't)
[13:14:05] akosiaris: it will recover
[13:14:06] (CR) CI reject: [V: -1] Disable user creation on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/952209 (owner: Andrew Bogott)
[13:14:19] akosiaris: Good to know!
[13:14:21] It's recovered anyway.
[13:14:26] (CR) Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP (3 comments) [puppet] - https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney)
[13:14:28] eoghan: give it some time, sadly we are digging into fixing some things
[13:14:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:14:39] (PS3) Muehlenhoff: Fail over URL downloaders for reboot [dns] - https://gerrit.wikimedia.org/r/952117
[13:14:44] effie: No problem! Good luck. Let us know if we can help
[13:15:23] (PS1) Jbond: update color scheme [puppet] - https://gerrit.wikimedia.org/r/952210
[13:15:46] (PS2) Jbond: update color scheme [puppet] - https://gerrit.wikimedia.org/r/952210
[13:15:50] (CR) Jbond: [V: +2 C: +2] update color scheme [puppet] - https://gerrit.wikimedia.org/r/952210 (owner: Jbond)
[13:15:52] (CR) Muehlenhoff: Disable user creation on wikitech (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/952209 (owner: Andrew Bogott)
[13:16:42] (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqsin%20prometheus/ops&var-cluster=upload&var-origin=kartotherian.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrors
[13:18:08] (CR) Ssingh: [C: +1] "Thanks!"
[puppet] - https://gerrit.wikimedia.org/r/952161 (owner: Muehlenhoff)
[13:19:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:19:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P51325 and previous config saved to /var/cache/conftool/dbconfig/20230824-131952-ladsgroup.json
[13:20:12] (CR) Vgutierrez: [C: +1] "looks good, please take into account that data persistence and cloud services are also big users of HAProxy here." [puppet] - https://gerrit.wikimedia.org/r/952161 (owner: Muehlenhoff)
[13:22:03] (PS2) Andrew Bogott: Disable user creation on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/952209
[13:22:10] (PS2) Btullis: Fail back hive to the primary coordinator [dns] - https://gerrit.wikimedia.org/r/952194 (https://phabricator.wikimedia.org/T344671)
[13:23:05] !log disabling puppet and pybal on lvs1017 for reboot (T344587)
[13:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:20] (PS1) JMeybohm: jaeger: Run index cleaner daily not hourly [deployment-charts] - https://gerrit.wikimedia.org/r/952211 (https://phabricator.wikimedia.org/T344253)
[13:24:27] (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[13:25:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51326 and
previous config saved to /var/cache/conftool/dbconfig/20230824-132504-ladsgroup.json [13:25:09] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:26:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P51327 and previous config saved to /var/cache/conftool/dbconfig/20230824-132623-ladsgroup.json [13:26:32] (03PS1) 10JMeybohm: jeager: Rename default release from jaeger to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/952212 (https://phabricator.wikimedia.org/T344253) [13:26:42] (03CR) 10Filippo Giunchedi: [C: 03+1] "Agreed! Nicely spotted" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952211 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [13:26:49] (03PS4) 10Jbond: puppetdb-api-microservice: redact one the puppetdb side [puppet] - 10https://gerrit.wikimedia.org/r/951965 (https://phabricator.wikimedia.org/T342458) [13:27:14] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:27:24] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [13:27:52] PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:28:27] (03CR) 10Filippo Giunchedi: [C: 03+1] jeager: Rename default release from jaeger to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/952212 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [13:28:40] (03CR) 10JMeybohm: [C: 
03+2] jaeger: Run index cleaner daily not hourly [deployment-charts] - 10https://gerrit.wikimedia.org/r/952211 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [13:28:44] (03CR) 10JMeybohm: [C: 03+2] jeager: Rename default release from jaeger to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/952212 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [13:29:08] (03CR) 10CI reject: [V: 04-1] puppetdb-api-microservice: redact one the puppetdb side [puppet] - 10https://gerrit.wikimedia.org/r/951965 (https://phabricator.wikimedia.org/T342458) (owner: 10Jbond) [13:29:10] (03CR) 10Muehlenhoff: [C: 03+2] Fail over URL downloaders for reboot [dns] - 10https://gerrit.wikimedia.org/r/952117 (owner: 10Muehlenhoff) [13:29:27] (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [13:29:37] (03Merged) 10jenkins-bot: jaeger: Run index cleaner daily not hourly [deployment-charts] - 10https://gerrit.wikimedia.org/r/952211 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [13:29:44] (03Merged) 10jenkins-bot: jeager: Rename default release from jaeger to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/952212 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [13:31:15] (03Abandoned) 10Jbond: (WIP) puppetdb-microservice: update puppetdb micro service so it streams data [puppet] - 10https://gerrit.wikimedia.org/r/940403 (https://phabricator.wikimedia.org/T342458) (owner: 10Jbond) [13:32:46] (03PS5) 10Jbond: puppetdb-api-microservice: redact one the puppetdb side [puppet] - 10https://gerrit.wikimedia.org/r/951965 (https://phabricator.wikimedia.org/T342458) [13:33:10] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START 
helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:33:44] (03CR) 10Ssingh: wmf-config: remove public subnets from reverse-proxy.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) (owner: 10Ssingh) [13:34:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P51328 and previous config saved to /var/cache/conftool/dbconfig/20230824-133458-ladsgroup.json [13:35:07] (03CR) 10Jbond: [C: 03+2] puppetdb-api-microservice: redact one the puppetdb side [puppet] - 10https://gerrit.wikimedia.org/r/951965 (https://phabricator.wikimedia.org/T342458) (owner: 10Jbond) [13:35:54] (03PS2) 10Ssingh: wmf-config: remove public subnets from reverse-proxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) [13:36:40] (03PS1) 10Marostegui: wmnet: Failover m2-master [dns] - 10https://gerrit.wikimedia.org/r/952213 [13:37:46] !log failover m2-master to dbproxy1023 [13:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:50] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m2-master [dns] - 10https://gerrit.wikimedia.org/r/952213 (owner: 10Marostegui) [13:39:01] (03CR) 10Muehlenhoff: [C: 03+2] firewall::service: Create an nftables::service when using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/952051 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:40:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P51329 and previous config saved to /var/cache/conftool/dbconfig/20230824-134010-ladsgroup.json [13:41:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P51330 and previous config saved to 
/var/cache/conftool/dbconfig/20230824-134129-ladsgroup.json [13:41:31] (03CR) 10Muehlenhoff: [C: 03+2] profile::environment: Simplify environment variable export [puppet] - 10https://gerrit.wikimedia.org/r/952150 (owner: 10Muehlenhoff) [13:42:16] (03PS1) 10JMeybohm: jeager: Fix secret name (generated by Certificate objects) [deployment-charts] - 10https://gerrit.wikimedia.org/r/952214 (https://phabricator.wikimedia.org/T344253) [13:42:31] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [13:42:33] (03Abandoned) 10Muehlenhoff: Adapt monitoring/metrics rules for nft and ferm providers [puppet] - 10https://gerrit.wikimedia.org/r/951512 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:42:42] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [13:42:53] <_joe_> jouncebot: nowandnext [13:42:53] For the next 0 hour(s) and 17 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300) [13:42:54] For the next 0 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300) [13:42:54] In 2 hour(s) and 17 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1600) [13:43:17] (03PS2) 10Muehlenhoff: firewall::service: Replace whitespace in resource title with underscores [puppet] - 10https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) [13:43:49] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [13:43:50] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:43:55] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1017.eqiad.wmnet [13:43:56] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [13:44:18] (03PS1) 10Btullis: query_service: let puppet 
manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/952215 (https://phabricator.wikimedia.org/T343856) [13:44:39] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952215 (https://phabricator.wikimedia.org/T343856) (owner: 10Btullis) [13:44:48] (03CR) 10Filippo Giunchedi: [C: 03+1] jeager: Fix secret name (generated by Certificate objects) [deployment-charts] - 10https://gerrit.wikimedia.org/r/952214 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [13:45:51] (03PS9) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) [13:45:53] (03CR) 10JMeybohm: [C: 03+2] jeager: Fix secret name (generated by Certificate objects) [deployment-charts] - 10https://gerrit.wikimedia.org/r/952214 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [13:46:02] (03PS3) 10Muehlenhoff: firewall::service: Replace whitespace in resource title with underscores [puppet] - 10https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) [13:46:10] (03CR) 10Muehlenhoff: firewall::service: Replace whitespace in resource title with underscores (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:46:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:46:27] (03CR) 10CI reject: [V: 04-1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [13:46:28] !log cp3075: restart varnish frontend (changing malloc storage from https://gerrit.wikimedia.org/r/c/operations/puppet/+/952207/ ) [13:46:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:37] (03Merged) 10jenkins-bot: jeager: Fix secret name (generated by Certificate objects) [deployment-charts] - 10https://gerrit.wikimedia.org/r/952214 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [13:47:01] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1017.eqiad.wmnet [13:47:24] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [13:47:34] (03PS1) 10Jbond: puppetdb-api-microservice: need to convert current query to json [puppet] - 10https://gerrit.wikimedia.org/r/952216 (https://phabricator.wikimedia.org/T342458) [13:47:39] (03PS2) 10Btullis: query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/952215 (https://phabricator.wikimedia.org/T343856) [13:47:52] RECOVERY - pybal on lvs1017 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:47:54] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [13:48:09] !log enabled puppet and pybal on lvs1017 (T344587) [13:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:13] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952215 (https://phabricator.wikimedia.org/T343856) (owner: 10Btullis) [13:48:24] (03CR) 10Jbond: [C: 03+2] puppetdb-api-microservice: need to convert current query to json [puppet] - 10https://gerrit.wikimedia.org/r/952216 (https://phabricator.wikimedia.org/T342458) (owner: 10Jbond) [13:48:48] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:48:54] RECOVERY - PyBal connections to etcd on 
lvs1017 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:50:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T344589)', diff saved to https://phabricator.wikimedia.org/P51331 and previous config saved to /var/cache/conftool/dbconfig/20230824-135004-ladsgroup.json [13:50:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [13:50:19] 10SRE, 10ops-eqiad, 10DBA: db1178 didn't come back online after reboot - https://phabricator.wikimedia.org/T344880 (10Jclark-ctr) a:03Jclark-ctr [13:50:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [13:50:39] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:50:40] (03CR) 10Jbond: [C: 03+1] firewall::service: Replace whitespace in resource title with underscores [puppet] - 10https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:51:20] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:52:04] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952215 (https://phabricator.wikimedia.org/T343856) (owner: 10Btullis) [13:53:31] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [13:53:41] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [13:53:58] 10SRE, 10ops-eqiad, 10DBA: db1178 didn't come back online after reboot - https://phabricator.wikimedia.org/T344880 (10Jclark-ctr) Server is in a boot loop troubleshooting now [13:54:04] jouncebot: nowandnext [13:54:04] For the next 0 hour(s) and 5 minute(s): Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300) [13:54:04] For the next 0 hour(s) and 5 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1300) [13:54:04] In 2 hour(s) and 5 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1600) [13:54:19] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:54:22] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:54:23] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:54:26] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:54:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [13:54:27] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [13:54:31] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [13:54:32] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [13:54:34] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952003 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [13:54:34] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [13:54:35] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [13:54:39] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [13:54:40] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [13:54:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [13:54:56] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [13:54:57] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T344589)', diff saved to https://phabricator.wikimedia.org/P51332 and previous config saved to /var/cache/conftool/dbconfig/20230824-135456-ladsgroup.json [13:54:57] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [13:55:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10Rmaung) 05Resolved→03Open [13:55:10] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [13:55:10] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [13:55:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P51333 and previous config saved to /var/cache/conftool/dbconfig/20230824-135516-ladsgroup.json [13:55:20] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [13:55:21] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [13:55:32] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [13:55:33] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [13:55:40] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [13:56:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T344589)', diff saved to https://phabricator.wikimedia.org/P51334 and previous config saved to /var/cache/conftool/dbconfig/20230824-135636-ladsgroup.json [13:56:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:56:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:57:00] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T344589)', diff saved to https://phabricator.wikimedia.org/P51335 and previous config saved to /var/cache/conftool/dbconfig/20230824-135659-ladsgroup.json [13:58:47] 10SRE, 10Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10ayounsi) Cool, thanks for the details, makes sens to use `prefix-limit` with `teardown` then, maybe some timeout so it automatically recovers and double check ou... [13:59:54] (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [14:00:01] (03PS11) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [14:00:03] (03PS12) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [14:00:05] (03PS12) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [14:00:21] (03CR) 10Clément Goubert: [C: 03+1] mediawiki::php: Remove check [puppet] - 10https://gerrit.wikimedia.org/r/952197 (owner: 10Muehlenhoff) [14:00:53] (03CR) 10CI reject: [V: 04-1] modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [14:00:55] (03CR) 10CI reject: [V: 04-1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:00:57] (03CR) 
10CI reject: [V: 04-1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:02:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T344589)', diff saved to https://phabricator.wikimedia.org/P51336 and previous config saved to /var/cache/conftool/dbconfig/20230824-140218-ladsgroup.json [14:02:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T344589)', diff saved to https://phabricator.wikimedia.org/P51337 and previous config saved to /var/cache/conftool/dbconfig/20230824-140226-ladsgroup.json [14:02:41] (03PS1) 10JMeybohm: jeager: Temporarily lower the lifetime of TLS certs to 2 days [deployment-charts] - 10https://gerrit.wikimedia.org/r/952220 (https://phabricator.wikimedia.org/T344253) [14:02:56] (03PS1) 10Bking: wdqs: Add allowlist.txt [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) [14:02:59] (03PS12) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [14:03:01] (03PS13) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [14:03:03] (03PS13) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [14:03:14] (03PS2) 10Ssingh: dnsrecursor: use validate_cmd for pdns-recursor config [puppet] - 10https://gerrit.wikimedia.org/r/937139 [14:03:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [14:03:46] (03CR) 10CI reject: [V: 04-1] 
modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [14:03:48] (03CR) 10CI reject: [V: 04-1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:03:52] (03CR) 10CI reject: [V: 04-1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:05:11] (03CR) 10Ssingh: "A bit split about this because while I think this is important, we also ran into a bunch of issues with the durum hosts. In any case, my v" [puppet] - 10https://gerrit.wikimedia.org/r/937139 (owner: 10Ssingh) [14:06:14] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [14:06:50] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (10colewhite) [14:07:10] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10colewhite) 05Open→03Resolved a:03colewhite [14:08:53] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:00] (03PS1) 10Muehlenhoff: debian: Remove support for Stretch and update spec tests [puppet] - 10https://gerrit.wikimedia.org/r/952222 [14:09:41] (03PS2) 10Bking: wdqs: Add allowlist.txt [puppet] - 10https://gerrit.wikimedia.org/r/952221 
(https://phabricator.wikimedia.org/T343856) [14:10:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51338 and previous config saved to /var/cache/conftool/dbconfig/20230824-141022-ladsgroup.json [14:10:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:10:30] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:10:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:10:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51339 and previous config saved to /var/cache/conftool/dbconfig/20230824-141043-ladsgroup.json [14:10:47] (03CR) 10Muehlenhoff: [C: 03+2] mediawiki::php: Remove check [puppet] - 10https://gerrit.wikimedia.org/r/952197 (owner: 10Muehlenhoff) [14:11:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [14:11:41] (03CR) 10CI reject: [V: 04-1] debian: Remove support for Stretch and update spec tests [puppet] - 10https://gerrit.wikimedia.org/r/952222 (owner: 10Muehlenhoff) [14:12:53] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: bump CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/952223 [14:17:25] (03CR) 10Filippo Giunchedi: [C: 03+1] jeager: Temporarily lower the lifetime of TLS certs to 2 days [deployment-charts] - 10https://gerrit.wikimedia.org/r/952220 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [14:17:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P51340 and previous config saved to 
/var/cache/conftool/dbconfig/20230824-141725-ladsgroup.json [14:17:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P51341 and previous config saved to /var/cache/conftool/dbconfig/20230824-141733-ladsgroup.json [14:18:52] (03PS1) 10Ssingh: test_dns: add new DNS hosts in esams doh300[34] [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/952225 [14:18:53] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:21] (03CR) 10Ssingh: [C: 03+2] test_dns: add new DNS hosts in esams doh300[34] [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/952225 (owner: 10Ssingh) [14:21:46] !log installing openssl security updates on buster [14:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:39] !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:esams and A:wikidough [14:25:07] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T344872 (10Jhancock.wm) a:03Jhancock.wm [14:25:25] (03PS13) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [14:25:27] (03PS14) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [14:25:29] (03PS14) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [14:26:09] (03CR) 10CI reject: [V: 04-1] modules: Add a new 
networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [14:26:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] tegola-vector-tiles: bump CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/952223 (owner: 10Effie Mouzeli) [14:26:17] (03CR) 10CI reject: [V: 04-1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:26:26] (03CR) 10CI reject: [V: 04-1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:27:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] tegola-vector-tiles: bump CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/952223 (owner: 10Effie Mouzeli) [14:28:24] (03CR) 10Jforrester: mathoid: pipeline bot promote (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/906694 (owner: 10PipelineBot) [14:28:40] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1009 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:28:42] RECOVERY - Blazegraph Port for wdqs-categories on wdqs1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:29:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P51342 and previous config saved to /var/cache/conftool/dbconfig/20230824-142900-ladsgroup.json [14:30:40] (03PS2) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/890357 (owner: 10PipelineBot) [14:30:42] (03PS2) 
10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/906694 (owner: 10PipelineBot) [14:30:44] (03PS2) 10Jforrester: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/919375 (owner: 10PipelineBot) [14:30:45] 10SRE, 10ops-eqiad, 10DBA: db1178 didn't come back online after reboot - https://phabricator.wikimedia.org/T344880 (10Jclark-ctr) Server is out of warranty. pulled dimm from recently decom server and replaced. A7. [14:31:20] 10SRE, 10ops-eqiad, 10DBA: db1178 didn't come back online after reboot - https://phabricator.wikimedia.org/T344880 (10Jclark-ctr) 05Open→03Resolved Server is back up and running [14:31:28] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:31:39] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/930833 (owner: 10PipelineBot) [14:31:46] ^ expected [14:31:50] !log restarting FPM on mw canaries [14:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:11] 10SRE, 10ops-eqiad, 10DBA: db1178 didn't come back online after reboot - https://phabricator.wikimedia.org/T344880 (10Ladsgroup) Thanks for fast fix. I really appreciate it. 
[14:32:19] (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/935141 (owner: 10PipelineBot) [14:32:30] (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/935142 (owner: 10PipelineBot) [14:32:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P51343 and previous config saved to /var/cache/conftool/dbconfig/20230824-143231-ladsgroup.json [14:32:35] (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/935132 (owner: 10PipelineBot) [14:32:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P51344 and previous config saved to /var/cache/conftool/dbconfig/20230824-143239-ladsgroup.json [14:35:48] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:36:50] 10ops-codfw, 10Content-Transform-Team, 10serviceops-radar, 10Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (10Jhancock.wm) Turns out there is one more thing I need to do to. I missed a firmware update. Is it safe for me to reboot at this time? 
[14:38:03] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: bump CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/952223 (owner: 10Effie Mouzeli) [14:38:47] (03Merged) 10jenkins-bot: tegola-vector-tiles: bump CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/952223 (owner: 10Effie Mouzeli) [14:38:52] (03PS1) 10Jelto: miscweb: migrate bugzilla image to GitLab [deployment-charts] - 10https://gerrit.wikimedia.org/r/952228 (https://phabricator.wikimedia.org/T343914) [14:39:36] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:40:16] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:41:07] (03CR) 10Muehlenhoff: [C: 03+2] firewall::service: Replace whitespace in resource title with underscores [puppet] - 10https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:43:08] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:43:47] (03PS10) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) [14:43:54] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:44:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P51345 and previous config saved to /var/cache/conftool/dbconfig/20230824-144404-ladsgroup.json [14:44:19] 
!log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1016.eqiad.wmnet [14:44:32] (03CR) 10CI reject: [V: 04-1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [14:47:28] (03Abandoned) 10Jdrewniak: Launch content separation Zebra AB Test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918568 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia) [14:47:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T344589)', diff saved to https://phabricator.wikimedia.org/P51346 and previous config saved to /var/cache/conftool/dbconfig/20230824-144737-ladsgroup.json [14:47:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:47:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T344589)', diff saved to https://phabricator.wikimedia.org/P51347 and previous config saved to /var/cache/conftool/dbconfig/20230824-144745-ladsgroup.json [14:47:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [14:47:55] (03CR) 10Jelto: [C: 03+2] P:gitlab::runner: Do not schedule untagged jobs on WMCS [puppet] - 10https://gerrit.wikimedia.org/r/952017 (https://phabricator.wikimedia.org/T344874) (owner: 10Dduvall) [14:47:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:48:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T344589)', diff saved to https://phabricator.wikimedia.org/P51348 and previous config saved to /var/cache/conftool/dbconfig/20230824-144801-ladsgroup.json [14:48:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [14:48:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T344589)', diff saved to https://phabricator.wikimedia.org/P51349 and previous config saved to /var/cache/conftool/dbconfig/20230824-144810-ladsgroup.json [14:49:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51350 and previous config saved to /var/cache/conftool/dbconfig/20230824-144903-ladsgroup.json [14:49:09] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:49:13] (03PS11) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) [14:49:34] (03Abandoned) 10MSantos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/951877 (owner: 10PipelineBot) [14:49:56] (03CR) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [14:50:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:51:46] (03CR) 10CI reject: [V: 04-1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [14:52:04] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [14:52:40] !log jiji@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/tegola-vector-tiles: apply [14:52:56] (03Abandoned) 10MSantos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/934464 (owner: 10PipelineBot) [14:53:06] (03Abandoned) 10MSantos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/935868 (owner: 10PipelineBot) [14:53:14] !log installing poppler security updates [14:53:16] (03Abandoned) 10MSantos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/935882 (owner: 10PipelineBot) [14:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T344589)', diff saved to https://phabricator.wikimedia.org/P51351 and previous config saved to /var/cache/conftool/dbconfig/20230824-145317-ladsgroup.json [14:53:22] (03CR) 10JMeybohm: [C: 03+1] miscweb: migrate bugzilla image to GitLab [deployment-charts] - 10https://gerrit.wikimedia.org/r/952228 (https://phabricator.wikimedia.org/T343914) (owner: 10Jelto) [14:53:26] (03Abandoned) 10MSantos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/939283 (owner: 10PipelineBot) [14:53:35] (03Abandoned) 10MSantos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/940220 (owner: 10PipelineBot) [14:53:41] (03CR) 10JMeybohm: [C: 03+2] jeager: Temporarily lower the lifetime of TLS certs to 2 days [deployment-charts] - 10https://gerrit.wikimedia.org/r/952220 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [14:54:06] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1016.eqiad.wmnet [14:54:09] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1019.eqiad.wmnet [14:54:25] (03Merged) 10jenkins-bot: jeager: Temporarily lower the lifetime of TLS certs to 2 days [deployment-charts] - 
10https://gerrit.wikimedia.org/r/952220 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [14:54:39] (03Abandoned) 10MSantos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/941912 (owner: 10PipelineBot) [14:54:46] (03Abandoned) 10MSantos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/942791 (owner: 10PipelineBot) [14:54:55] (03Abandoned) 10MSantos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/945010 (owner: 10PipelineBot) [14:55:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T344589)', diff saved to https://phabricator.wikimedia.org/P51352 and previous config saved to /var/cache/conftool/dbconfig/20230824-145519-ladsgroup.json [14:55:26] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:55:33] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:56:02] (03PS14) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [14:56:05] (03PS15) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [14:56:07] (03PS15) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [14:56:16] (03PS2) 10Giuseppe Lavagetto: httpd: fix ecs logging format 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952199 [14:56:58] PROBLEM - Check systemd state on kubestagemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P51353 and previous config saved to /var/cache/conftool/dbconfig/20230824-145909-ladsgroup.json [15:00:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:00:38] PROBLEM - Check whether ferm is active by checking the default input chain on kubestagemaster1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:00:53] (03CR) 10Kamila Součková: [C: 03+1] httpd: fix ecs logging format [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952199 (owner: 10Giuseppe Lavagetto) [15:01:15] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [15:02:23] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:esams and A:wikidough [15:02:23] !log pool kartotherian on codfw [15:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:59] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [15:03:04] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpd: fix ecs logging format [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952199 (owner: 10Giuseppe Lavagetto) [15:03:34] !log eevans@cumin1001 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host restbase1019.eqiad.wmnet [15:03:38] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1020.eqiad.wmnet [15:04:09] (03CR) 10Btullis: wdqs: Add allowlist.txt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [15:04:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P51354 and previous config saved to /var/cache/conftool/dbconfig/20230824-150410-ladsgroup.json [15:05:12] (03PS1) 10Ilias Sarantopoulos: ml-services: update enwiki articlequality model [deployment-charts] - 10https://gerrit.wikimedia.org/r/952230 (https://phabricator.wikimedia.org/T344895) [15:05:28] (03PS1) 10JMeybohm: jaeger: Fix typo in secretName [deployment-charts] - 10https://gerrit.wikimedia.org/r/952231 (https://phabricator.wikimedia.org/T344253) [15:07:22] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:07:22] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:07:22] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} 
(Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:07:22] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:07:22] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:07:23] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:07:23] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:07:24] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} 
(Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:07:24] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:07:25] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:07:25] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:07:26] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:07:26] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: 
Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:07:27] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:07:27] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:07:28] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:07:28] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:07:29] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:07:29] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:07:30] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:07:30] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:07:40] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:07:40] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage 
returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:07:40] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:07:40] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:07:40] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:07:41] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:07:41] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 
(expecting: 200): /en.wikipedia.org/v1/page/mobi [15:07:42] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:07:42] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:07:50] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:07:50] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:07:50] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:07:52] PROBLEM - restbase endpoints health on 
restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:07:52] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:07:52] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:07:52] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:07:52] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from 
storage) is CRITICAL: Test Get med [15:07:53] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:08:08] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:08:08] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:08:08] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:08:14] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:08:14] ons/{title} (Get mobile-sections for a test page on enwiki) 
is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:08:14] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:08:18] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobi [15:08:18] ons/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) is CRITICAL: Test Get med [15:08:18] from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storag https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:08:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P51355 and previous config saved 
to /var/cache/conftool/dbconfig/20230824-150823-ladsgroup.json [15:09:43] (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [15:09:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:10:03] (03CR) 10JMeybohm: [C: 03+2] jaeger: Fix typo in secretName [deployment-charts] - 10https://gerrit.wikimedia.org/r/952231 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [15:10:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P51356 and previous config saved to /var/cache/conftool/dbconfig/20230824-151025-ladsgroup.json [15:10:27] (03PS1) 10Kamila Součková: benthos/mw_accesslog_metrics: fix parsing [puppet] - 10https://gerrit.wikimedia.org/r/952239 (https://phabricator.wikimedia.org/T276095) [15:11:16] (03Merged) 10jenkins-bot: jaeger: Fix typo in secretName [deployment-charts] - 10https://gerrit.wikimedia.org/r/952231 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [15:11:30] (03CR) 10Giuseppe Lavagetto: [C: 03+1] benthos/mw_accesslog_metrics: fix parsing [puppet] - 10https://gerrit.wikimedia.org/r/952239 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [15:11:30] urandom, herron ^^ around? 
[15:11:42] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:11:42] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:11:42] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:11:42] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:11:42] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:11:43] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:11:49] vgutierrez: yep [15:11:58] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:12:00] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:12:00] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:12:08] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:12:10] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:12:10] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:12:11] vgutierrez: aye 
[15:12:25] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10Rmaung) Here is the ssh public key generated from the new machine: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIASWBpeq1Ju1EHwv5Jd7aupwy787kls1Az2ffAPWIPfJ reb... [15:12:26] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:12:30] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:12:36] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:13:01] !log bking@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wcqs,name=eqiad [15:13:16] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:13:41] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [15:13:53] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:13:59] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [15:14:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P51357 and previous config saved to /var/cache/conftool/dbconfig/20230824-151414-ladsgroup.json [15:14:43] (03CR) 10Kamila Součková: [V: 03+1] "PCC SUCCESS (NOOP 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43014/console" [puppet] - 10https://gerrit.wikimedia.org/r/952239 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [15:14:43] (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [15:14:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:14:56] <_joe_> jouncebot: nowandnext [15:14:56] No deployments scheduled for the next 0 hour(s) and 45 minute(s) [15:14:56] In 0 hour(s) and 45 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1600) [15:15:18] ok we have more traffic? [15:15:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:15:47] <_joe_> what is going on in production right now? [15:15:49] effie: not really.. at least not in terms of restbase [15:16:01] _joe_: I see more rps on apps [15:16:06] (03CR) 10Kamila Součková: [V: 03+1 C: 03+2] benthos/mw_accesslog_metrics: fix parsing [puppet] - 10https://gerrit.wikimedia.org/r/952239 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [15:16:08] but restbase started to return 500s by thousands [15:16:10] and I see many http. 
error on NEL [15:16:18] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host restbase1020.eqiad.wmnet [15:16:22] <_joe_> effie: are you oncall right now? [15:16:27] _joe_: no [15:16:50] that's https://grafana.wikimedia.org/goto/6gp7_PgIz?orgId=1 [15:16:54] <_joe_> there's clearly something wrong [15:17:17] <_joe_> herron, urandom ^^ can you please check what is the pattern of requests? [15:17:55] <_joe_> the restbase thing seems resolved [15:18:07] we are recovering [15:18:11] (03CR) 10AikoChou: [C: 03+1] ml-services: update enwiki articlequality model [deployment-charts] - 10https://gerrit.wikimedia.org/r/952230 (https://phabricator.wikimedia.org/T344895) (owner: 10Ilias Sarantopoulos) [15:18:16] I was doing a rolling reboot of restbase servers [15:18:18] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [15:18:40] (03CR) 10JHathaway: [C: 03+2] puppetserver: ensure correct ordering when using an intermediate cert [puppet] - 10https://gerrit.wikimedia.org/r/952003 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [15:18:40] urandom: uh :) [15:18:46] why that would have caused the high rate of 500s is puzzling though [15:18:53] <_joe_> so yeah, our baseline of requests is now 5k/s vs 3k/s before we repooled esams [15:18:55] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [15:19:05] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:19:06] <_joe_> vgutierrez: did we move de and uk back? 
[15:19:11] we didn't yet [15:19:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P51358 and previous config saved to /var/cache/conftool/dbconfig/20230824-151916-ladsgroup.json [15:19:19] <_joe_> I would suggest we don't [15:19:22] but the rates don't look as bad as yesterday [15:19:40] <_joe_> bblack: yesterday we had a scraper, today we're at a baseline of 5k rps on appservers right now [15:19:49] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&viewPanel=65&from=now-7d&to=now&refresh=1m [15:19:57] <_joe_> ah wait sorry, wrong graph [15:19:59] I'm just comparing to how things looked pre-esams. [15:20:02] <_joe_> yeah it's less severe [15:20:05] other than this spike just now [15:20:20] a few days back, eqiad was peaking ~3k ish, now like 3.3k? [15:20:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:20:30] <_joe_> yeah it's not that bad [15:20:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:20:39] !log oblivian@deploy1002 Started scap: (no justification provided) [15:20:57] <_joe_> this is not a true deployment, I'm just rebuilding the docker images [15:21:48] !log depool kartotherian on eqiad [15:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [15:22:17] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [15:22:44] (03PS1) 10Jdrewniak: watchlist: Don't assume only named users have watchlist access 
[skins/MinervaNeue] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/952130 (https://phabricator.wikimedia.org/T344870) [15:23:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P51359 and previous config saved to /var/cache/conftool/dbconfig/20230824-152329-ladsgroup.json [15:23:46] looks like a complete drop off in requests to mobileapps from restbase in that last burst of errors: https://grafana-rw.wikimedia.org/d/5CmeRcnMz/mobileapps?forceLogin&from=now-30m&orgId=1&to=now&var-container_name=All&var-dc=thanos&var-prometheus=k8s&var-service=mobileapps&var-site=eqiad [15:24:02] well no, sorry, big spike in errors [15:25:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P51360 and previous config saved to /var/cache/conftool/dbconfig/20230824-152531-ladsgroup.json [15:25:41] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:26:13] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:26:38] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:27:12] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:27:41] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: API request failed (backend-fail-internal): An unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T337991 (10DAlangi_WMF) [15:30:27] (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [15:30:47] (03CR) 10Filippo Giunchedi: "To be merged once the ingress 
work is completed" [puppet] - 10https://gerrit.wikimedia.org/r/952151 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [15:33:38] (03PS1) 10JMeybohm: PKI: Rename aux key to match the naming scheme of everything else [labs/private] - 10https://gerrit.wikimedia.org/r/952242 (https://phabricator.wikimedia.org/T344253) [15:34:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T343718)', diff saved to https://phabricator.wikimedia.org/P51361 and previous config saved to /var/cache/conftool/dbconfig/20230824-153422-ladsgroup.json [15:34:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [15:34:28] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [15:34:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [15:34:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T343718)', diff saved to https://phabricator.wikimedia.org/P51362 and previous config saved to /var/cache/conftool/dbconfig/20230824-153443-ladsgroup.json [15:37:38] (03PS1) 10JMeybohm: PKI: Rename the aux profile to match the naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) [15:38:05] (03CR) 10CI reject: [V: 04-1] PKI: Rename the aux profile to match the naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [15:38:22] urandom: dunno if expected or not but per https://grafana.wikimedia.org/goto/NNe4XPRIk?orgId=1 metrics_edited-pages_aggregate_-project-_-editor-type-_-page-type-_-activity-level-_-granularity-_-start-_-end got super slow when you started rebooting servers [15:38:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T344589)', diff saved 
to https://phabricator.wikimedia.org/P51363 and previous config saved to /var/cache/conftool/dbconfig/20230824-153835-ladsgroup.json [15:38:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:38:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:39:38] (03CR) 10Bking: rdf-streaming-updater-dse-k8s: Add Zookeeper HA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951551 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [15:40:19] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43015/console" [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [15:40:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T344589)', diff saved to https://phabricator.wikimedia.org/P51364 and previous config saved to /var/cache/conftool/dbconfig/20230824-154037-ladsgroup.json [15:40:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [15:40:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [15:41:03] (03PS1) 10JMeybohm: Revert "jeager: Temporarily lower the lifetime of TLS certs to 2 days" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952131 (https://phabricator.wikimedia.org/T344253) [15:41:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T344589)', diff saved to https://phabricator.wikimedia.org/P51365 and previous config saved to /var/cache/conftool/dbconfig/20230824-154102-ladsgroup.json [15:41:21] (03PS2) 10JMeybohm: Revert "jeager: Temporarily lower the lifetime of TLS certs 
to 2 days" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952131 (https://phabricator.wikimedia.org/T344253) [15:42:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:42:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:42:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T344589)', diff saved to https://phabricator.wikimedia.org/P51366 and previous config saved to /var/cache/conftool/dbconfig/20230824-154238-ladsgroup.json [15:43:01] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "jeager: Temporarily lower the lifetime of TLS certs to 2 days" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952131 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [15:44:10] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:27] (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [15:45:31] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:45:34] (03PS2) 10JMeybohm: PKI: Rename the aux profile to match the naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) [15:48:26] (03PS1) 10JHathaway: dev env: disable cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/952245 (https://phabricator.wikimedia.org/T337970) [15:48:30] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T344589)', diff saved to https://phabricator.wikimedia.org/P51367 and previous config saved to /var/cache/conftool/dbconfig/20230824-154829-ladsgroup.json [15:49:13] (03CR) 10JHathaway: [C: 03+2] dev env: disable cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/952245 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [15:49:45] (03CR) 10JMeybohm: "labs/private change is at https://gerrit.wikimedia.org/r/c/labs/private/+/952242" [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [15:49:47] (03CR) 10Gmodena: [C: 03+2] data-engineering: flink: alert when TM is missing for 5m. [alerts] - 10https://gerrit.wikimedia.org/r/951959 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [15:49:52] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] PKI: Rename aux key to match the naming scheme of everything else [labs/private] - 10https://gerrit.wikimedia.org/r/952242 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [15:49:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T344589)', diff saved to https://phabricator.wikimedia.org/P51368 and previous config saved to /var/cache/conftool/dbconfig/20230824-154956-ladsgroup.json [15:50:40] (03PS1) 10JMeybohm: aux: Rename the aux profile to match the naming scheme [deployment-charts] - 10https://gerrit.wikimedia.org/r/952246 (https://phabricator.wikimedia.org/T344253) [15:50:58] (03CR) 10Filippo Giunchedi: [C: 03+1] PKI: Rename the aux profile to match the naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [15:51:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [15:51:03] (03Merged) 10jenkins-bot: data-engineering: flink: alert when TM is missing 
for 5m. [alerts] - 10https://gerrit.wikimedia.org/r/951959 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [15:51:05] (03CR) 10JMeybohm: "This depends on I1b8896cfce4f8f07d979635beacdfd7fe90bd7ed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952246 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [15:51:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [15:54:09] (03PS1) 10Ssingh: lvs/esams: unify LVS hiera overrides for esams [puppet] - 10https://gerrit.wikimedia.org/r/952247 [15:55:08] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43016/console" [puppet] - 10https://gerrit.wikimedia.org/r/952247 (owner: 10Ssingh) [15:56:24] (03CR) 10JMeybohm: [C: 03+2] Revert "jeager: Temporarily lower the lifetime of TLS certs to 2 days" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952131 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [15:56:58] (03PS2) 10Ssingh: lvs/esams: unify LVS hiera overrides for esams [puppet] - 10https://gerrit.wikimedia.org/r/952247 [15:57:05] (03Merged) 10jenkins-bot: Revert "jeager: Temporarily lower the lifetime of TLS certs to 2 days" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952131 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [15:57:30] (03PS3) 10Btullis: Fail back hive to the primary coordinator [dns] - 10https://gerrit.wikimedia.org/r/952194 (https://phabricator.wikimedia.org/T344671) [15:57:32] (03CR) 10BCornwall: [C: 03+1] lvs/esams: unify LVS hiera overrides for esams [puppet] - 10https://gerrit.wikimedia.org/r/952247 (owner: 10Ssingh) [15:58:06] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43017/console" [puppet] - 10https://gerrit.wikimedia.org/r/952247 (owner: 10Ssingh) [16:00:04] jbond: (Dis)respected 
human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1600). Please do the needful. [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:09] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for Kizule (aka Zoranzoki21) - https://phabricator.wikimedia.org/T344887 (10Kizule) > What do you plan to use deployment access for? For deploying config patches from https://wikitech.wikimedia.org/wiki/Deployments. [16:00:18] !log disable puppet on A:lvs and A:esams to merge 952247 [16:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:21] (03CR) 10Ssingh: [V: 03+1 C: 03+2] lvs/esams: unify LVS hiera overrides for esams [puppet] - 10https://gerrit.wikimedia.org/r/952247 (owner: 10Ssingh) [16:01:11] (03CR) 10Btullis: [C: 03+2] Fail back hive to the primary coordinator [dns] - 10https://gerrit.wikimedia.org/r/952194 (https://phabricator.wikimedia.org/T344671) (owner: 10Btullis) [16:01:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, small documentation comment inline" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [16:02:36] (03PS15) 10Giuseppe Lavagetto: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [16:03:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P51369 and previous config saved to /var/cache/conftool/dbconfig/20230824-160335-ladsgroup.json [16:03:54] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [16:04:25] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Actually, your change is missing some changes that were added to 
networkpolicy 1.0.1 I think, you should backport it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [16:04:33] !log enable puppet on A:lvs and A:esams and force run agent to merge 952247 [16:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20230824-160502-ladsgroup.json [16:05:11] (03PS1) 10Jbond: pupetdb: add netbox::standalone to allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/952251 [16:05:29] (03Abandoned) 10Btullis: query_service: let puppet manage whitelist [puppet] - 10https://gerrit.wikimedia.org/r/952215 (https://phabricator.wikimedia.org/T343856) (owner: 10Btullis) [16:06:19] (03CR) 10Jbond: [C: 03+2] pupetdb: add netbox::standalone to allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/952251 (owner: 10Jbond) [16:08:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [16:09:40] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [16:10:17] !log bking@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wcqs,name=eqiad [16:10:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T343718)', diff saved to https://phabricator.wikimedia.org/P51371 and previous config saved to /var/cache/conftool/dbconfig/20230824-161050-ladsgroup.json [16:10:56] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [16:11:28] 10SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10BCornwall) After some 
discussion in our last ONFIRE meeting it appears that our most basic needs comprise of: 1. A real-time editor for in-the-moment information... [16:12:21] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-master1001.eqiad.wmnet [16:13:27] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:15:50] (03CR) 10Btullis: [C: 03+1] "Let's try it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/952208 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [16:17:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43018/console" [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [16:18:27] (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:18:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P51372 and previous config saved to /var/cache/conftool/dbconfig/20230824-161841-ladsgroup.json [16:20:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P51373 and previous config saved to /var/cache/conftool/dbconfig/20230824-162013-ladsgroup.json [16:24:41] (03CR) 10Stevemunene: [C: 03+2] datahub: set preferred oidc jwt algotithm [deployment-charts] - 10https://gerrit.wikimedia.org/r/952208 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [16:25:33] (03Merged) 
10jenkins-bot: datahub: set preferred oidc jwt algotithm [deployment-charts] - 10https://gerrit.wikimedia.org/r/952208 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [16:25:44] (03CR) 10Jbond: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [16:25:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P51374 and previous config saved to /var/cache/conftool/dbconfig/20230824-162556-ladsgroup.json [16:27:53] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [16:28:27] (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:28:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [16:29:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [16:30:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [16:30:58] !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [16:33:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [16:33:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T344589)', diff saved to https://phabricator.wikimedia.org/P51375 and previous config saved to 
/var/cache/conftool/dbconfig/20230824-163347-ladsgroup.json [16:33:55] 10SRE, 10MW-on-K8s, 10Observability-Logging, 10serviceops: Keep calculating latencies for MediaWiki requests in the WikiKube environment - https://phabricator.wikimedia.org/T276095 (10kamila) Benthos is deployed and producing metrics, but I am not closing this yet, because the logs contain quite a lot of e... [16:33:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [16:34:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [16:34:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [16:34:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [16:34:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T344589)', diff saved to https://phabricator.wikimedia.org/P51376 and previous config saved to /var/cache/conftool/dbconfig/20230824-163419-ladsgroup.json [16:35:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-master1001.eqiad.wmnet [16:35:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T344589)', diff saved to https://phabricator.wikimedia.org/P51377 and previous config saved to /var/cache/conftool/dbconfig/20230824-163519-ladsgroup.json [16:35:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [16:35:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [16:35:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T344589)', diff saved to 
https://phabricator.wikimedia.org/P51378 and previous config saved to /var/cache/conftool/dbconfig/20230824-163543-ladsgroup.json [16:38:43] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2025'] [16:39:02] (03PS4) 10FNegri: New files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) [16:41:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P51380 and previous config saved to /var/cache/conftool/dbconfig/20230824-164103-ladsgroup.json [16:41:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T344589)', diff saved to https://phabricator.wikimedia.org/P51381 and previous config saved to /var/cache/conftool/dbconfig/20230824-164140-ladsgroup.json [16:43:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T344589)', diff saved to https://phabricator.wikimedia.org/P51382 and previous config saved to /var/cache/conftool/dbconfig/20230824-164301-ladsgroup.json [16:48:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2025'] [16:48:40] 10SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10herron) [16:49:06] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2025'] [16:49:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2025'] [16:49:30] 10SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10herron) [16:50:34] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: 
Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Jhancock.wm) [16:52:17] (03PS1) 10BryanDavis: toolhub: Bump container version to 2023-08-21-195715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952254 [16:56:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T343718)', diff saved to https://phabricator.wikimedia.org/P51383 and previous config saved to /var/cache/conftool/dbconfig/20230824-165609-ladsgroup.json [16:56:15] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [16:56:41] (03CR) 10Dduvall: "Thanks for the review, Jbond!" [puppet] - 10https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall) [16:56:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P51384 and previous config saved to /var/cache/conftool/dbconfig/20230824-165646-ladsgroup.json [16:57:25] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2023-08-21-112124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952255 [16:58:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P51385 and previous config saved to /var/cache/conftool/dbconfig/20230824-165807-ladsgroup.json [16:59:53] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2023-08-21-195715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952254 (owner: 10BryanDavis) [17:00:06] bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1700). 
[17:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1700) [17:00:26] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2023-08-21-112124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952255 (owner: 10BryanDavis) [17:00:36] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2023-08-21-195715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952254 (owner: 10BryanDavis) [17:01:13] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2023-08-21-112124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952255 (owner: 10BryanDavis) [17:01:36] I will be deploying both toolhub and developer-portal in today's window (which I probably should rename now that Tech Engagement is gone) [17:05:18] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [17:06:43] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [17:07:03] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [17:08:15] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:08:21] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [17:08:57] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [17:10:03] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [17:10:27] (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - 
https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [17:10:59] !log Toolhub updated to a59d37 [17:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:04] (03PS1) 10FNegri: [openstack] automatic file duplication for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285) [17:11:33] (03CR) 10CI reject: [V: 04-1] [openstack] automatic file duplication for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [17:11:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P51386 and previous config saved to /var/cache/conftool/dbconfig/20230824-171152-ladsgroup.json [17:11:53] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:12:16] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:12:23] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:12:43] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:12:49] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:13:09] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:13:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P51387 and previous config saved to /var/cache/conftool/dbconfig/20230824-171314-ladsgroup.json [17:15:27] (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - 
https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [17:17:20] !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@15ed2de]: (no justification provided) [17:17:40] !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@15ed2de]: (no justification provided) (duration: 00m 19s) [17:21:45] (03PS2) 10FNegri: [openstack] automatic file duplication for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285) [17:22:44] 10sre-alert-triage, 10Release-Engineering-Team: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342755 (10thcipriani) >>! In T342755#9115945, @fgiunchedi wrote: >>>! In T342755#9114368, @thcipriani wrote: >> Hrm. We get an email from the systemd timer for this, so the alert is probabl... [17:23:36] !log [WCQS] T344882 `ryankemper@wcqs1003:~$ sudo depool` [17:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:42] T344882: Some servers for the Commons query service (WCQS) are missing data - https://phabricator.wikimedia.org/T344882 [17:26:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T344589)', diff saved to https://phabricator.wikimedia.org/P51388 and previous config saved to /var/cache/conftool/dbconfig/20230824-172658-ladsgroup.json [17:27:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [17:27:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [17:27:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T344589)', diff saved to https://phabricator.wikimedia.org/P51389 and previous config saved to 
/var/cache/conftool/dbconfig/20230824-172723-ladsgroup.json [17:28:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T344589)', diff saved to https://phabricator.wikimedia.org/P51390 and previous config saved to /var/cache/conftool/dbconfig/20230824-172820-ladsgroup.json [17:28:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [17:28:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [17:28:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:28:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:28:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T344589)', diff saved to https://phabricator.wikimedia.org/P51391 and previous config saved to /var/cache/conftool/dbconfig/20230824-172851-ladsgroup.json [17:30:17] (03CR) 10Bking: [C: 03+2] spdx.rb: Skip SPDX enforcement of txt files [puppet] - 10https://gerrit.wikimedia.org/r/949112 (https://phabricator.wikimedia.org/T344291) (owner: 10Bking) [17:34:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T344589)', diff saved to https://phabricator.wikimedia.org/P51393 and previous config saved to /var/cache/conftool/dbconfig/20230824-173448-ladsgroup.json [17:34:54] (03PS1) 10Krinkle: Add option to just create the 'Global rename script' system user [extensions/CentralAuth] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/952132 (https://phabricator.wikimedia.org/T344632) [17:36:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T344589)', diff saved to 
https://phabricator.wikimedia.org/P51394 and previous config saved to /var/cache/conftool/dbconfig/20230824-173609-ladsgroup.json [17:36:25] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:39:07] (03PS3) 10FNegri: [openstack] automatic file duplication for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285) [17:40:00] (03PS1) 10Cwhite: logstash: move error to error.message when it is a string [puppet] - 10https://gerrit.wikimedia.org/r/951881 (https://phabricator.wikimedia.org/T276468) [17:46:27] (03PS4) 10FNegri: [openstack] automatic file duplication for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/952259 (https://phabricator.wikimedia.org/T341285) [17:48:44] (03PS3) 10Bking: wdqs: Add allowlist.txt [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) [17:48:58] (03CR) 10CI reject: [V: 04-1] wdqs: Add allowlist.txt [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [17:48:58] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:49:53] (03PS4) 10Bking: wdqs: Add allowlist.txt [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) [17:49:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P51395 and previous config saved to /var/cache/conftool/dbconfig/20230824-174954-ladsgroup.json [17:50:39] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [17:51:04] (03PS5) 10Bking: wdqs: Add allowlist.txt [puppet] - 
10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) [17:51:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P51396 and previous config saved to /var/cache/conftool/dbconfig/20230824-175115-ladsgroup.json [17:53:27] (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [17:55:53] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [18:00:05] dduvall and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T1800). 
[18:05:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P51397 and previous config saved to /var/cache/conftool/dbconfig/20230824-180500-ladsgroup.json [18:06:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P51398 and previous config saved to /var/cache/conftool/dbconfig/20230824-180621-ladsgroup.json [18:08:27] (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:18:53] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:20:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T344589)', diff saved to https://phabricator.wikimedia.org/P51399 and previous config saved to /var/cache/conftool/dbconfig/20230824-182006-ladsgroup.json [18:20:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [18:20:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [18:20:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T344589)', diff saved to https://phabricator.wikimedia.org/P51400 and previous config saved to /var/cache/conftool/dbconfig/20230824-182032-ladsgroup.json [18:20:48] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer 
(exit_code=0) [18:21:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T344589)', diff saved to https://phabricator.wikimedia.org/P51401 and previous config saved to /var/cache/conftool/dbconfig/20230824-182128-ladsgroup.json [18:21:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [18:21:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [18:21:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1206 (T344589)', diff saved to https://phabricator.wikimedia.org/P51402 and previous config saved to /var/cache/conftool/dbconfig/20230824-182151-ladsgroup.json [18:26:56] (03PS6) 10Bking: wdqs: Add allowlist.txt [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) [18:28:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T344589)', diff saved to https://phabricator.wikimedia.org/P51403 and previous config saved to /var/cache/conftool/dbconfig/20230824-182802-ladsgroup.json [18:28:52] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [18:29:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Jclark-ctr) pc1015 A6 U33 pc1016. 
C6 U31 [18:29:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Jclark-ctr) a:03Jclark-ctr [18:31:27] (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:35:41] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P51404 and previous config saved to /var/cache/conftool/dbconfig/20230824-184308-ladsgroup.json [18:46:27] (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:48:35] (03PS7) 10Bking: wdqs: Add allowlist.txt [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) [18:49:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T344589)', diff saved to https://phabricator.wikimedia.org/P51405 and previous config saved to /var/cache/conftool/dbconfig/20230824-184915-ladsgroup.json [18:49:26] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [18:50:01] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:39] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr) wdqs1017 D2. U38 wdqs1018 E2 U40 wdqs1019. F2. U39 [18:51:43] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43019/console" [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [18:52:08] (03CR) 10Ryan Kemper: [C: 03+1] wdqs: Add allowlist.txt [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [18:52:12] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr) [18:52:57] (03CR) 10Bking: [C: 03+2] wdqs: Add allowlist.txt [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [18:53:14] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr) a:03Jclark-ctr [18:53:21] (03CR) 10Btullis: wdqs: Add allowlist.txt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [18:53:23] (03CR) 10Bking: [C: 03+2] wdqs: Add allowlist.txt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [18:54:52] (03PS1) 10TrainBranchBot: all wikis to 1.41.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952261 (https://phabricator.wikimedia.org/T343725) [18:54:54] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.41.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952261 (https://phabricator.wikimedia.org/T343725) (owner: 10TrainBranchBot) [18:55:41] (03Merged) 10jenkins-bot: all wikis to 1.41.0-wmf.23 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/952261 (https://phabricator.wikimedia.org/T343725) (owner: 10TrainBranchBot) [18:58:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P51406 and previous config saved to /var/cache/conftool/dbconfig/20230824-185816-ladsgroup.json [18:58:27] (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:01:40] (03CR) 10Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T2000" [extensions/CentralAuth] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/952132 (https://phabricator.wikimedia.org/T344632) (owner: 10Krinkle) [19:03:18] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.23 refs T343725 [19:03:23] T343725: 1.41.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T343725 [19:03:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:04:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P51407 and previous config saved to /var/cache/conftool/dbconfig/20230824-190422-ladsgroup.json [19:08:27] (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - 
https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:08:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:09:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:10:29] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:13:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T344589)', diff saved to https://phabricator.wikimedia.org/P51408 and previous config saved to /var/cache/conftool/dbconfig/20230824-191322-ladsgroup.json [19:14:03] (03CR) 10Btullis: [C: 03+1] wdqs: Add allowlist.txt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952221 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [19:19:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P51409 and previous config saved to /var/cache/conftool/dbconfig/20230824-191928-ladsgroup.json [19:22:27] (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:27:27] (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - 
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:30:49] !log pool kartotherian to eqiad and depool from codfw [19:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:58] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [19:34:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T344589)', diff saved to https://phabricator.wikimedia.org/P51410 and previous config saved to /var/cache/conftool/dbconfig/20230824-193434-ladsgroup.json [19:34:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [19:34:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [19:34:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1207 (T344589)', diff saved to https://phabricator.wikimedia.org/P51411 and previous config saved to /var/cache/conftool/dbconfig/20230824-193458-ladsgroup.json [19:37:09] 10SRE, 10Traffic, 10observability: HAProxy metrics go down on config reload - https://phabricator.wikimedia.org/T343000 (10BCornwall) I'm not sure that a smaller period does fix things. Attached is a 5m and 2m. Switching to irate() is showing similar things, too. 
{F37627399} {F37627398} [19:43:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T344589)', diff saved to https://phabricator.wikimedia.org/P51412 and previous config saved to /var/cache/conftool/dbconfig/20230824-194317-ladsgroup.json [19:55:27] (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:58:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P51414 and previous config saved to /var/cache/conftool/dbconfig/20230824-195823-ladsgroup.json [20:00:04] brennen and TheresNoTime: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230824T2000). [20:00:04] jan_drewniak and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:27] (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:00:56] hi [20:01:01] o/ [20:03:05] (03PS1) 10Effie Mouzeli: Revert "thanos-fe: switch to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/952133 [20:03:35] !log enabling puppet on thanos-fe* hosts [20:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:45] o/ [20:04:06] MatmaRex, jan_drewniak: i'm sort of pressed for time at the moment but let's see what we can do. [20:04:14] I can deploy here in a sec, too [20:04:19] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "thanos-fe: switch to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/952133 (owner: 10Effie Mouzeli) [20:06:19] PROBLEM - Host logstash1037 is DOWN: PING CRITICAL - Packet loss = 100% [20:06:23] RECOVERY - Host logstash1037 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [20:07:05] (03CR) 10Thcipriani: [C: 03+2] watchlist: Don't assume only named users have watchlist access [skins/MinervaNeue] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/952130 (https://phabricator.wikimedia.org/T344870) (owner: 10Jdrewniak) [20:07:36] I'll get jenkins going for the tests that take a while [20:08:11] (03CR) 10Thcipriani: [C: 03+2] Add option to just create the 'Global rename script' system user [extensions/CentralAuth] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/952132 (https://phabricator.wikimedia.org/T344632) (owner: 10Krinkle) [20:08:22] and let's do the config in the interim [20:08:27] (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - 
https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:09:08] (03PS3) 10Thcipriani: Remove unused RESTBase-related VisualEditor config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949593 (https://phabricator.wikimedia.org/T341618) (owner: 10Bartosz Dziewoński) [20:09:33] MatmaRex: since you put this up for deploy (and it *is* next week), assuming your -1 is null :) [20:10:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949593 (https://phabricator.wikimedia.org/T341618) (owner: 10Bartosz Dziewoński) [20:10:56] thcipriani: yes, sorry :) [20:11:03] (03Merged) 10jenkins-bot: Remove unused RESTBase-related VisualEditor config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949593 (https://phabricator.wikimedia.org/T341618) (owner: 10Bartosz Dziewoński) [20:11:22] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:949593|Remove unused RESTBase-related VisualEditor config settings (T341618)]] [20:11:30] T341618: Remove deprecated RESTBase-related VE config settings - https://phabricator.wikimedia.org/T341618 [20:12:52] !log thcipriani@deploy1002 thcipriani and matmarex: Backport for [[gerrit:949593|Remove unused RESTBase-related VisualEditor config settings (T341618)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:13:23] ^ MatmaRex anything to test? not exploding the test since these are "unused"? 
[20:13:27] (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:13:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P51415 and previous config saved to /var/cache/conftool/dbconfig/20230824-201329-ladsgroup.json [20:13:31] hi, I have a small security patch that would be nice to deploy, if there's time in this window. [20:14:27] thcipriani: yeah, nothing specific to test [20:14:27] kostajh: there's probably room for it, do you need me to deploy or are you able to deploy (I forget)? [20:14:54] the visual editor still loads [20:15:10] I'm able to deploy, but I'm not as familiar with syncing security patches so would prefer if someone else with more experience could do it [20:15:43] PROBLEM - puppet last run on thanos-fe1001 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:15:47] MatmaRex: just ran the same test :D Thanks for confirming, going live [20:15:55] !log thcipriani@deploy1002 thcipriani and matmarex: Continuing with sync [20:16:24] kostajh: happy to deploy, wanna DM me details? [20:16:38] sure, thank you! 
[20:17:27] (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:19:47] PROBLEM - puppet last run on thanos-fe1002 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:21:07] RECOVERY - puppet last run on thanos-fe1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:21:20] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:949593|Remove unused RESTBase-related VisualEditor config settings (T341618)]] (duration: 09m 58s) [20:21:25] T341618: Remove deprecated RESTBase-related VE config settings - https://phabricator.wikimedia.org/T341618 [20:21:28] ^ MatmaRex live now [20:21:35] (Merged) jenkins-bot: watchlist: Don't assume only named users have watchlist access [skins/MinervaNeue] (wmf/1.41.0-wmf.23) - https://gerrit.wikimedia.org/r/952130 (https://phabricator.wikimedia.org/T344870) (owner: Jdrewniak) [20:21:37] (Merged) jenkins-bot: Add option to just create the 'Global rename script' system user [extensions/CentralAuth] (wmf/1.41.0-wmf.23) - https://gerrit.wikimedia.org/r/952132 (https://phabricator.wikimedia.org/T344632) (owner: Krinkle) [20:21:53] PROBLEM - puppet last run on thanos-fe1003 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:22:10] thanks [20:22:27] (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - 
https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:22:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:22:59] PROBLEM - puppet last run on thanos-fe1004 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:25:27] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:27:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:28:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T344589)', diff saved to https://phabricator.wikimedia.org/P51416 and previous config saved to /var/cache/conftool/dbconfig/20230824-202836-ladsgroup.json [20:28:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [20:28:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [20:29:18] kostajh: going to sling out your security patch, then MatmaRex: and jan_drewniak your sync is going out together since 
one is a maintenance script [20:29:40] cool, thanks [20:29:43] +1 [20:29:47] thcipriani: thanks! [20:30:27] (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:33:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [20:33:13] !log bking@deploy1002 Started deploy [wdqs/wdqs@2455ffd]: (no justification provided) [20:33:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [20:33:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1219 (T344589)', diff saved to https://phabricator.wikimedia.org/P51417 and previous config saved to /var/cache/conftool/dbconfig/20230824-203322-ladsgroup.json [20:34:26] !log bking@deploy1002 'scap deploy new wdqs T343856' [20:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:33] T343856: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 [20:35:27] (RedisMemoryFull) resolved: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:37:55] !log bking@deploy1002 Finished deploy [wdqs/wdqs@2455ffd]: (no justification provided) (duration: 04m 41s) [20:39:21] RECOVERY - Host wdqs1005 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:40:05] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 299 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:40:23] PROBLEM - Query Service HTTP Port on wdqs1005 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service [20:40:23] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:40:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T344589)', diff saved to https://phabricator.wikimedia.org/P51418 and previous config saved to /var/cache/conftool/dbconfig/20230824-204035-ladsgroup.json [20:41:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:42:29] PROBLEM - puppet last run on wdqs1005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:43:58] !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 [20:44:04] T343856: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 [20:44:11] (SystemdUnitFailed) firing: nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:44:18] alright. security patch slung. I'll move on to others. [20:44:38] thanks! 
[20:45:47] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:952132|Add option to just create the 'Global rename script' system user (T344632)]], [[gerrit:952130|watchlist: Don't assume only named users have watchlist access (T344870)]] [20:45:53] T344870: MinervaNeue: Watchstar missing for anonymous users - https://phabricator.wikimedia.org/T344870 [20:45:54] T344632: Unable to inspect Global rename script log entries on enwiki - https://phabricator.wikimedia.org/T344632 [20:46:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:47:14] !log thcipriani@deploy1002 thcipriani and jdrewniak and krinkle: Backport for [[gerrit:952132|Add option to just create the 'Global rename script' system user (T344632)]], [[gerrit:952130|watchlist: Don't assume only named users have watchlist access (T344870)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:47:46] ^ jan_drewniak your change is live on mwdebug boxen, check please [20:48:42] thcipriani: perfect, thanks! [20:51:53] jan_drewniak: does that mean you tested and it looks perfect?
[20:52:20] thcipriani: yes it does :) [20:52:34] ah, ok :D [20:52:42] going live now [20:52:57] !log thcipriani@deploy1002 thcipriani and jdrewniak and krinkle: Continuing with sync [20:53:27] (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:53:53] (SystemdUnitFailed) firing: (3) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:54:35] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:54:45] PROBLEM - Check systemd state on wdqs2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-blazegraph.service,wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:13] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:55:17] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [20:55:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', 
diff saved to https://phabricator.wikimedia.org/P51419 and previous config saved to /var/cache/conftool/dbconfig/20230824-205541-ladsgroup.json [20:58:19] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:952132|Add option to just create the 'Global rename script' system user (T344632)]], [[gerrit:952130|watchlist: Don't assume only named users have watchlist access (T344870)]] (duration: 12m 31s) [20:58:25] T344870: MinervaNeue: Watchstar missing for anonymous users - https://phabricator.wikimedia.org/T344870 [20:58:25] T344632: Unable to inspect Global rename script log entries on enwiki - https://phabricator.wikimedia.org/T344632 [20:58:32] ^ jan_drewniak MatmaRex all sync'd now [20:58:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:58:46] MatmaRex: do you need me to run this maintenance script? [20:58:49] thcipriani: thanks. do we have time to run the script too? it should only take a few seconds [20:58:58] (SystemdUnitFailed) firing: (4) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:59:03] ah, cool, sure, lemme login to mwmaint [20:59:27] on all wikis: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --create-system-user [20:59:43] thank you [20:59:52] so foreachwiki is the right thing, correct? [21:00:24] i think so [21:01:28] !log mwmaint1002:foreachwiki extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --create-system-user # ref. 
952132 [21:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [21:03:27] (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:03:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:04:27] MatmaRex: it's going, in the k's now. I'll let you know when it's complete. Got a few "CentralAuth must be enabled. try again" type messages, but nothing else really. [21:05:12] ah. i was hoping it'd really be a couple seconds, but i guess just starting the scripts is slower than i thought [21:06:02] !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 (duration: 22m 03s) [21:06:07] T343856: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 [21:06:14] MatmaRex: done now [21:06:25] RECOVERY - puppet last run on thanos-fe1004 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:06:29] thanks thcipriani. 
sorry for running over [21:06:44] it worked as expected, this shows up now: https://en.wikipedia.org/wiki/Special:Log/Global_rename_script [21:06:47] \o/ [21:06:52] !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 [21:07:15] kudos, alright, calling window complete! Thanks all. [21:08:59] (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:09:48] !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 (duration: 02m 56s) [21:10:45] RECOVERY - puppet last run on thanos-fe1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:10:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P51421 and previous config saved to /var/cache/conftool/dbconfig/20230824-211048-ladsgroup.json [21:11:44] (SystemdUnitCrashLoop) firing: wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:13:27] (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:13:53] (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:03] !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: (no justification provided) [21:14:05] RECOVERY - puppet last run 
on thanos-fe1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:14:43] !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: (no justification provided) (duration: 00m 40s) [21:14:50] !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: (no justification provided) [21:15:45] !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: (no justification provided) (duration: 00m 55s) [21:16:02] !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 [21:16:06] T343856: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 [21:17:30] (PS1) Ryan Kemper: wdqs: disable alerts on wdqs1010 [puppet] - https://gerrit.wikimedia.org/r/952278 (https://phabricator.wikimedia.org/T344518) [21:18:15] (CR) Bking: [C: +1] wdqs: disable alerts on wdqs1010 [puppet] - https://gerrit.wikimedia.org/r/952278 (https://phabricator.wikimedia.org/T344518) (owner: Ryan Kemper) [21:18:19] !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 (duration: 02m 17s) [21:18:24] (CR) Ryan Kemper: [C: +2] wdqs: disable alerts on wdqs1010 [puppet] - https://gerrit.wikimedia.org/r/952278 (https://phabricator.wikimedia.org/T344518) (owner: Ryan Kemper) [21:18:27] (RedisMemoryFull) resolved: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:18:53] (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:19:55] !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 
[21:19:59] (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:21:27] (RedisMemoryFull) firing: Redis memory full on rdb2009:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_misc&var-instance=rdb2009:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:21:44] (SystemdUnitCrashLoop) resolved: wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:23:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2025.codfw.wmnet with OS bullseye [21:23:17] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye [21:25:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T344589)', diff saved to https://phabricator.wikimedia.org/P51422 and previous config saved to /var/cache/conftool/dbconfig/20230824-212554-ladsgroup.json [21:26:27] (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:28:13] !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 (duration: 08m 18s) [21:28:18] T343856: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 [21:28:31] 
RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:28:53] (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:28:59] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2007 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:29:15] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:29:16] !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 [21:29:17] RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:29:31] !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: allow list changes T343856 0.3.125 (duration: 00m 15s) [21:29:59] (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:31:41] RECOVERY - Check systemd state on wdqs2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:53] (SystemdUnitFailed) firing: (6) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:38:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2025.codfw.wmnet with OS bullseye [21:38:39] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes... [21:38:53] (SystemdUnitFailed) firing: (7) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:38:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:39:59] (SystemdUnitFailed) firing: (7) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:41:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:43:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2025.codfw.wmnet with OS bullseye [21:43:13] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:43:16] SRE, ops-codfw, DC-Ops, 
serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye [21:43:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2025.codfw.wmnet with OS bullseye [21:43:30] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes... [21:43:33] RECOVERY - Query Service HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:43:35] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:59] (SystemdUnitFailed) firing: (7) nginx.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:46:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:47:48] (PS1) Cwhite: grafana: ensure prometheus/global instances removed [puppet] - https://gerrit.wikimedia.org/r/951882 (https://phabricator.wikimedia.org/T288196) [21:48:29] (PS2) Cwhite: grafana: ensure prometheus/global datasources removed [puppet] - https://gerrit.wikimedia.org/r/951882 (https://phabricator.wikimedia.org/T288196) [21:59:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host 
kubernetes2025.codfw.wmnet with OS bullseye [21:59:20] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye [22:11:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:15:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2025.codfw.wmnet with OS bullseye [22:15:51] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes... [22:16:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:18:53] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:21:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:41:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:42:15] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The 
following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:49:31] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:04:58] (PS2) BBlack: Revert "Send germany and UK to drmrs" [dns] - https://gerrit.wikimedia.org/r/951486 (owner: Ayounsi) [23:06:19] (CR) Ssingh: [C: +1] Revert "Send germany and UK to drmrs" [dns] - https://gerrit.wikimedia.org/r/951486 (owner: Ayounsi) [23:08:24] (CR) BBlack: [C: +2] Revert "Send germany and UK to drmrs" [dns] - https://gerrit.wikimedia.org/r/951486 (owner: Ayounsi) [23:09:47] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:06] !log geodns: DE+GB mapped back to esams (were temporarily on drmrs) [23:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:19:57] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:21:27] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - 
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:23:15] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:53] (SystemdUnitFailed) firing: (2) ipmiseld.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:26:27] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:31:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:46:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull