[00:25:01] (NodeTextfileStale) firing: (3) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:31:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/919382 [00:39:30] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/919382 (owner: 10TrainBranchBot) [00:59:22] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/919382 (owner: 10TrainBranchBot) [01:03:34] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T336720 (10phaultfinder) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T0200) [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.9 [core] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/919383 (https://phabricator.wikimedia.org/T330215) [02:08:05] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.9 [core] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/919383 (https://phabricator.wikimedia.org/T330215) (owner: 
10TrainBranchBot) [02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:21:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:24:02] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.9 [core] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/919383 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:56:55] (03CR) 10TChin: Add flink-app default log config and use it in page_content_change (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: 10TChin) [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T0300) [03:01:08] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:30] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919934 (https://phabricator.wikimedia.org/T330215) [03:01:32] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919934 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [03:02:14] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919934 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [03:02:46] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.9 refs T330215 [03:02:51] T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215 [03:10:08] RECOVERY - Check systemd state on an-airflow1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:26:14] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:51:34] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.9 refs T330215 (duration: 48m 47s) [03:51:39] T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215 [03:54:03] !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.6, 1.41.0-wmf.7 (duration: 02m 26s) [04:15:01] (NodeTextfileStale) firing: (3) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:18:04] RECOVERY - dump 
of backup1-codfw in codfw on backupmon1001 is OK: Last dump for backup1-codfw at codfw (db2184) taken on 2023-05-16 03:53:29 (15 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [05:08:55] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1123.eqiad.wmnet - https://phabricator.wikimedia.org/T334910 (10Marostegui) [05:10:20] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10Marostegui) Thanks @Jclark-ctr @jcrespo can you take care of putting this host back in service as it is a backup source one? [05:20:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121 T336337', diff saved to https://phabricator.wikimedia.org/P48236 and previous config saved to /var/cache/conftool/dbconfig/20230516-052014-root.json [05:20:19] T336337: Failover s4 sanitarium master - https://phabricator.wikimedia.org/T336337 [05:20:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1221 T336337', diff saved to https://phabricator.wikimedia.org/P48237 and previous config saved to /var/cache/conftool/dbconfig/20230516-052026-root.json [05:24:29] (03PS1) 10Marostegui: db1121: No longer sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/920135 (https://phabricator.wikimedia.org/T336337) [05:25:42] (03CR) 10Marostegui: [C: 03+2] db1121: No longer sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/920135 (https://phabricator.wikimedia.org/T336337) (owner: 10Marostegui) [05:27:38] (03PS1) 10Marostegui: site.pp: Update sanitarium master for s4 [puppet] - 10https://gerrit.wikimedia.org/r/920137 (https://phabricator.wikimedia.org/T336337) [05:28:57] (03CR) 10Marostegui: [C: 03+2] site.pp: Update sanitarium master for s4 [puppet] - 10https://gerrit.wikimedia.org/r/920137 (https://phabricator.wikimedia.org/T336337) (owner: 10Marostegui) [05:29:20] !log marostegui@cumin1001 dbctl commit (dc=all): 
'db1121 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48238 and previous config saved to /var/cache/conftool/dbconfig/20230516-052920-root.json [05:29:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48239 and previous config saved to /var/cache/conftool/dbconfig/20230516-052936-root.json [05:30:24] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1113.eqiad.wmnet - https://phabricator.wikimedia.org/T336029 (10Marostegui) [05:32:12] (03PS1) 10Marostegui: ProductionServices.php: Failover pc3 codfw host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920139 [05:33:09] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Failover pc3 codfw host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920139 (owner: 10Marostegui) [05:33:11] (03PS1) 10Marostegui: pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/920140 [05:33:54] (03Merged) 10jenkins-bot: ProductionServices.php: Failover pc3 codfw host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920139 (owner: 10Marostegui) [05:35:32] (03CR) 10Marostegui: [C: 03+2] pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/920140 (owner: 10Marostegui) [05:36:32] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:920139|ProductionServices.php: Failover pc3 codfw host]] [05:38:05] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:920139|ProductionServices.php: Failover pc3 codfw host]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [05:43:47] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:920139|ProductionServices.php: Failover pc3 codfw host]] (duration: 07m 15s) [05:44:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 3%: Repooling after maintenance', diff 
saved to https://phabricator.wikimedia.org/P48240 and previous config saved to /var/cache/conftool/dbconfig/20230516-054425-root.json [05:44:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48241 and previous config saved to /var/cache/conftool/dbconfig/20230516-054441-root.json [05:44:51] (03PS1) 10Marostegui: Revert "pc2014: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/919322 [05:45:04] (03PS1) 10Marostegui: Revert "ProductionServices.php: Failover pc3 codfw host" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919323 [05:51:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 T336332', diff saved to https://phabricator.wikimedia.org/P48242 and previous config saved to /var/cache/conftool/dbconfig/20230516-055122-root.json [05:51:28] T336332: decommission db1112.eqiad.wmnet - https://phabricator.wikimedia.org/T336332 [05:52:32] (03PS1) 10Marostegui: db1112: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/920142 (https://phabricator.wikimedia.org/T336332) [05:53:03] (03CR) 10Marostegui: [C: 03+2] db1112: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/920142 (https://phabricator.wikimedia.org/T336332) (owner: 10Marostegui) [05:53:30] (03CR) 10Marostegui: [C: 03+2] Revert "pc2014: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/919322 (owner: 10Marostegui) [05:53:41] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Failover pc3 codfw host" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919323 (owner: 10Marostegui) [05:54:29] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Failover pc3 codfw host" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919323 (owner: 10Marostegui) [05:58:07] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:919323|Revert "ProductionServices.php: Failover pc3 codfw host"]] [05:59:30] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48243 and previous config saved to /var/cache/conftool/dbconfig/20230516-055929-root.json [05:59:36] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:919323|Revert "ProductionServices.php: Failover pc3 codfw host"]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [05:59:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48244 and previous config saved to /var/cache/conftool/dbconfig/20230516-055946-root.json [06:00:06] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T0600) [06:00:06] kormat, marostegui, and Amir1: #bothumor Q: Why did functions stop calling each other? A: They had arguments. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T0600). 
[06:02:15] (03PS1) 10Marostegui: pc2014: Move it to pc1 [puppet] - 10https://gerrit.wikimedia.org/r/920143 [06:05:28] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:919323|Revert "ProductionServices.php: Failover pc3 codfw host"]] (duration: 07m 21s) [06:09:06] (03CR) 10Marostegui: [C: 03+2] pc2014: Move it to pc1 [puppet] - 10https://gerrit.wikimedia.org/r/920143 (owner: 10Marostegui) [06:12:34] (03PS1) 10Marostegui: pc1014: Make it pc3 master [puppet] - 10https://gerrit.wikimedia.org/r/920146 [06:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:13:34] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920147 [06:14:02] (03CR) 10Marostegui: [C: 03+2] pc1014: Make it pc3 master [puppet] - 10https://gerrit.wikimedia.org/r/920146 (owner: 10Marostegui) [06:14:17] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920147 (owner: 10Marostegui) [06:14:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48245 and previous config saved to /var/cache/conftool/dbconfig/20230516-061434-root.json [06:14:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48246 and previous config saved to /var/cache/conftool/dbconfig/20230516-061450-root.json [06:15:04] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920147 (owner: 10Marostegui) [06:17:26] !log marostegui@deploy1002 
Started scap: Backport for [[gerrit:920147|ProductionServices.php: Promote pc1014 to pc3 master]] [06:18:51] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:920147|ProductionServices.php: Promote pc1014 to pc3 master]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [06:24:34] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:920147|ProductionServices.php: Promote pc1014 to pc3 master]] (duration: 07m 08s) [06:25:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:29:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48247 and previous config saved to /var/cache/conftool/dbconfig/20230516-062939-root.json [06:29:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48248 and previous config saved to /var/cache/conftool/dbconfig/20230516-062955-root.json [06:30:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:31:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: Add auto_prepend_file to PHP config_cli (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/910882 (https://phabricator.wikimedia.org/T253547) (owner: 10Krinkle) [06:33:31] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/919324 [06:33:46] (03PS1) 10Marostegui: Revert "pc1014: Make it pc3 master" [puppet] - 10https://gerrit.wikimedia.org/r/919325 [06:40:33] (03CR) 10Slyngshede: [C: 03+2] sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [06:44:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48249 and previous config saved to /var/cache/conftool/dbconfig/20230516-064444-root.json [06:45:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48250 and previous config saved to /var/cache/conftool/dbconfig/20230516-064500-root.json [06:46:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] jobrunner: reduce max_requests_per_connection to 100 [puppet] - 10https://gerrit.wikimedia.org/r/919262 (https://phabricator.wikimedia.org/T336554) (owner: 10Giuseppe Lavagetto) [06:47:00] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_purge_parsercache_pc3.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:41] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919324 (owner: 10Marostegui) [06:49:07] <_joe_> !log running docker image prune -a in build2001 [06:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:24] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919324 (owner: 10Marostegui) [06:49:54] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:919324|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] [06:51:26] !log marostegui@deploy1002 
marostegui: Backport for [[gerrit:919324|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [06:52:01] (03CR) 10Marostegui: [C: 03+2] Revert "pc1014: Make it pc3 master" [puppet] - 10https://gerrit.wikimedia.org/r/919325 (owner: 10Marostegui) [06:56:52] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:919324|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] (duration: 06m 58s) [06:57:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [06:59:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48251 and previous config saved to /var/cache/conftool/dbconfig/20230516-065948-root.json [07:00:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48252 and previous config saved to /var/cache/conftool/dbconfig/20230516-070005-root.json [07:00:06] Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T0700) [07:00:06] No Gerrit patches in the queue for this window AFAICS. 
[07:07:20] (03PS1) 10Slyngshede: k8s upgrade cluster: use sre.hosts.reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/920192 (https://phabricator.wikimedia.org/T336491) [07:09:53] (03CR) 10CI reject: [V: 04-1] k8s upgrade cluster: use sre.hosts.reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/920192 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [07:14:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48253 and previous config saved to /var/cache/conftool/dbconfig/20230516-071453-root.json [07:15:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48254 and previous config saved to /var/cache/conftool/dbconfig/20230516-071509-root.json [07:15:35] (03PS2) 10Slyngshede: k8s upgrade cluster: use sre.hosts.reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/920192 (https://phabricator.wikimedia.org/T336491) [07:16:06] (03CR) 10Muehlenhoff: Obsolete profile::python37 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/917813 (owner: 10Muehlenhoff) [07:16:30] (03PS1) 10Marostegui: install_server: Do not reimage db1220 [puppet] - 10https://gerrit.wikimedia.org/r/920193 [07:26:14] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:28:51] !log restart vopsbot.service on alert1001 [07:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:28] (03PS1) 10Marostegui: production-m5.sql: Add ipoid grants [puppet] - 10https://gerrit.wikimedia.org/r/920194 (https://phabricator.wikimedia.org/T305114) [07:31:30] 10SRE, 10ops-eqiad, 
10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10jcrespo) I was working on it already :-D, was going to notify when completed, as it has 3 sections and I have so far only loaded back 2. [07:31:35] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1220 [puppet] - 10https://gerrit.wikimedia.org/r/920193 (owner: 10Marostegui) [07:33:07] (03PS1) 10Giuseppe Lavagetto: docker::baseimages: skip building bookworm [puppet] - 10https://gerrit.wikimedia.org/r/920196 (https://phabricator.wikimedia.org/T335560) [07:34:53] PROBLEM - BGP status on cr3-knams is CRITICAL: BGP CRITICAL - AS1257/IPv6: Connect - Tele2, AS1257/IPv4: Active - Tele2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:38:46] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10Marostegui) \o/ [07:40:51] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41194/console" [puppet] - 10https://gerrit.wikimedia.org/r/918427 (https://phabricator.wikimedia.org/T316935) (owner: 10Jelto) [07:42:16] (03PS2) 10Giuseppe Lavagetto: docker::baseimages: skip building bookworm [puppet] - 10https://gerrit.wikimedia.org/r/920196 (https://phabricator.wikimedia.org/T335560) [07:43:41] (03CR) 10Ayounsi: [C: 03+1] sites.yaml: add new dns host dns2005 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/919876 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh) [07:44:17] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41195/console" [puppet] - 10https://gerrit.wikimedia.org/r/920196 (https://phabricator.wikimedia.org/T335560) (owner: 10Giuseppe Lavagetto) [07:44:50] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: run 
backup sync and restore twice daily [puppet] - 10https://gerrit.wikimedia.org/r/918427 (https://phabricator.wikimedia.org/T316935) (owner: 10Jelto) [07:45:17] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] docker::baseimages: skip building bookworm [puppet] - 10https://gerrit.wikimedia.org/r/920196 (https://phabricator.wikimedia.org/T335560) (owner: 10Giuseppe Lavagetto) [07:52:15] !log jelto@cumin1001 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner [07:54:02] (03CR) 10David Caro: [C: 03+1] "LGTM" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919920 (https://phabricator.wikimedia.org/T320904) (owner: 10BryanDavis) [07:58:01] (NodeTextfileStale) resolved: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:59:09] 10SRE-swift-storage: Thanos root filesystem filling with logs - https://phabricator.wikimedia.org/T329712 (10MatthewVernon) 05Open→03Resolved [this was resolved back in February - we moved the two thanos backends back into service and added on delaycompress] [08:12:08] (03PS1) 10Filippo Giunchedi: sre: disable pint promql/series check for SystemdUnitFailed [alerts] - 10https://gerrit.wikimedia.org/r/920199 (https://phabricator.wikimedia.org/T309182) [08:12:17] (03CR) 10JMeybohm: [C: 03+1] New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [08:14:47] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10MoritzMuehlenhoff) [08:15:01] (NodeTextfileStale) firing: (2) Stale textfile for cloudcontrol2001-dev:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:16:41] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10MoritzMuehlenhoff) [08:17:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy2003.codfw.wmnet with reason: Maintenance [08:17:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy2003.codfw.wmnet with reason: Maintenance [08:17:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy2004.codfw.wmnet with reason: Maintenance [08:18:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy2004.codfw.wmnet with reason: Maintenance [08:18:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc2014.codfw.wmnet with reason: Maintenance [08:18:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2014.codfw.wmnet with reason: Maintenance [08:18:27] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: name=ldap-replica2006.wikimedia.org [08:19:17] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10MoritzMuehlenhoff) [08:21:47] (03CR) 10Jaime Nuche: "Thank you for the merge!" 
[labs/private] - 10https://gerrit.wikimedia.org/r/919833 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [08:23:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7:00:00 on es[2023-2025].codfw.wmnet with reason: maintenance [08:23:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on es[2023-2025].codfw.wmnet with reason: maintenance [08:24:13] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10MoritzMuehlenhoff) [08:25:18] (03CR) 10Jaime Nuche: "Thanks a lot for the fix, merging and monitoring. I really appreciate all the effort." [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [08:26:09] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 72, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:27:20] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: disable pint promql/series check for SystemdUnitFailed [alerts] - 10https://gerrit.wikimedia.org/r/920199 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [08:28:27] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Switch cp4052 to HAProxy 2.7 branch [puppet] - 10https://gerrit.wikimedia.org/r/919862 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [08:33:25] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 3 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10BTullis) Thanks @Dzahn - That's a useful reference. I've created two user accounts in Matomo for `twi... 
[08:33:29] (03PS1) 10Filippo Giunchedi: o11y: ignore promql/series for code/thanos-query-frontend [alerts] - 10https://gerrit.wikimedia.org/r/920201 (https://phabricator.wikimedia.org/T309182) [08:33:51] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:35:23] PROBLEM - haproxy process on cp4052 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [08:35:33] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [08:35:37] PROBLEM - Check systemd state on cp4052 is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:42] ^^ cp4052 is me and it's currently depooled [08:36:50] (03PS1) 10Mvolz: Update Zotero to most recent version [deployment-charts] - 10https://gerrit.wikimedia.org/r/920202 (https://phabricator.wikimedia.org/T336727) [08:36:57] RECOVERY - haproxy process on cp4052 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [08:37:03] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: ignore promql/series for code/thanos-query-frontend [alerts] - 10https://gerrit.wikimedia.org/r/920201 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [08:37:07] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4052 is OK: SSL OK - OCSP staple validity for wikipedia.org has 426172 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-07-23 06:25:44 +0000 (expires in 67 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:37:13] RECOVERY - Check systemd state on cp4052 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:38:37] PROBLEM - Host gitlab-runner1003 is DOWN: PING CRITICAL - Packet loss = 100% [08:38:41] (03PS2) 10Jcrespo: Revert "db1225: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/918446 [08:39:08] (03CR) 10Jcrespo: "We are ready to get db1225 into production" [puppet] - 10https://gerrit.wikimedia.org/r/918446 (owner: 10Jcrespo) [08:40:07] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff, 10User-jbond: Investigate GID allocation for system users - https://phabricator.wikimedia.org/T235163 (10MoritzMuehlenhoff) [08:42:48] 10SRE, 10Infrastructure-Foundations, 10User-Kormat: debdeploy skipped hosts and assumed they're up to date(?) - https://phabricator.wikimedia.org/T268735 (10MoritzMuehlenhoff) 05Open→03Declined Old task, no longer really actionable at this point and this hasn't been seen since then. [08:43:00] (03CR) 10Jcrespo: [C: 03+2] Revert "db1225: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/918446 (owner: 10Jcrespo) [08:43:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7:00:00 on es2033.codfw.wmnet with reason: Maintenance [08:43:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on es2033.codfw.wmnet with reason: Maintenance [08:43:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7:00:00 on es2034.codfw.wmnet with reason: Maintenance [08:44:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on es2034.codfw.wmnet with reason: Maintenance [08:46:47] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10jcrespo) All 3 sections loaded and replicating, I have reverted the notifications disabled patch. All done. 
[08:49:20] (03PS1) 10Slyngshede: sre.ganeti.makevm call reimage after VM creation [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) [08:49:31] (03PS1) 10Muehlenhoff: Temporary drop krb1001 from KDC list used by clients [puppet] - 10https://gerrit.wikimedia.org/r/920204 (https://phabricator.wikimedia.org/T331695) [08:49:45] (03PS1) 10Filippo Giunchedi: dcops: temp disable promql/series pint check for InterfaceSpeedError [alerts] - 10https://gerrit.wikimedia.org/r/920205 (https://phabricator.wikimedia.org/T309182) [08:50:59] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech, 10wmde-wikidata-tech: Wikidata seems to still be utilizing insecure HTTP URIs - https://phabricator.wikimedia.org/T331356 (10OlafJanssen) >>! In T331356#8849873, @Ladsgroup wrote: > Until it gets changed to HTTPS, basically we have two options: > - Remove the l... [08:52:49] (03PS1) 10Effie Mouzeli: php-multiversion-base: add rsvg-convert [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920206 (https://phabricator.wikimedia.org/T336025) [08:59:01] (03PS1) 10Vgutierrez: cache::haproxy: Set nbthreads on the first global section [puppet] - 10https://gerrit.wikimedia.org/r/920207 (https://phabricator.wikimedia.org/T317799) [08:59:08] (03PS3) 10JMeybohm: Update charts from mesh.configuration 1.2.0 to 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919848 (https://phabricator.wikimedia.org/T300324) [08:59:10] (03PS4) 10JMeybohm: Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324) [09:01:05] (03PS1) 10Klausman: admin_ng: add revertrisk model config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 [09:01:51] (03CR) 10CI reject: [V: 04-1] admin_ng: add revertrisk model config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 (owner: 10Klausman) [09:04:40] (03CR) 
10Effie Mouzeli: [V: 03+2 C: 03+2] php-multiversion-base: add rsvg-convert [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920206 (https://phabricator.wikimedia.org/T336025) (owner: 10Effie Mouzeli) [09:06:44] (03PS5) 10Samtar: InitialiseSettings: Set wgWatchersMaxAge=30days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919023 (https://phabricator.wikimedia.org/T336250) (owner: 10Sarah Mukuti) [09:08:37] (03PS2) 10Klausman: admin_ng: add revertrisk model config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 [09:11:43] PROBLEM - Check systemd state on es1020 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:45] (03PS3) 10Klausman: admin_ng: add revertrisk model config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) [09:16:27] (03PS1) 10Fabfur: admin: Add fabfur user [puppet] - 10https://gerrit.wikimedia.org/r/920209 [09:16:29] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/920209 (owner: 10Fabfur) [09:20:16] jouncebot: nowandnext [09:20:16] No deployments scheduled for the next 0 hour(s) and 39 minute(s) [09:20:17] In 0 hour(s) and 39 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1000) [09:21:09] !log Optimize s5 on dbstore1003 T336733 [09:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:14] T336733: dbstore1003 filling up - https://phabricator.wikimedia.org/T336733 [09:21:22] (03PS2) 10Vgutierrez: cache::haproxy: Set nbthreads on the first global section [puppet] - 10https://gerrit.wikimedia.org/r/920207 (https://phabricator.wikimedia.org/T317799) [09:22:23] (03PS1) 10Effie Mouzeli: php-multiversion-base: update readme [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920210 (https://phabricator.wikimedia.org/T336025) [09:23:02] !log jnuche@deploy1002 Installing scap version "4.52.2" for 595 hosts [09:23:14] (03PS2) 10Effie Mouzeli: php-multiversion-base: update readme [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920210 (https://phabricator.wikimedia.org/T336025) [09:23:27] !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.reboot-runner (exit_code=1) rolling reboot on A:gitlab-runner [09:23:49] (03PS3) 10Effie Mouzeli: php-multiversion-base: update readme [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920210 (https://phabricator.wikimedia.org/T336025) [09:24:53] (03CR) 10Giuseppe Lavagetto: [C: 03+1] php-multiversion-base: update readme [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920210 (https://phabricator.wikimedia.org/T336025) (owner: 10Effie Mouzeli) [09:25:20] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1004.eqiad.wmnet [09:25:37] RECOVERY - Check systemd 
state on es1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:00] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] php-multiversion-base: update readme [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920210 (https://phabricator.wikimedia.org/T336025) (owner: 10Effie Mouzeli) [09:26:08] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10hnowlan) [09:26:28] 10SRE, 10Infrastructure-Foundations: DHCP error while trying to run the reimaging cookbook for dns2005.wikimedia.org (install server install2004.wikimedia.org) - https://phabricator.wikimedia.org/T336696 (10Volans) 05Open→03Resolved a:03Volans File removed `sudo rm mgmt-codfw/ssw1-a1-codfw.mgmt.codfw.wmn... [09:28:29] (03PS1) 10Filippo Giunchedi: perf: disable promql/series lint checks for navtiming [alerts] - 10https://gerrit.wikimedia.org/r/920211 (https://phabricator.wikimedia.org/T309182) [09:28:56] (03PS3) 10Vgutierrez: cache::haproxy: Set nbthreads on the first global section [puppet] - 10https://gerrit.wikimedia.org/r/920207 (https://phabricator.wikimedia.org/T317799) [09:30:16] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41198/console" [puppet] - 10https://gerrit.wikimedia.org/r/920207 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [09:31:12] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1004.eqiad.wmnet [09:32:34] (03CR) 10Klausman: [C: 03+1] ml-services: add RevertRisk Wikidata model to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/919364 (https://phabricator.wikimedia.org/T333125) (owner: 10AikoChou) [09:33:29] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::haproxy: Set nbthreads on the first global section [puppet] - 
10https://gerrit.wikimedia.org/r/920207 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [09:36:16] (03PS2) 10Fabfur: admin: Add fabfur user - more readable [puppet] - 10https://gerrit.wikimedia.org/r/920209 [09:36:51] 10SRE, 10Infrastructure-Foundations: DHCP error while trying to run the reimaging cookbook for dns2005.wikimedia.org (install server install2004.wikimedia.org) - https://phabricator.wikimedia.org/T336696 (10Volans) FYI the workflow is described at https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#Wo... [09:37:18] RECOVERY - Host gitlab-runner1003 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [09:38:16] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1003.eqiad.wmnet [09:40:30] (03PS1) 10Vgutierrez: cache::haproxy: Fix missing socket variable [puppet] - 10https://gerrit.wikimedia.org/r/920212 (https://phabricator.wikimedia.org/T317799) [09:41:59] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41199/console" [puppet] - 10https://gerrit.wikimedia.org/r/920212 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [09:43:29] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::haproxy: Fix missing socket variable [puppet] - 10https://gerrit.wikimedia.org/r/920212 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [09:43:37] (03PS3) 10Fabfur: admin: Add fabfur user [puppet] - 10https://gerrit.wikimedia.org/r/920209 [09:44:27] (03CR) 10Filippo Giunchedi: [C: 03+2] dcops: temp disable promql/series pint check for InterfaceSpeedError [alerts] - 10https://gerrit.wikimedia.org/r/920205 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:44:33] (03CR) 10Filippo Giunchedi: [C: 03+2] perf: disable promql/series lint checks for navtiming [alerts] - 10https://gerrit.wikimedia.org/r/920211 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:44:55] (03CR) 
10Elukey: "Looks good to me (left a nit for the commit msg)! I don't see the revert risk namespace and configs in ml-staging-codfw, so you'll probabl" [deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [09:44:59] 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): gitlab-runner1003 is not coming back online - https://phabricator.wikimedia.org/T336737 (10Jelto) [09:45:30] (03PS10) 10Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 [09:45:32] (03PS8) 10Giuseppe Lavagetto: Simplify management of the request time limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749718 [09:45:34] (03PS1) 10Giuseppe Lavagetto: Do not use firejail on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920213 [09:45:44] (03PS4) 10Klausman: helmfile.d: add revertrisk model config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) [09:45:55] (03CR) 10Elukey: [C: 03+1] helmfile.d: add revertrisk model config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [09:46:15] (03CR) 10Klausman: helmfile.d: add revertrisk model config to ml-serve clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [09:46:27] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [09:46:57] (03CR) 10Vgutierrez: [C: 03+1] admin: Add fabfur user [puppet] - 10https://gerrit.wikimedia.org/r/920209 (owner: 10Fabfur) [09:49:13] !log btullis@deploy1002 Started deploy [airflow-dags/analytics_product@7642b62]: (no justification provided) [09:49:22] !log btullis@deploy1002 Finished deploy [airflow-dags/analytics_product@7642b62]: (no justification provided) (duration: 00m 09s) [09:49:27] (03CR) 10Elukey: [C: 03+1] "LGTM! I added Ben to the code review so Data Engineering can comment as well :)" [puppet] - 10https://gerrit.wikimedia.org/r/919802 (owner: 10Majavah) [09:50:31] (03CR) 10AikoChou: [C: 03+1] helmfile.d: add revertrisk model config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [09:51:06] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41200/console" [puppet] - 10https://gerrit.wikimedia.org/r/920209 (owner: 10Fabfur) [09:51:14] (03CR) 10Elukey: [C: 03+1] "Thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/920192 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [09:51:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/920209 (owner: 10Fabfur) [09:52:32] (03CR) 10Elukey: [C: 03+1] ml-services: add RevertRisk Wikidata model to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/919364 (https://phabricator.wikimedia.org/T333125) (owner: 10AikoChou) [09:52:40] (03CR) 10Elukey: [C: 03+2] ml-services: add RevertRisk Wikidata model to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/919364 (https://phabricator.wikimedia.org/T333125) (owner: 10AikoChou) [09:53:22] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (10MoritzMuehlenhoff) Thanks for all the input, much appreciated! I'll revise the plan and update the task in the next days. [09:55:31] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] admin: Add fabfur user [puppet] - 10https://gerrit.wikimedia.org/r/920209 (owner: 10Fabfur) [09:56:31] 10SRE, 10Observability-Metrics, 10Traffic, 10User-fgiunchedi: Upgrade cadvisor to 0.44 fleetwide - https://phabricator.wikimedia.org/T336740 (10fgiunchedi) [09:58:09] (03PS1) 10Ladsgroup: Prepare for v0.1.3 release [software/wmfdb] - 10https://gerrit.wikimedia.org/r/920214 (https://phabricator.wikimedia.org/T334455) [09:58:36] PROBLEM - Host gitlab-runner1003 is DOWN: PING CRITICAL - Packet loss = 100% [09:58:49] (03CR) 10Muehlenhoff: sre.ganeti.makevm call reimage after VM creation (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [09:59:04] (03CR) 10Muehlenhoff: [C: 03+2] Temporary drop krb1001 from KDC list used by clients [puppet] - 10https://gerrit.wikimedia.org/r/920204 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1000) [10:03:07] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:03:10] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:03:16] 10SRE, 10Observability-Metrics, 10serviceops, 10User-fgiunchedi: Upgrade cadvisor to 0.44 fleetwide - https://phabricator.wikimedia.org/T336740 (10fgiunchedi) [10:04:18] (03CR) 10Volans: [C: 04-1] "There's a small issue with a condition" [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [10:06:09] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 3 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10BTullis) [10:06:28] !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [10:07:02] !log elukey@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [10:11:52] (03PS1) 10Elukey: conftool-data: add discovery config for the k8s-ingress-mlserve [puppet] - 10https://gerrit.wikimedia.org/r/920215 [10:12:08] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM thanks." 
[alerts] - 10https://gerrit.wikimedia.org/r/920205 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [10:12:47] (03PS2) 10Elukey: conftool-data: add discovery config for the k8s-ingress-ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/920215 [10:13:44] !log cleaning up echo notification table in all wikis (T318523) [10:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:48] T318523: Don't send article-linked notifications for bots - https://phabricator.wikimedia.org/T318523 [10:23:01] (03PS1) 10Effie Mouzeli: Revert "php-multiversion-base: add rsvg-convert" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920228 [10:23:25] (03PS1) 10Elukey: Add VIP records for the new k8s-ingress-ml-serve endpoint [dns] - 10https://gerrit.wikimedia.org/r/920216 (https://phabricator.wikimedia.org/T336726) [10:26:14] (03CR) 10JMeybohm: Update charts from mesh.configuration 1.1 to 1.2 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [10:26:37] 10SRE, 10Domains: Mark Monitor administration panel - https://phabricator.wikimedia.org/T333827 (10Jacek_Broda_WMPL) a:05Jacek_Broda_WMPL→03None [10:27:14] (03PS1) 10Vgutierrez: hiera: Use HAProxy 2.7.x on cp5032 [puppet] - 10https://gerrit.wikimedia.org/r/920217 (https://phabricator.wikimedia.org/T317799) [10:28:43] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [10:29:32] (03PS2) 10Vgutierrez: hiera: Use HAProxy 2.7.x on cp5032 [puppet] - 10https://gerrit.wikimedia.org/r/920217 (https://phabricator.wikimedia.org/T317799) [10:29:46] !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter depool all active/active services in codfw: codfw row D switches upgrade - T335042 [10:29:50] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 [10:30:05] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 10 others: codfw 
row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row D switches... [10:30:17] (03CR) 10Vgutierrez: [C: 03+2] hiera: Use HAProxy 2.7.x on cp5032 [puppet] - 10https://gerrit.wikimedia.org/r/920217 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [10:32:43] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:32:46] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:32:51] (03CR) 10JMeybohm: [C: 03+2] Update charts from mesh.configuration 1.2.0 to 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919848 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [10:33:14] !log elukey@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new VIP records for k8s-ingress-ml-serve - elukey@cumin1001" [10:33:45] !log testing HAProxy 2.7.8 in cp4052 and cp5032 (upload) - T317799 [10:33:48] ^^ cdanis [10:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:49] T317799: Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799 [10:33:56] (03Merged) 10jenkins-bot: Update charts from mesh.configuration 1.2.0 to 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919848 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [10:34:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new VIP records for k8s-ingress-ml-serve - elukey@cumin1001" [10:34:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:34:52] !log akosiaris@cumin1001 START - Cookbook 
sre.hosts.downtime for 0:30:00 on mc-wf[2001-2002].codfw.wmnet,mc-wf[1001-1002].eqiad.wmnet with reason: kernel upgrade [10:35:07] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on mc-wf[2001-2002].codfw.wmnet,mc-wf[1001-1002].eqiad.wmnet with reason: kernel upgrade [10:35:09] (03PS1) 10Elukey: service::catalog: add initial config for k8s-ingress-ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/920218 (https://phabricator.wikimedia.org/T336726) [10:35:24] (03CR) 10Hnowlan: [C: 03+2] admin_ng, thumbor: double memory limit for namespace and pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/919808 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [10:36:38] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:36:41] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:37:55] (03Merged) 10jenkins-bot: admin_ng, thumbor: double memory limit for namespace and pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/919808 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [10:38:37] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [10:39:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:39:05] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [10:39:16] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [10:40:46] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [10:42:44] (03CR) 10Vgutierrez: 
trafficserver: allow partial traffic flow to mw on k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/917841 (https://phabricator.wikimedia.org/T336038) (owner: 10Giuseppe Lavagetto) [10:43:11] !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host gitlab-runner1003.eqiad.wmnet [10:43:21] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: make mw-on-k8s use a config file [puppet] - 10https://gerrit.wikimedia.org/r/917840 (https://phabricator.wikimedia.org/T336037) (owner: 10Giuseppe Lavagetto) [10:44:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:46:20] (03PS1) 10Elukey: service::catalog: switch k8s-ingress-ml-serve to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/920219 (https://phabricator.wikimedia.org/T336726) [10:46:22] (03PS1) 10Elukey: service::catalog: switch k8s-ingress-ml-serve to production [puppet] - 10https://gerrit.wikimedia.org/r/920220 (https://phabricator.wikimedia.org/T336726) [10:46:43] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row D switches... 
[10:48:13] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) depool all active/active services in codfw: codfw row D switches upgrade - T335042 [10:48:19] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 [10:48:27] !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter status all services in all: None - None [10:48:30] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [10:48:44] !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter status all services in all: None - None [10:48:46] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [10:49:43] !log hnowlan@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [10:50:01] !log hnowlan@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [10:50:03] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1010.eqiad.wmnet [10:51:16] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1012.eqiad.wmnet [10:51:30] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2008.codfw.wmnet [10:51:38] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2010.codfw.wmnet [10:52:04] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [10:52:53] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[10:54:33] (03PS1) 10Elukey: Add discovery configuration for k8s-ingress-ml-serve [dns] - 10https://gerrit.wikimedia.org/r/920221 (https://phabricator.wikimedia.org/T336726) [10:54:35] (03PS1) 10Elukey: Add ores-legacy.discovery.wment configuration [dns] - 10https://gerrit.wikimedia.org/r/920222 (https://phabricator.wikimedia.org/T336726) [10:55:22] (03CR) 10CI reject: [V: 04-1] Add discovery configuration for k8s-ingress-ml-serve [dns] - 10https://gerrit.wikimedia.org/r/920221 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [10:55:27] (03CR) 10CI reject: [V: 04-1] Add ores-legacy.discovery.wment configuration [dns] - 10https://gerrit.wikimedia.org/r/920222 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [10:56:03] (03PS2) 10Elukey: Add ores-legacy.discovery.wment configuration [dns] - 10https://gerrit.wikimedia.org/r/920222 (https://phabricator.wikimedia.org/T336726) [10:56:53] (03CR) 10CI reject: [V: 04-1] Add ores-legacy.discovery.wment configuration [dns] - 10https://gerrit.wikimedia.org/r/920222 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [10:58:13] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1010.eqiad.wmnet [10:58:36] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2008.codfw.wmnet [10:59:09] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1012.eqiad.wmnet [11:00:03] !log updated bookworm image to RC3 T330495 [11:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:17] T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 [11:00:17] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2010.codfw.wmnet [11:01:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host testvm2004.codfw.wmnet with OS bookworm [11:02:48] (03PS2) 10Slyngshede: 
signup:blocklist Expand blocklist feature [software/bitu] - 10https://gerrit.wikimedia.org/r/919005 [11:03:33] (03PS1) 10Volans: dhcp: cleanup the snippet on refresh failure [software/spicerack] - 10https://gerrit.wikimedia.org/r/920224 (https://phabricator.wikimedia.org/T336696) [11:03:35] (03PS1) 10Volans: dhcp: reword some exception messages [software/spicerack] - 10https://gerrit.wikimedia.org/r/920225 [11:03:37] (03CR) 10Slyngshede: signup:blocklist Expand blocklist feature (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/919005 (owner: 10Slyngshede) [11:04:45] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10BTullis) [11:05:59] Does anyone mind if I use this empty window to deploy? [11:07:50] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.43:443]) https://wikitech.wikimedia.org/wiki/PyBal [11:08:14] <_joe_> uh [11:08:49] <_joe_> that's schema [11:08:59] (it would be zotero) [11:09:03] <_joe_> is someone doing something with lvs2009? [11:09:18] not me :) [11:09:34] <_joe_> mvolz: sorry I am looking at the alert right now [11:09:39] np [11:09:51] <_joe_> vgutierrez: topranks: ^^ [11:10:10] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.43:443]) https://wikitech.wikimedia.org/wiki/PyBal [11:10:59] * topranks here [11:11:15] * vgutierrez already checking [11:11:39] <_joe_> so it looks like the problem is every backend is depooled [11:12:12] again? :) [11:12:15] <_joe_> $ curl localhost:9090/pools/schema_443 [11:12:17] <_joe_> schema2004.codfw.wmnet: disabled/up/not pooled [11:12:19] <_joe_> schema2003.codfw.wmnet: disabled/up/not pooled [11:12:39] <_joe_> I guess someone has done something with those servers? 
[11:13:22] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:13:37] <_joe_> I repooled 2003 [11:13:43] nothing on SAL AFAIK [11:13:44] <_joe_> now we can check what caused it [11:13:55] <_joe_> yeah we need to go look at the etcd logs I guess [11:14:43] pybal noticed at 11:03:58 [11:15:17] May 16 11:03:58 lvs2010 pybal[1983329]: [schema_443] INFO: Merged disabled server schema2004.codfw.wmnet, weight 10 [11:15:17] May 16 11:03:58 lvs2010 pybal[1983329]: [schema_443] INFO: Merged disabled server schema2003.codfw.wmnet, weight 10 [11:15:20] yep [11:15:42] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:16:24] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2007.codfw.wmnet [11:17:53] !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host testvm2004.codfw.wmnet with OS bookworm [11:18:01] <_joe_> it was run from the servers [11:18:04] <_joe_> found with [11:18:07] and etcd as well [11:18:10] May 16 11:03:58 conf2005 etcdmirror-conftool-eqiad-wmnet[5393]: [etcd-mirror] INFO: Replicating key /conftool/v1/pools/codfw/eventschemas/eventschemas/schema2004.codfw.wmnet at index 1952400 [11:18:11] May 16 11:13:08 conf2005 etcdmirror-conftool-eqiad-wmnet[5393]: [etcd-mirror] INFO: Replicating key /conftool/v1/pools/codfw/eventschemas/eventschemas/schema2003.codfw.wmnet at index 1952401 [11:18:15] _joe_: oh [11:18:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host testvm2002.codfw.wmnet with OS bookworm [11:18:20] <_joe_> sudo cumin 'conf1*' 'fgrep schema2004 /var/log/nginx/etcd_access.log | grep -v GET' [11:18:29] so who ran that? [11:18:38] <_joe_> someone or something ran "depool" [11:18:42] <_joe_> on each server [11:18:53] <_joe_> vgutierrez: can you check the cumin logs? 
I'll check the servers [11:19:02] btullis: ^^ [11:19:12] btullis logged in at 11:03 [11:19:19] on schema2004 [11:19:31] probably prep for codfw row D maint ? [11:19:49] Yes, I depooled schema2004. Is there an issue? [11:20:07] <_joe_> btullis: schema2003 was also depooled [11:20:28] Oh, sorry. I hadn't seen that. [11:20:29] <_joe_> so I repooled it at 11:13 [11:20:43] Many thanks _joe_ [11:20:45] <_joe_> ok, mystery solved anyways, we were worried some cronjob caused this [11:20:53] !log reboot rdb2007 for kernel upgrades: possibly affected apps: netbox, changeprop, cpjobqueue, api-gateway, redisLockManager. Should be harmless however [11:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:59] yep good stuff [11:21:12] <_joe_> but schema was unavailable for 10 minutes in codfw [11:21:49] <_joe_> vgutierrez: topranks did you get paged? [11:21:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 13 hosts with reason: maintenance [11:21:53] nope [11:22:02] (03CR) 10Klausman: [C: 03+1] service::catalog: add initial config for k8s-ingress-ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/920218 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [11:22:04] <_joe_> if not, we might want to add a paging probe on lvs for that service [11:22:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 13 hosts with reason: maintenance [11:22:21] Oh that's me. I forgot to repool schema2003 after row C upgrade. 
https://phabricator.wikimedia.org/T334049#8819429 [11:22:41] (03CR) 10Klausman: [C: 03+1] Add VIP records for the new k8s-ingress-ml-serve endpoint (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/920216 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [11:22:45] _joe_: didn’t get paged [11:23:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 11 hosts with reason: maintenance [11:23:09] <_joe_> btullis: I guess this is an actionable for you then [11:23:23] _joe_: Yes, I agree, that should page. I will add it. [11:23:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 11 hosts with reason: maintenance [11:23:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 14 hosts with reason: maintenance [11:23:39] <_joe_> also - depool_threshold is clearly too low for schema [11:23:46] <_joe_> pybal should protect against this [11:23:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 14 hosts with reason: maintenance [11:24:04] +1 this should probably go to victorops [11:24:34] <_joe_> depool-threshold = .5 this is too low for a service with 2 servers [11:24:46] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2007.codfw.wmnet [11:24:47] <_joe_> because it's computed before removing the server, not after [11:25:14] <_joe_> I guess this is a two-line patch to service::catalog [11:26:02] Yes. Would `depool_threshold: ".6"` be OK to avoid this happening? 
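(Editorial note, not part of the channel log.) The depool-threshold arithmetic discussed above can be sketched roughly as follows. This is a toy model and not PyBal's actual code: the function names `can_depool` and `servers_left` are hypothetical, and the only rule modelled is the one stated in the conversation, namely that the threshold is checked against the pool size before a server is removed, not after.

```python
def can_depool(pooled: int, total: int, threshold: float) -> bool:
    """Hypothetical check: may one more server be removed from the pool?

    Assumption (taken from the discussion, not from PyBal's source): the
    comparison uses the number of pooled servers *before* the removal.
    """
    return pooled >= total * threshold


def servers_left(total: int, threshold: float) -> int:
    """Remove servers one at a time for as long as the check allows it."""
    pooled = total
    while pooled > 0 and can_depool(pooled, total, threshold):
        pooled -= 1
    return pooled


# Two backends, threshold .5: the second removal is still allowed because
# 1 >= 2 * 0.5, so the pool can end up empty.
print(servers_left(2, 0.5))

# Two backends, threshold .6: the second removal is refused because
# 1 < 2 * 0.6, so one backend stays pooled.
print(servers_left(2, 0.6))
```

Under this model, the proposed `.6` keeps one of the two schema backends pooled, whereas `.5` lets both be removed.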
[11:26:14] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:29:49] (03PS1) 10Effie Mouzeli: Revert "php-multiversion-base: update readme" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920229 [11:30:04] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] Revert "php-multiversion-base: update readme" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920229 (owner: 10Effie Mouzeli) [11:30:20] (03PS2) 10Effie Mouzeli: Revert "php-multiversion-base: add rsvg-convert" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920228 [11:30:31] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] Revert "php-multiversion-base: add rsvg-convert" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920228 (owner: 10Effie Mouzeli) [11:30:42] (03PS1) 10Giuseppe Lavagetto: service::catalog: followup to schema incident [puppet] - 10https://gerrit.wikimedia.org/r/920248 [11:30:48] <_joe_> btullis: ^^ [11:30:58] !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host testvm2002.codfw.wmnet with OS bookworm [11:32:20] _joe_: Great, many thanks. [11:32:39] (03CR) 10Btullis: [C: 03+1] "Many thanks for this change." [puppet] - 10https://gerrit.wikimedia.org/r/920248 (owner: 10Giuseppe Lavagetto) [11:34:35] (03PS1) 10KartikMistry: Updated MinT to 2023-05-16-112045-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920250 (https://phabricator.wikimedia.org/T336525) [11:36:31] <_joe_> vgutierrez: fancy a pybal restart cycle? 
[11:36:33] <_joe_> :P [11:37:28] (03PS1) 10KartikMistry: Update cxserver to 2023-05-16-061239-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920251 (https://phabricator.wikimedia.org/T336657) [11:38:24] <_joe_> mvolz: sorry, back to you - deploy whenever you want [11:38:35] ty! [11:38:39] <_joe_> sorry for the delay but we had an ongoing outage [11:38:41] np [11:38:47] it's not my window anyway :) [11:39:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] service::catalog: followup to schema incident [puppet] - 10https://gerrit.wikimedia.org/r/920248 (owner: 10Giuseppe Lavagetto) [11:39:16] (03CR) 10Mvolz: [C: 03+2] Update Zotero to most recent version [deployment-charts] - 10https://gerrit.wikimedia.org/r/920202 (https://phabricator.wikimedia.org/T336727) (owner: 10Mvolz) [11:40:11] (03Merged) 10jenkins-bot: Update Zotero to most recent version [deployment-charts] - 10https://gerrit.wikimedia.org/r/920202 (https://phabricator.wikimedia.org/T336727) (owner: 10Mvolz) [11:43:07] * kart_ updating MinT and cxserver [11:43:34] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-05-16-061239-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920251 (https://phabricator.wikimedia.org/T336657) (owner: 10KartikMistry) [11:43:50] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:44:34] (03Merged) 10jenkins-bot: Update cxserver to 2023-05-16-061239-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920251 (https://phabricator.wikimedia.org/T336657) (owner: 10KartikMistry) [11:44:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/wmfdb] - 10https://gerrit.wikimedia.org/r/920214 (https://phabricator.wikimedia.org/T334455) (owner: 10Ladsgroup) [11:44:54] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:45:36] !log joal@deploy1002 Started deploy [analytics/refinery@2a0b1f2]: Regular analytics weekly train [analytics/refinery@2a0b1f2] [11:46:01] !log 
kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:46:20] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:47:22] !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-codfw [11:47:40] <_joe_> jouncebot: now [11:47:40] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [11:47:47] (03CR) 10Hashar: [C: 03+1] "I'd love for Puppet to manage that list for us. modules/profile/manifests/ssh/client.pp as some magic Puppet DB query." [puppet] - 10https://gerrit.wikimedia.org/r/919405 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [11:49:19] akosiaris: There are some unapplied changes in cxserver - is that safe to deploy? [11:49:29] Probably also on MinT. [11:49:36] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:49:40] <_joe_> kart_: what changes? if it's envoy-related, it's ok [11:50:04] <_joe_> let me check [11:50:08] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:50:15] _joe_: looks like that only. [11:50:28] but, can you please check? [11:50:33] (03CR) 10Jaime Nuche: doc: temporary config for docs publishing from gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [11:50:45] !log install 10.4.29 on db1151 T336462 [11:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:49] T336462: Compile and package MariaDB 10.4.29 - https://phabricator.wikimedia.org/T336462 [11:50:50] <_joe_> jayme: I think it's your changes in configuration to envoy [11:51:10] !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-codfw [11:51:26] <_joe_> kart_: it should be ok to apply, go on [11:52:08] T300324. Yes. Thanks. 
[11:52:09] T300324: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 [11:52:24] <_joe_> I'll restart the eqiad pybals after lunch [11:52:28] (03PS1) 10Effie Mouzeli: php-multiversion-base: add librsvg2-bin [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920257 (https://phabricator.wikimedia.org/T336025) [11:52:41] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:53:16] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:55:16] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [11:55:28] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:55:50] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [11:56:20] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:56:21] !log joal@deploy1002 Finished deploy [analytics/refinery@2a0b1f2]: Regular analytics weekly train [analytics/refinery@2a0b1f2] (duration: 10m 45s) [11:57:04] !log stage upgrade on asw-d-codfw - T335042 [11:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:08] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 [11:58:16] (03PS1) 10Majavah: Add an option to disable NFS access [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/920259 (https://phabricator.wikimedia.org/T334081) [11:58:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/919273 (owner: 10Slyngshede) [11:59:09] !log Updated cxserver to 2023-05-16-061239-production (T336657) [11:59:12] (03CR) 10Elukey: Add VIP records for the new k8s-ingress-ml-serve endpoint (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/920216 (https://phabricator.wikimedia.org/T336726) (owner: 10Elukey) [11:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:13] 
T336657: Enable MinT for Central Bikol in Content Translation - https://phabricator.wikimedia.org/T336657 [12:01:22] (03CR) 10KartikMistry: [C: 03+2] Updated MinT to 2023-05-16-112045-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920250 (https://phabricator.wikimedia.org/T336525) (owner: 10KartikMistry) [12:02:02] (03Merged) 10jenkins-bot: Updated MinT to 2023-05-16-112045-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920250 (https://phabricator.wikimedia.org/T336525) (owner: 10KartikMistry) [12:02:51] 10SRE-Access-Requests, 10Data-Engineering, 10Event-Platform Value Stream: Allow gmodena and tchin to merge changes to operation/deployment-charts repo - https://phabricator.wikimedia.org/T336755 (10Ottomata) [12:02:59] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [12:03:45] 10SRE-Access-Requests, 10Data-Engineering, 10Event-Platform Value Stream: Allow gmodena and tchin to merge changes to operation/deployment-charts repo - https://phabricator.wikimedia.org/T336755 (10Ottomata) [12:04:40] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [12:06:08] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [12:09:08] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [12:14:43] (03CR) 10Ssingh: [C: 03+2] hiera: temporarily remove dns2002 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/919847 (https://phabricator.wikimedia.org/T335042) (owner: 10Ssingh) [12:15:01] (NodeTextfileStale) firing: (2) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:15:10] !log joal@deploy1002 Started deploy [analytics/refinery@2a0b1f2] (thin): Regular 
analytics weekly train THIN [analytics/refinery@2a0b1f2] [12:15:21] !log joal@deploy1002 Finished deploy [analytics/refinery@2a0b1f2] (thin): Regular analytics weekly train THIN [analytics/refinery@2a0b1f2] (duration: 00m 10s) [12:15:22] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [12:17:43] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10ssingh) [12:18:15] (03PS5) 10Klausman: helmfile.d: add revertrisk model config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) [12:19:04] (03PS1) 10Ssingh: depool codfw for row D switch upgrade [dns] - 10https://gerrit.wikimedia.org/r/920265 (https://phabricator.wikimedia.org/T335042) [12:19:41] 10SRE, 10Infrastructure-Foundations, 10netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10ayounsi) > +1 extending the lifetime is just delaying the issue and increasing the possibility its forgotten or missed Yes and no. It depends on how much we can automate it with... 
[12:19:54] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:20:08] ^ expected [12:20:19] <_joe_> jouncebot: now [12:20:19] No deployments scheduled for the next 0 hour(s) and 39 minute(s) [12:20:39] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [12:21:18] !Updated MinT to 2023-05-16-112045-production (T336525, T336649, T336483, T336349) [12:21:18] T336649: English not listed as target langauge in the UI - https://phabricator.wikimedia.org/T336649 [12:21:19] T336525: Review code mappings for MinT - https://phabricator.wikimedia.org/T336525 [12:21:19] T336349: Replace MinT dropdowns with ULS - https://phabricator.wikimedia.org/T336349 [12:21:19] T336483: Long sequence of a repeated word appears only when using MinT but not NLLB-200 directly - https://phabricator.wikimedia.org/T336483 [12:21:27] !log disable ping offload in codfw - T335042 [12:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:31] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 [12:21:44] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:21:47] <_joe_> kart_: this is not a great time to deploy your code [12:21:50] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:22:18] (03CR) 10Ssingh: [C: 03+2] depool codfw for row D switch upgrade [dns] - 10https://gerrit.wikimedia.org/r/920265 (https://phabricator.wikimedia.org/T335042) (owner: 10Ssingh) [12:22:24] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 3 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:22:32] PROBLEM - Bird Internet Routing Daemon on durum2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:22:34] PROBLEM - Bird Internet Routing Daemon on dns2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:22:37] ^ expected [12:22:42] (03CR) 10Ayounsi: [C: 03+1] depool codfw for row D switch upgrade [dns] - 10https://gerrit.wikimedia.org/r/920265 (https://phabricator.wikimedia.org/T335042) (owner: 10Ssingh) [12:22:51] !log running authdns-update to disable codfw for switch upgrade: T335042 [12:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:31] !log [done] running authdns-update to disable codfw for switch upgrade: T335042 [12:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:24] !log joal@deploy1002 Started deploy [analytics/refinery@2a0b1f2] (thin): Regular analytics weekly train THIN [analytics/refinery@2a0b1f2] [12:27:29] !log joal@deploy1002 Finished deploy [analytics/refinery@2a0b1f2] (thin): Regular analytics weekly train THIN [analytics/refinery@2a0b1f2] (duration: 00m 04s) [12:28:07] !log joal@deploy1002 Started deploy [analytics/refinery@2a0b1f2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2a0b1f2] [12:29:37] !log joal@deploy1002 Finished deploy [analytics/refinery@2a0b1f2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2a0b1f2] (duration: 01m 30s) [12:31:54] btullis: I need you again :S We have not documented the solution to overcome the git issue we're having when deploying onto HDFS - can you tell me the trick again (I forgot :S) [12:35:00] !log start cadvisor 0.44 upgrade to buster hosts - T336740 [12:35:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:04] T336740: Upgrade cadvisor to 0.44 fleetwide - https://phabricator.wikimedia.org/T336740 [12:36:12] woop wrong chan :S [12:36:53] (03PS1) 10Gmodena: mediawiki-page-content-change-enrichment: enable HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) [12:37:33] _joe_: Did I miss anything? [12:39:04] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1009.eqiad.wmnet [12:39:20] !log reboot rdb1009 for kernel upgrades: possibly affected apps: netbox, changeprop, cpjobqueue, api-gateway, redisLockManager. Should be harmless however [12:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:56] <_joe_> kart_: sorry, I thought you were *about* to deploy your stuff and there is a maintenance going on in one of the datacenters [12:44:09] _joe_: I was done with it :) [12:44:37] <_joe_> kart_: yeah I realized in the meantime :) [12:44:53] <_joe_> I was trying to save you from a possible deployment failure [12:44:56] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs01_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:16] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1009.eqiad.wmnet [12:46:54] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 189 hosts with reason: codfw row D upgrade [12:47:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 236.7k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [12:47:10] PROBLEM - Check unit status of 
netbox_ganeti_drmrs01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:48:35] kart_: oh, sorry. I only deployed staging before lunch - did not anticipate someone deploying cxserver the next hour [12:48:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 189 hosts with reason: codfw row D upgrade [12:49:11] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3a841f97-aecd-4c7a-8eb4-8acd1caa15b3) set by ayounsi@cumin1001 for 2:00:00 on 189 host(... [12:50:06] !log disabling Puppet in codfw/esams/ulsfo for switch maintenance T335042 [12:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:11] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 [12:51:27] !log depool thanos-fe2003 T335042 [12:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:13] !log depool ms-fe2012 T335042 [12:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:17] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10MatthewVernon) [12:54:09] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10MoritzMuehlenhoff) [12:55:42] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:42] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1300) [13:00:04] mazevedo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1300) [13:00:17] hi! [13:00:56] fyi, we're going to start a 20/30min maintenance, please hold any deployment if possible [13:01:16] !log asw-d-codfw> request system reboot all-members - T335042 [13:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:21] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 [13:02:47] I can deploy once XioNoX is done with the network maintenance [13:03:05] XioNoX: please add these to the deployment calendar the next time [13:03:30] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:03:30] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 462.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:04:33] ok [13:04:36] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:05:41] taavi: last time I looked I didn't understand how it worked, and didn't go far enough in the future to add them [13:06:03] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[13:06:10] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:06:22] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 135, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:06:30] (Emergency syslog message) firing: Alert for device asw-d-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [13:06:30] (virtual-chassis crash) firing: Alert for device asw-d-codfw.mgmt.codfw.wmnet - virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [13:06:34] PROBLEM - VRRP status on cr1-codfw is CRITICAL: VRRP CRITICAL - 2 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [13:06:44] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 25.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:07:08] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [13:08:16] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable 
[13:08:16] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:08:21] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:08:33] (JobUnavailable) firing: (5) Reduced availability for job chartmuseum in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:08:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [13:08:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [13:08:53] (03PS1) 10Daniel Kinzler: Revert "Revert "Add getMultiHttpClient function to make HTTP requests to Mathoid."" [extensions/Math] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920230 (https://phabricator.wikimedia.org/T335347) [13:09:14] PROBLEM - Check 
systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:09:23] (03PS1) 10Daniel Kinzler: Use MultiHttpClient instead of VirtualRESTService. [extensions/Math] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920231 (https://phabricator.wikimedia.org/T335347) [13:11:03] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:11:30] (Emergency syslog message) resolved: Device asw-d-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [13:12:00] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is CRITICAL: Test Suggest target section titles for given source sections returned the unexpected status 503 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia [13:12:00] i/CX [13:12:08] PROBLEM - configured eth on lvs2011 is CRITICAL: vlan2020 reporting no carrier. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:12:52] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.84 ms [13:13:16] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:13:32] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:13:32] (JobUnavailable) firing: (6) Reduced availability for job chartmuseum in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:13:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:13:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [13:13:54] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning 
[13:14:16] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:14:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:14:30] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:14:38] RECOVERY - VRRP status on cr1-codfw is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [13:15:08] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [13:16:03] (ProbeDown) resolved: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:16:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:17:36] (03PS2) 10Majavah: Add stream config for mobile apps schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919372 (https://phabricator.wikimedia.org/T336508) (owner: 10Mazevedo) [13:17:43] (03PS3) 10Majavah: Add stream config for mobile apps schema 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/919372 (https://phabricator.wikimedia.org/T336508) (owner: 10Mazevedo) [13:18:16] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:18:16] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:18:21] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:18:32] (JobUnavailable) resolved: (6) Reduced availability for job chartmuseum in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:19:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:20:10] XioNoX: can we go ahead with the deployment window or are things 
still recovering? [13:21:30] (virtual-chassis crash) resolved: Device asw-d-codfw.mgmt.codfw.wmnet recovered from virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [13:21:35] taavi: yep, everything is good now! [13:21:40] thanks! [13:21:48] thanks for waiting [13:21:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:21:54] (PS1) Ssingh: Revert "hiera: temporarily remove dns2002 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/920232 [13:22:09] mazevedo: deploying your patch now [13:22:13] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/919372 (https://phabricator.wikimedia.org/T336508) (owner: Mazevedo) [13:22:23] awesome!
let me know when to test [13:22:58] (Merged) jenkins-bot: Add stream config for mobile apps schema [mediawiki-config] - https://gerrit.wikimedia.org/r/919372 (https://phabricator.wikimedia.org/T336508) (owner: Mazevedo) [13:23:16] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:23:29] !log taavi@deploy1002 Started scap: Backport for [[gerrit:919372|Add stream config for mobile apps schema (T336508)]] [13:23:34] T336508: Add MobileWikiAppiOSNavigationEvents to MEP - https://phabricator.wikimedia.org/T336508 [13:23:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:24:07] (CR) AOkoth: [C: +1] vrts1001: Switch to insetup::serviceops_collab [puppet] - https://gerrit.wikimedia.org/r/919856 (owner: Alexandros Kosiaris) [13:24:38] (PS1) Ssingh: Revert "depool codfw for row D switch upgrade" [dns] - https://gerrit.wikimedia.org/r/920233 [13:24:51] SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi) [13:25:23] !log enabled Puppet in codfw/esams/ulsfo for switch maintenance T335042 [13:25:24] RECOVERY - Bird Internet Routing Daemon on durum2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:25:24] RECOVERY - Bird Internet Routing Daemon on dns2002 is OK: PROCS OK: 1 process with command name bird
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:27] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 [13:25:29] !log taavi@deploy1002 mazevedo and taavi: Backport for [[gerrit:919372|Add stream config for mobile apps schema (T336508)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:25:30] RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:25:35] mazevedo: please test! [13:25:49] SRE, Infrastructure-Foundations, netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (ayounsi) [13:25:55] SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi) Open→Resolved a:ayounsi All stacks have been upgraded. Hopefully for the last time! [13:26:03] SRE, Observability-Metrics, serviceops, User-fgiunchedi: Upgrade cadvisor to 0.44 fleetwide - https://phabricator.wikimedia.org/T336740 (fgiunchedi) Open→Resolved a:fgiunchedi This is completed, we're running cadvisor `0.44.0+ds1-1~wmf1` on buster and bullseye [13:26:05] SRE, Observability-Metrics, Patch-For-Review, User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (fgiunchedi) [13:26:11] taavi it's working, thanks!
[13:26:14] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:26:19] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ldap-replica2006.wikimedia.org [13:26:28] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:26:46] ok, syncing [13:28:03] SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff) [13:28:30] SRE, User-MoritzMuehlenhoff: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (ayounsi) [13:28:48] (CR) JMeybohm: [C: +2] Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm) [13:28:50] SRE, Infrastructure-Foundations, netops: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (ayounsi) Stalled→Resolved a:ayounsi Done with all the sub-tasks upgrades.
[13:29:08] (03PS6) 10JMeybohm: envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - 10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus) [13:29:39] (03CR) 10Ssingh: [C: 03+2] Revert "depool codfw for row D switch upgrade" [dns] - 10https://gerrit.wikimedia.org/r/920233 (owner: 10Ssingh) [13:30:22] !log running authdns-update to repool codfw [13:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:36] !log repool thanos-fe2003 T335042 [13:32:38] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:919372|Add stream config for mobile apps schema (T336508)]] (duration: 09m 08s) [13:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:40] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 [13:32:44] T336508: Add MobileWikiAppiOSNavigationEvents to MEP - https://phabricator.wikimedia.org/T336508 [13:32:45] mazevedo: all done [13:32:49] (03CR) 10Ssingh: [C: 03+2] Revert "hiera: temporarily remove dns2002 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/920232 (owner: 10Ssingh) [13:33:11] !log mvernon@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe2003.codfwm.wmnet,service=thanos-web [13:33:33] !log mvernon@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe2003.codfw.wmnet,service=thanos-web [13:34:16] (03PS3) 10Ssingh: pybal/lvs: remove backward compatibility for buster [puppet] - 10https://gerrit.wikimedia.org/r/910566 (https://phabricator.wikimedia.org/T321309) [13:34:57] (03Merged) 10jenkins-bot: Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [13:37:02] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 618.95 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:37:03] (03CR) 10D3r1ck01: "I thought Subbu already created a revert: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Math/+/919309?" [extensions/Math] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920230 (https://phabricator.wikimedia.org/T335347) (owner: 10Daniel Kinzler) [13:37:48] (03PS2) 10JMeybohm: envoy: Move upstream HTTP config into the new HttpProtocolOptions message [puppet] - 10https://gerrit.wikimedia.org/r/916498 (https://phabricator.wikimedia.org/T303230) [13:38:18] (03CR) 10JMeybohm: [C: 03+2] envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - 10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus) [13:38:55] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41201/console" [puppet] - 10https://gerrit.wikimedia.org/r/910566 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:39:07] (03CR) 10DCausse: search: Add alert based on age of titlesuggest indices (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [13:39:25] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=eventschemas,dc=codfw,name=schema2004.eqiad.wmnet [13:39:45] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=eventschemas,dc=codfw,name=schema2004.codfw.wmnet [13:42:40] RECOVERY - configured eth on lvs2011 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:44:48] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 356.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:45:29] <_joe_> vgutierrez: going to restart pybals in eqiad, FYI [13:45:36] !log 
oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-eqiad [13:46:04] (03PS1) 10Volans: users: change my own SSH key to test ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920279 [13:46:05] !log repool ms-fe2012 T335042 [13:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:10] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 [13:46:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1001.eqiad.wmnet with OS bullseye [13:46:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with... [13:46:42] (03CR) 10Ayounsi: [C: 03+1] users: change my own SSH key to test ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920279 (owner: 10Volans) [13:47:10] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [13:48:40] (03CR) 10Muehlenhoff: sre.ganeti.makevm call reimage after VM creation (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [13:48:44] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:49:32] !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-eqiad [13:51:20] (03PS1) 10Ssingh: Revert "Revert "dns2005: add Puppet role and DNS/NTP configs"" [puppet] - 10https://gerrit.wikimedia.org/r/920234 [13:52:55] (03CR) 10Ssingh: [C: 03+2] Revert "Revert "dns2005: 
add Puppet role and DNS/NTP configs"" [puppet] - 10https://gerrit.wikimedia.org/r/920234 (owner: 10Ssingh) [13:53:39] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] php-multiversion-base: add librsvg2-bin [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920257 (https://phabricator.wikimedia.org/T336025) (owner: 10Effie Mouzeli) [13:53:45] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2005.wikimedia.org with OS bullseye [13:53:59] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns2005.wikimedia.org with OS bullseye [13:54:21] !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: codfw row D switches upgrade done - T335042 [13:54:24] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 [13:54:36] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row D switches upg... 
[13:55:17] (03PS5) 10Herron: role::webperf::profiling_tools: add redis instance for arclamp [puppet] - 10https://gerrit.wikimedia.org/r/919164 (https://phabricator.wikimedia.org/T327277) [13:57:03] (03PS1) 10Btullis: Add the refinery-cache directory to the git safe list [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) [13:57:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 204.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [13:57:22] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [13:58:35] (03CR) 10Btullis: "The latest errors are mentioned here: https://phabricator.wikimedia.org/T334493#8855435" [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [13:58:52] (03PS2) 10Volans: users: change my own SSH key to test ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920279 [13:58:54] (03PS1) 10Volans: login: use the key type speficied in the config [homer/public] - 10https://gerrit.wikimedia.org/r/920281 [13:58:56] (03PS1) 10AikoChou: changeprop: add liftwing outlink topic stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/920282 (https://phabricator.wikimedia.org/T328899) [13:59:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [13:59:32] (03PS1) 10Gergő Tisza: [beta] GrowthExperiments: Use ActionApiImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920283 (https://phabricator.wikimedia.org/T335641) [13:59:34] (03CR) 
10Alexandros Kosiaris: [C: 03+2] vrts1001: Switch to insetup::serviceops_collab [puppet] - 10https://gerrit.wikimedia.org/r/919856 (owner: 10Alexandros Kosiaris) [13:59:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/919856 (owner: 10Alexandros Kosiaris) [13:59:42] (03CR) 10Ayounsi: [C: 03+1] login: use the key type speficied in the config [homer/public] - 10https://gerrit.wikimedia.org/r/920281 (owner: 10Volans) [14:00:44] (03CR) 10Volans: [C: 03+2] login: use the key type speficied in the config [homer/public] - 10https://gerrit.wikimedia.org/r/920281 (owner: 10Volans) [14:00:49] (03CR) 10Volans: [C: 03+2] users: change my own SSH key to test ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920279 (owner: 10Volans) [14:00:53] (03CR) 10Hashar: [C: 03+1] gerrit: remove gerrit1001 as a source host for migrations [puppet] - 10https://gerrit.wikimedia.org/r/919400 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [14:01:19] (03Merged) 10jenkins-bot: login: use the key type speficied in the config [homer/public] - 10https://gerrit.wikimedia.org/r/920281 (owner: 10Volans) [14:01:22] (03Merged) 10jenkins-bot: users: change my own SSH key to test ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920279 (owner: 10Volans) [14:02:20] (03PS2) 10David Caro: Revert "Revert "toolforge_cli: add api gateway url and builds endpoint"" [puppet] - 10https://gerrit.wikimedia.org/r/918544 [14:02:22] (03CR) 10Btullis: "PCC doesn't like it :-(" [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [14:02:35] (03PS3) 10David Caro: toolforge_cli: add api gateway url and builds endpoint [puppet] - 10https://gerrit.wikimedia.org/r/918544 [14:05:40] (03CR) 10Muehlenhoff: [C: 03+1] Add the refinery-cache directory to the git safe list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: 
10Btullis) [14:05:44] (03PS2) 10Btullis: Add the refinery-cache directory to the git safe list [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) [14:05:48] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:06] (03CR) 10CI reject: [V: 04-1] Add the refinery-cache directory to the git safe list [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [14:06:20] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2005.wikimedia.org with reason: host reimage [14:06:51] (03CR) 10Daniel Kinzler: Revert "Revert "Add getMultiHttpClient function to make HTTP requests to Mathoid."" (031 comment) [extensions/Math] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920230 (https://phabricator.wikimedia.org/T335347) (owner: 10Daniel Kinzler) [14:07:28] (03PS1) 10Effie Mouzeli: minor fix [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920290 [14:08:32] (03PS3) 10Btullis: Add the refinery-cache directory to the git safe list [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) [14:08:55] (03CR) 10D3r1ck01: Revert "Revert "Add getMultiHttpClient function to make HTTP requests to Mathoid."" (031 comment) [extensions/Math] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920230 
(https://phabricator.wikimedia.org/T335347) (owner: 10Daniel Kinzler) [14:10:02] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2005.wikimedia.org with reason: host reimage [14:10:25] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row D switches upg... [14:10:45] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in codfw: codfw row D switches upgrade done - T335042 [14:10:49] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 [14:11:13] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] minor fix [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920290 (owner: 10Effie Mouzeli) [14:11:19] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [14:11:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:43] (03CR) 10Andrew Bogott: "This needs further clarification as we can no longer distinguish between the VM range and the new range that will include cloudcontrols. 
" [puppet] - 10https://gerrit.wikimedia.org/r/919292 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [14:11:59] (03PS1) 10Cathal Mooney: Add a new aggregate network for the cloud-private 'supernet' [puppet] - 10https://gerrit.wikimedia.org/r/920291 (https://phabricator.wikimedia.org/T324992) [14:14:20] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [14:15:09] (03PS2) 10Cathal Mooney: Add a new aggregate network for the cloud-private 'supernet' [puppet] - 10https://gerrit.wikimedia.org/r/920291 (https://phabricator.wikimedia.org/T324992) [14:16:44] (03PS1) 10Ssingh: config/common.yaml: update SSH key for sukhe (switch to ed25519) [homer/public] - 10https://gerrit.wikimedia.org/r/920292 (https://phabricator.wikimedia.org/T336769) [14:17:40] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@0c82f2d] (releasing): (no justification provided) [14:18:26] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@0c82f2d] (releasing): (no justification provided) (duration: 00m 45s) [14:18:57] (03CR) 10SBassett: [V: 03+1] "Seems right to me, though I'm, at best, a puppet novice." [puppet] - 10https://gerrit.wikimedia.org/r/920194 (https://phabricator.wikimedia.org/T305114) (owner: 10Marostegui) [14:19:54] (03CR) 10Btullis: [C: 03+2] "Merging based on previous vote." 
[puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [14:19:59] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 51.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:20:37] (03CR) 10Ayounsi: [C: 03+1] config/common.yaml: update SSH key for sukhe (switch to ed25519) [homer/public] - 10https://gerrit.wikimedia.org/r/920292 (https://phabricator.wikimedia.org/T336769) (owner: 10Ssingh) [14:20:50] (03PS3) 10Bking: rdf-streaming-updater@staging: upgrade to flink 1.16.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/911861 (https://phabricator.wikimedia.org/T334244) (owner: 10DCausse) [14:21:24] (03CR) 10Ayounsi: [C: 03+2] config/common.yaml: update SSH key for sukhe (switch to ed25519) [homer/public] - 10https://gerrit.wikimedia.org/r/920292 (https://phabricator.wikimedia.org/T336769) (owner: 10Ssingh) [14:22:01] (03Merged) 10jenkins-bot: config/common.yaml: update SSH key for sukhe (switch to ed25519) [homer/public] - 10https://gerrit.wikimedia.org/r/920292 (https://phabricator.wikimedia.org/T336769) (owner: 10Ssingh) [14:24:45] jouncebot: now [14:24:45] No deployments scheduled for the next 1 hour(s) and 35 minute(s) [14:24:49] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10herron) [14:25:00] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater@staging: upgrade to flink 1.16.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/911861 (https://phabricator.wikimedia.org/T334244) (owner: 10DCausse) [14:25:24] 10SRE, 10WMF-General-or-Unknown, 10NewFunctionality-Worktype, 10SecTeam-Processed: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860 (10sbassett) 05Open→03Declined I think the current incarnation of the #security-team would... 
[14:25:45] (03Merged) 10jenkins-bot: rdf-streaming-updater@staging: upgrade to flink 1.16.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/911861 (https://phabricator.wikimedia.org/T334244) (owner: 10DCausse) [14:26:38] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10ayounsi) 05Open→03Resolved a:03ayounsi Upgrade went very well. Thanks everybody! That was the last one! [14:26:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2005.wikimedia.org with OS bullseye [14:26:49] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [14:26:50] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [14:26:51] !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [14:26:56] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:27:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns2005.wikimedia.org with OS bullseye completed: - dns2005 (**PASS**)... 
[14:27:13] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:27:36] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:29:23] 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 (10Albertoleoncio) [14:30:11] (03PS15) 10Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) [14:30:18] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [14:30:26] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [14:31:29] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:31:47] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:32:07] (03CR) 10Herron: [C: 03+2] role::webperf::profiling_tools: add redis instance for arclamp [puppet] - 10https://gerrit.wikimedia.org/r/919164 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [14:32:24] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:32:28] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:32:33] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:33:41] (03PS1) 10JMeybohm: users: Update my SSH key to a ed25519 one [homer/public] - 10https://gerrit.wikimedia.org/r/920295 (https://phabricator.wikimedia.org/T336769) [14:35:18] (03PS1) 10Guergana Tzatchkova: Enable wmgWikibaseTmpWbsubscribersSensibleOutput on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920296 (https://phabricator.wikimedia.org/T336760) [14:36:17] (03CR) 10Giuseppe Lavagetto: trafficserver: allow partial traffic flow to mw on k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/917841 
(https://phabricator.wikimedia.org/T336038) (owner: 10Giuseppe Lavagetto) [14:36:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: make mw-on-k8s use a config file [puppet] - 10https://gerrit.wikimedia.org/r/917840 (https://phabricator.wikimedia.org/T336037) (owner: 10Giuseppe Lavagetto) [14:36:50] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:37:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: allow partial traffic flow to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/917841 (https://phabricator.wikimedia.org/T336038) (owner: 10Giuseppe Lavagetto) [14:37:23] (03PS2) 10Giuseppe Lavagetto: trafficserver: allow partial traffic flow to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/917841 (https://phabricator.wikimedia.org/T336038) [14:38:44] <_joe_> sigh come on jenkinsss [14:39:04] (03PS2) 10Ssingh: sites.yaml: add new dns host dns2005 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/919876 (https://phabricator.wikimedia.org/T326688) [14:39:58] (03PS1) 10Herron: arclamp: switch redis server to arclamp1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920298 (https://phabricator.wikimedia.org/T327277) [14:42:03] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add new dns host dns2005 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/919876 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh) [14:42:10] (03PS1) 10Herron: arclamp: switch redis server to arclamp1001 [puppet] - 10https://gerrit.wikimedia.org/r/920299 (https://phabricator.wikimedia.org/T327277) [14:42:27] (03CR) 10Elukey: [C: 03+2] conftool-data: add discovery config for the k8s-ingress-ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/920215 (owner: 10Elukey) [14:42:36] !log "cr*-codfw*" commit "Gerrit: 919876 add new DNS host dns2005": T326688 [14:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:41] T326688: 
Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 [14:42:46] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [14:43:20] !log Restarting CI Jenkins [14:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:28] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:47:35] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:48:44] !log [done] "cr*-codfw*" commit "Gerrit: 919876 add new DNS host dns2005": T326688 [14:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:48] T326688: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 [14:49:08] !log installing libxml2 security updates on buster [14:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:31] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 574.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:51:49] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336538 (10Jhancock.wm) power cord in PSU1 was replaced and secured. alert has cleared [14:51:51] 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 (10RhinosF1) @Samwilson: assuming this is following https://wikitech.wikimedia.org/wiki/Volunteer_NDA, please get your manager to comment an... 
[14:51:54] (03PS1) 10Giuseppe Lavagetto: trafficserver: actually carry over the config file [puppet] - 10https://gerrit.wikimedia.org/r/920300 [14:52:47] (03PS1) 10Jgreen: users: change my own SSH key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920301 (https://phabricator.wikimedia.org/T336769) [14:53:29] (03PS1) 10Ssingh: hiera: add new DNS host dns2005 [puppet] - 10https://gerrit.wikimedia.org/r/920302 (https://phabricator.wikimedia.org/T326688) [14:55:02] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:55:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: actually carry over the config file [puppet] - 10https://gerrit.wikimedia.org/r/920300 (owner: 10Giuseppe Lavagetto) [14:56:20] (03CR) 10Ssingh: [C: 03+2] hiera: add new DNS host dns2005 [puppet] - 10https://gerrit.wikimedia.org/r/920302 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh) [14:57:02] (03PS1) 10Jgreen: Change my own SSH key to ed25519 [puppet] - 10https://gerrit.wikimedia.org/r/920303 [14:58:16] (03PS2) 10Jgreen: Change my own SSH key to ed25519 [puppet] - 10https://gerrit.wikimedia.org/r/920303 [14:58:32] (03CR) 10Pmiazga: [C: 03+1] rest-gateway: don't append when setting headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/917340 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan) [14:58:38] (03PS1) 10Bking: rdf-streaming-updater: use correct image path [deployment-charts] - 10https://gerrit.wikimedia.org/r/920304 (https://phabricator.wikimedia.org/T334244) [14:59:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudswift1001.eqiad.wmnet with OS bullseye [14:59:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host 
cloudswift1001.eqiad.wmnet with OS... [15:00:10] (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: use correct image path [deployment-charts] - 10https://gerrit.wikimedia.org/r/920304 (https://phabricator.wikimedia.org/T334244) (owner: 10Bking) [15:01:25] (03PS1) 10Guergana Tzatchkova: Enable wmgWikibaseTmpEnableLabelsInApiSummaries on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920306 (https://phabricator.wikimedia.org/T335099) [15:03:52] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:04:43] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:07:19] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:16:54] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:17:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:18:43] !log CI Jenkins jobs are stalled following the plugins upgrade :/ [15:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:26:14] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:26:33] !log rebalance codfw swift rings T335280 [15:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:37] T335280: Drain and then decommission ms-be20[40-43] - https://phabricator.wikimedia.org/T335280 [15:27:52] !log Restarting CI Jenkins [15:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:08] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:32:16] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:33:18] !log set routing-options static route 208.80.153.231/32 next-hop [ 208.80.153.10 208.80.153.48 208.80.153.74 ] [15:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:32] !log set routing-options static route 208.80.153.231/32 next-hop [ 208.80.153.10 208.80.153.48 208.80.153.74 ]: T326688 [15:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:35] T326688: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 [15:36:00] !log Some CI jobs started failing after an upgrade of some Jenkins plugins. 
I have upgraded a couple more and it seems to work now T336775 [15:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:05] T336775: Jenkins CI job castor-save-workspace-cache stall breaking the whole CI - https://phabricator.wikimedia.org/T336775 [15:41:44] !log joal@deploy1002 Started deploy [airflow-dags/analytics@7fa2dcd]: Regular analytics weekly train [airflow-dags@7fa2dcd] [15:41:54] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@7fa2dcd]: Regular analytics weekly train [airflow-dags@7fa2dcd] (duration: 00m 10s) [15:49:53] !log run authdns-update for CR 920314 [15:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:05] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:06:43] !log gitlab-runner2003 - installed rsync client for debugging an issue with rsync from inside containers, comparing to from outside container [16:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:38] is it me or is wikibugs not active [16:14:26] I feel like it's been flaky the last few days [16:15:00] I have a mental habit of using it to keep track of the order of changes [16:15:01] (NodeTextfileStale) firing: (2) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:24:48] would someone with the right permissions please restart wikibugs? 
thank you :) [16:30:01] (NodeTextfileStale) firing: (3) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:30:01] !log restarting wikibugs ( https://www.mediawiki.org/wiki/Wikibugs#Help ) [16:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:13] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:30:18] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: be more specific about password ACL [puppet] - 10https://gerrit.wikimedia.org/r/920325 (https://phabricator.wikimedia.org/T336723) [16:30:23] sukhe: ^^ [16:30:55] thank you, I couldn't login to toolforge for some reason [16:31:03] I did know about the link [16:31:33] couple of old pods are taking long time to be terminated [16:31:39] keeping an eye on them for now [16:32:14] (03CR) 10CI reject: [V: 04-1] openstack: keystone: be more specific about password ACL [puppet] - 10https://gerrit.wikimedia.org/r/920325 (https://phabricator.wikimedia.org/T336723) (owner: 10Arturo Borrero Gonzalez) [16:32:50] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "not interested in merging this, it was just a PoC" [puppet] - 10https://gerrit.wikimedia.org/r/920325 (https://phabricator.wikimedia.org/T336723) (owner: 10Arturo Borrero Gonzalez) [16:33:57] (03CR) 10Pmiazga: [C: 03+1] rest-gateway: don't append when setting headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917340 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan) [16:36:51] (03PS4) 10EoghanGaffney: Change doc hosts to use rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/920310 (https://phabricator.wikimedia.org/T333945) [16:37:25] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: don't append when 
setting headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917340 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan) [16:40:06] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:41:40] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:41:53] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41204/console" [puppet] - 10https://gerrit.wikimedia.org/r/920310 (https://phabricator.wikimedia.org/T333945) (owner: 10EoghanGaffney) [16:43:57] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) [16:44:07] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) Thanks @hnowlan took me a bit to find this, but I did and we adde... 
[16:44:37] (03PS3) 10Arturo Borrero Gonzalez: Add a new aggregate network for the cloud-private 'supernet' [puppet] - 10https://gerrit.wikimedia.org/r/920291 (https://phabricator.wikimedia.org/T324992) (owner: 10Cathal Mooney) [16:50:33] (03PS1) 10Arturo Borrero Gonzalez: keystone: service: allow cloud-private supernet [puppet] - 10https://gerrit.wikimedia.org/r/920348 (https://phabricator.wikimedia.org/T336723) [16:53:59] (03PS1) 10Volans: sre.network.provision: bugfix and improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 (https://phabricator.wikimedia.org/T336485) [16:55:12] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet [16:56:07] (03PS1) 10Ssingh: hiera: decommission dns2002 [puppet] - 10https://gerrit.wikimedia.org/r/920350 (https://phabricator.wikimedia.org/T335777) [16:57:31] (03PS1) 10Dwisehaupt: config/common.yaml: update SSH key for dwisehaupt [homer/public] - 10https://gerrit.wikimedia.org/r/920351 (https://phabricator.wikimedia.org/T336769) [16:58:08] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove dns2002 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/920320 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [16:59:49] !log installing 5.10.179 kernels on Bullseye hosts [16:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:02] (03PS2) 10Dwisehaupt: users: update SSH key for dwisehaupt [homer/public] - 10https://gerrit.wikimedia.org/r/920351 (https://phabricator.wikimedia.org/T336769) [17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1700) [17:00:12] !log homer "cr*-codfw*" commit "Gerrit: 920320 remove to-be decommissioned host dns2002" T335777 [17:00:14] (03CR) 10Cathal Mooney: "lgtm, just one small typo I think" [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 
(https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [17:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:19] T335777: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 [17:01:08] (03PS2) 10Volans: sre.network.provision: bugfix and improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 (https://phabricator.wikimedia.org/T336485) [17:01:17] (03CR) 10Volans: "good catch, thx, addressed comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [17:02:44] (03PS2) 10Ssingh: hiera: decommission dns2002 [puppet] - 10https://gerrit.wikimedia.org/r/920350 (https://phabricator.wikimedia.org/T335777) [17:03:08] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [17:03:56] (03PS2) 10AikoChou: changeprop: add liftwing outlink topic stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/920282 (https://phabricator.wikimedia.org/T328899) [17:04:13] (03CR) 10Volans: [C: 03+2] sre.network.provision: bugfix and improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [17:05:20] (03CR) 10AikoChou: changeprop: add liftwing outlink topic stream (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920282 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [17:05:26] (03CR) 10Ssingh: [C: 03+2] hiera: decommission dns2002 [puppet] - 10https://gerrit.wikimedia.org/r/920350 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [17:06:44] (03Merged) 10jenkins-bot: sre.network.provision: bugfix and improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [17:09:46] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts 
dns2002.wikimedia.org [17:10:52] Hey jayme I was about to start a deployment for mobileapps service but I see you're working on mesh upgrade at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/919849, I see it's not deployed yet. Should I hold the image bumping for mobileapps? [17:12:13] mbsantos: thanks for reaching out! Please feel free to deploy the change with your image bump, there is no change in behaviour expected (and none seen as of now ;)) [17:12:46] thanks! [17:14:55] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [17:16:57] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [17:17:11] !log volans@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [17:17:12] !log volans@cumin1001 START - Cookbook sre.dns.netbox [17:17:13] 10SRE, 10Content-Transform-Team-WIP, 10RESTBase, 10Traffic, and 5 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (10FJoseph-WMF) I've scheduled a meeting this week for followup [17:18:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [17:18:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:18:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns2002.wikimedia.org [17:18:15] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns2002.wikimedia.org` - dns2002.wikimedia.org 
(**WARN**) - Downtime... [17:19:06] (03PS1) 10BCornwall: pybal: Switch drmrs LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/920353 (https://phabricator.wikimedia.org/T263797) [17:19:07] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - volans@cumin1001" [17:19:10] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ssingh) [17:19:36] (03PS1) 10MSantos: mobileapps: bump to 023-05-08-112354-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920354 [17:19:44] (03CR) 10CI reject: [V: 04-1] mobileapps: bump to 023-05-08-112354-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920354 (owner: 10MSantos) [17:20:07] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - volans@cumin1001" [17:20:08] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:20:56] (03PS1) 10Ssingh: hiera: remove obsolete dns2001.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/920355 (https://phabricator.wikimedia.org/T335777) [17:21:24] (03CR) 10Ssingh: [C: 03+2] hiera: remove obsolete dns2001.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/920355 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [17:21:34] (03PS2) 10MSantos: mobileapps: bump to 22023-05-08-112354-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920354 [17:21:57] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41205/console" [puppet] - 10https://gerrit.wikimedia.org/r/920353 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [17:22:58] (03CR) 10MSantos: [C: 03+2] 
mobileapps: bump to 22023-05-08-112354-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920354 (owner: 10MSantos) [17:23:53] (03Merged) 10jenkins-bot: mobileapps: bump to 22023-05-08-112354-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920354 (owner: 10MSantos) [17:24:08] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41206/console" [puppet] - 10https://gerrit.wikimedia.org/r/920353 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [17:24:14] !log volans@cumin1001 START - Cookbook sre.dns.netbox [17:24:58] (03CR) 10Ssingh: [C: 03+1] pybal: Switch drmrs LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/920353 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [17:26:20] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - volans@cumin1001" [17:27:18] !log Rolling out maglev LVS scheduler in drmrs - T263797 [17:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:21] T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 [17:27:23] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - volans@cumin1001" [17:27:23] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:27:23] !log volans@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet [17:29:13] (03CR) 10BCornwall: [V: 03+1 C: 03+2] pybal: Switch drmrs LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/920353 (https://phabricator.wikimedia.org/T263797) (owner: 
10BCornwall) [17:34:05] !log joal@deploy1002 Started deploy [airflow-dags/analytics@7816937]: Regular analytics weekly train - Hotfix [airflow-dags@7816937] [17:34:16] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@7816937]: Regular analytics weekly train - Hotfix [airflow-dags@7816937] (duration: 00m 10s) [17:37:15] !log volans@cumin1001 START - Cookbook sre.network.provision for device ssw1-a8-codfw.mgmt.codfw.wmnet [17:37:16] !log volans@cumin1001 START - Cookbook sre.dns.netbox [17:39:14] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a8-codfw - volans@cumin1001" [17:40:14] !log installing avahi security updates on buster [17:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:18] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a8-codfw - volans@cumin1001" [17:40:19] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:40:47] (03PS1) 10Bartosz Dziewoński: Add maint script to opt out active users from the new topic tool [extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920237 (https://phabricator.wikimedia.org/T317375) [17:40:54] (03PS1) 10Ssingh: dns2006: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/920357 (https://phabricator.wikimedia.org/T326688) [17:41:03] (03PS1) 10Bartosz Dziewoński: Add maint script to opt out active users from the new topic tool [extensions/DiscussionTools] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920238 (https://phabricator.wikimedia.org/T317375) [17:41:41] (03PS1) 10Ssingh: sites.yaml: add new dns host dns2006 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/920358 
(https://phabricator.wikimedia.org/T326688) [17:41:42] 10SRE, 10Traffic: Add systemd-level service bindings for Wikimedia DNS - https://phabricator.wikimedia.org/T336792 (10ssingh) a:05ssingh→03None [17:43:14] (03CR) 10Ssingh: [C: 03+2] dns2006: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/920357 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh) [17:44:24] !log volans@cumin1001 START - Cookbook sre.dns.netbox [17:45:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2006.wikimedia.org with OS bullseye [17:46:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns2006.wikimedia.org with OS bullseye [17:46:11] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a8-codfw - volans@cumin1001" [17:47:14] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a8-codfw - volans@cumin1001" [17:47:14] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:47:14] !log volans@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a8-codfw.mgmt.codfw.wmnet [17:52:38] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [17:52:54] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [17:53:21] 10SRE, 10WMF-General-or-Unknown, 10NewFunctionality-Worktype, 10SecTeam-Processed: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860 (10sbassett) >>! 
In T40860#8855780, @Dzahn wrote: > Ok, well, do you want to do anything about... [17:54:39] 10SRE, 10WMF-General-or-Unknown, 10NewFunctionality-Worktype, 10SecTeam-Processed: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860 (10Dzahn) Alright, thanks for the details! I just meant besides GPG now, just if we should stop... [17:55:19] (03Abandoned) 10Cathal Mooney: Add a new aggregate network for the cloud-private 'supernet' [puppet] - 10https://gerrit.wikimedia.org/r/920291 (https://phabricator.wikimedia.org/T324992) (owner: 10Cathal Mooney) [17:55:43] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [17:56:35] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [17:57:48] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:58:47] !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:59:05] 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 (10Dzahn) adding @scherukuwada since they just became owner of wikisource.org (all subdomains under it) in search console last week in T336500 [18:00:04] dancy and hashar: Your horoscope predicts another unfortunate MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1800). [18:01:39] !log enable puppet on A:cp-text [18:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:47] 10SRE, 10SRE-Access-Requests, 10Search-Console-access-request: Please grant scherukuwada@ access to wikisource.org in the Search Console - https://phabricator.wikimedia.org/T336500 (10Dzahn) Hi @SCherukuwada , there is a volunteer asking for access at T336255. 
Wondering if you have thoughts on that? Also at... [18:02:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2006.wikimedia.org with reason: host reimage [18:04:56] I am here to press the buttons [18:05:38] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2006.wikimedia.org with reason: host reimage [18:05:52] (03PS1) 10Ssingh: hiera: add new DNS host dns2006 [puppet] - 10https://gerrit.wikimedia.org/r/920359 (https://phabricator.wikimedia.org/T326688) [18:06:46] 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 (10Dzahn) Every subdomain is a separate site. Is this request really for ALL of wikisource or for a few languages? That changes the nature o... [18:10:05] Train is blocked due to ongoing toolforge/WMCS issues: Can't reach https://train-blockers.toolforge.org/api.php right now. [18:10:44] PROBLEM - Recursive DNS on 208.80.153.107 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [18:10:47] dancy: have you raised [18:10:57] It might just need webservice restart [18:11:01] ^ DNS message above expected [18:13:27] and we're back! [18:13:35] deployment blocked because we cant read the page with the blockers? heh [18:13:40] ok! [18:14:00] RECOVERY - Recursive DNS on 208.80.153.107 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [18:14:02] sukhe: pheew. 
thanks for mentioning that [18:14:05] mutante: now fixed, think it's used to set which train now as well [18:14:25] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920362 (https://phabricator.wikimedia.org/T330215) [18:14:27] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920362 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [18:15:20] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920362 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [18:15:30] RhinosF1: I think everything that is actually used for prod and deployment should be in production.. but yea [18:16:45] mutante: pretty sure train-blockers is a quick hack from taavi [18:16:55] It probably could be formalised more [18:20:05] Hmm.. docker-registry.wikimedia.org/php7.4-fpm-multiversion-base changed since train presync happened, so this will be a long deployment (~40 minutes) due to full image rebuild. [18:20:49] less than 40, actually.. but longer than 7. [18:21:39] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10jcrespo) 05Open→03Resolved a:05jcrespo→03Jhancock.wm Thank you for your help, this is good to go. [18:22:44] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2006.wikimedia.org with OS bullseye [18:22:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns2006.wikimedia.org with OS bullseye completed: - dns2006 (**PASS**)... [18:24:20] (03CR) 10Gehel: "A few minor style comments, mostly just to prove that I did read the code. 
I don't know enough about Ceph to have an opinion on how this c" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [18:25:22] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add new dns host dns2006 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/920358 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh) [18:25:37] (03CR) 10Ssingh: [V: 03+2 C: 03+2] sites.yaml: add new dns host dns2006 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/920358 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh) [18:28:37] !log homer "cr*-codfw*" commit "Gerrit: 920358 add new DNS host dns2006": T326688 [18:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:41] T326688: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 [18:31:19] (03CR) 10Ssingh: [C: 03+2] hiera: add new DNS host dns2006 [puppet] - 10https://gerrit.wikimedia.org/r/920359 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh) [18:34:18] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.9 refs T330215 [18:34:23] T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215 [18:34:39] (03CR) 10Andrew Bogott: [C: 03+2] "pcc results: https://puppet-compiler.wmflabs.org/output/920348/41207/" [puppet] - 10https://gerrit.wikimedia.org/r/920348 (https://phabricator.wikimedia.org/T336723) (owner: 10Arturo Borrero Gonzalez) [18:36:18] !log set routing-options static route 208.80.153.231/32 next-hop [ 208.80.153.48 208.80.153.74 208.80.153.107 ]: T326688 [18:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:24] T326688: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 [18:39:06] (03PS1) 10Ssingh: sites.yaml: remove dns2003 from anycast_neighbors (host decom) [homer/public] - 
10https://gerrit.wikimedia.org/r/920363 (https://phabricator.wikimedia.org/T335777) [18:41:03] !log volans@cumin2002 START - Cookbook sre.network.provision for device ssw1-a8-codfw.mgmt.codfw.wmnet [18:41:05] !log volans@cumin2002 START - Cookbook sre.dns.netbox [18:42:21] 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 (10Soda) >>! In T336255#8856659, @Dzahn wrote: > Every subdomain is a separate site. Is this request really for ALL of wikisource or for a f... [18:42:56] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a8-codfw - volans@cumin2002" [18:42:57] (03PS1) 10Ssingh: hiera: decommission dns2003 [puppet] - 10https://gerrit.wikimedia.org/r/920364 (https://phabricator.wikimedia.org/T335777) [18:43:56] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a8-codfw - volans@cumin2002" [18:43:56] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:43:57] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove dns2003 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/920363 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [18:44:30] (03PS2) 10Ssingh: hiera: decommission dns2003 [puppet] - 10https://gerrit.wikimedia.org/r/920364 (https://phabricator.wikimedia.org/T335777) [18:44:32] (03Merged) 10jenkins-bot: sites.yaml: remove dns2003 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/920363 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [18:46:30] !log volans@cumin2002 START - Cookbook sre.dns.netbox [18:46:42] !log homer "cr*-codfw*" commit "Gerrit: 920363 remove to-be decommissioned 
host dns2003": T335777 [18:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:46] T335777: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 [18:46:55] !log [WDQS] Pooled `wdqs2006` (not sure why was depooled) [18:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:46] !log [WDQS] Pooled `wdqs2012` [18:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:38] (03CR) 10Ssingh: [C: 03+2] hiera: decommission dns2003 [puppet] - 10https://gerrit.wikimedia.org/r/920364 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [18:49:06] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a8-codfw - volans@cumin2002" [18:50:01] (NodeTextfileStale) firing: (3) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:50:04] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a8-codfw - volans@cumin2002" [18:50:04] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:50:04] !log volans@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a8-codfw.mgmt.codfw.wmnet [18:50:06] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2021.* [18:50:14] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2022.* [18:50:41] ryankemper: see https://phabricator.wikimedia.org/T335042 for why [18:50:42] (03PS1) 10Bking: query_service: 
Permit python2 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) [18:50:59] Likely [18:51:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [18:51:41] RhinosF1: ha, doh :) I was checking the current pybal states before re-pooling stuff for that maintenance and forgot that those were the hosts for that [18:51:44] * ryankemper was the one that depooled them xD [18:51:51] (03CR) 10Herron: [C: 03+2] mwlog: rotate api.log hourly [puppet] - 10https://gerrit.wikimedia.org/r/919063 (https://phabricator.wikimedia.org/T277445) (owner: 10Herron) [18:51:55] ryankemper: heh [18:52:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns2003.wikimedia.org [18:53:58] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw [18:54:14] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw [18:57:00] (03PS1) 10Jdrewniak: Ensure mw-watchlink is used for the sticky header watchlink [skins/Vector] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920239 (https://phabricator.wikimedia.org/T336640) [18:57:05] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [18:59:37] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [19:00:30] (03PS1) 10Jdrewniak: Ensure mw-watchlink is used for the sticky header watchlink [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920240 (https://phabricator.wikimedia.org/T336640) [19:00:47] 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 
(10SCherukuwada) Please assign this to me once C-level approval and NDA have been taken care of. [19:00:47] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [19:00:47] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:00:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns2003.wikimedia.org [19:00:57] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns2003.wikimedia.org` - dns2003.wikimedia.org (**WARN**) - Downtime... [19:01:30] (03PS1) 10Jdrewniak: Consolidate watchstar icon updating logic under watchstar.js [skins/Vector] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920241 (https://phabricator.wikimedia.org/T336640) [19:01:40] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ssingh) [19:02:48] (03PS1) 10Jdrewniak: Consolidate watchstar icon updating logic under watchstar.js [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920242 (https://phabricator.wikimedia.org/T336640) [19:03:19] (03PS1) 10Volans: sre.network.provision: allow to retry polling [cookbooks] - 10https://gerrit.wikimedia.org/r/920366 (https://phabricator.wikimedia.org/T336485) [19:03:59] (03PS2) 10Bking: query_service: Permit python2 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) [19:04:19] !log dummry run of authdns-update to confirm new hosts [19:04:19] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/920366 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [19:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:27] (03PS3) 10Bking: query_service: Permit python2 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) [19:06:26] (03CR) 10Volans: [C: 03+2] sre.network.provision: allow to retry polling [cookbooks] - 10https://gerrit.wikimedia.org/r/920366 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [19:06:28] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [19:06:46] (03CR) 10Ladsgroup: [C: 03+2] Prepare for v0.1.3 release [software/wmfdb] - 10https://gerrit.wikimedia.org/r/920214 (https://phabricator.wikimedia.org/T334455) (owner: 10Ladsgroup) [19:07:01] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41208/console" [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [19:08:33] (03Merged) 10jenkins-bot: sre.network.provision: allow to retry polling [cookbooks] - 10https://gerrit.wikimedia.org/r/920366 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [19:08:35] (03Merged) 10jenkins-bot: Prepare for v0.1.3 release [software/wmfdb] - 10https://gerrit.wikimedia.org/r/920214 (https://phabricator.wikimedia.org/T334455) (owner: 10Ladsgroup) [19:08:38] (03PS1) 10Cathal Mooney: Updating ssh pubkey to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920367 (https://phabricator.wikimedia.org/T336769) [19:08:56] (03PS1) 10Herron: logrotate: update description in override [puppet] - 10https://gerrit.wikimedia.org/r/920368 [19:10:29] !log volans@cumin2002 START - Cookbook sre.network.provision for device ssw1-a8-codfw.mgmt.codfw.wmnet [19:10:31] 
!log volans@cumin2002 START - Cookbook sre.dns.netbox [19:12:25] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a8-codfw - volans@cumin2002" [19:13:30] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a8-codfw - volans@cumin2002" [19:13:30] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:14:47] (03CR) 10Bking: "@jbond Wanted to solicit your advice on this one. In the original patch set, we attempted to use hieradata/common/profile/query_service.ya" [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [19:23:12] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:14] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:42:29] (03CR) 10Cathal Mooney: [C: 03+2] Updating ssh pubkey to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920367 (https://phabricator.wikimedia.org/T336769) (owner: 10Cathal Mooney) [19:43:03] (03Merged) 10jenkins-bot: Updating ssh pubkey to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920367 (https://phabricator.wikimedia.org/T336769) (owner: 10Cathal Mooney) [19:55:00] (03CR) 10Dzahn: [C: 03+2] gerrit: add gerrit1003 SSH host key known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/919405 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [19:56:18] (03CR) 10Dzahn: [C: 03+2] "yep, agreed it 
would be nice if this was automatic but also not right now" [puppet] - 10https://gerrit.wikimedia.org/r/919405 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [19:57:10] (03CR) 10Dzahn: [C: 03+2] "key was added on 1003 and 2002 - though this will only matter if we start replicating TO this machine - if we do that in the future" [puppet] - 10https://gerrit.wikimedia.org/r/919405 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T2000). [20:00:05] MatmaRex and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:25] hi [20:00:32] o/ [20:00:49] (03CR) 10Jameel Kaisar: "Note: For Reference Only, Not to be Merged" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [20:01:12] feel free to start with jan's stuff, looks more urgent [20:03:15] Also I can self deploy [20:03:49] looks like no one else is doing it, so… ;) [20:03:59] (03CR) 10Dzahn: [C: 03+2] "here is the part that matters, nothing is changed on prod host: https://puppet-compiler.wmflabs.org/output/919359/41172/gerrit1003.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/919359 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [20:04:24] i'd appreciate if you could sync and run my maintenance script afterwards. it's a dry run, just testing that it works in production before the real deployment [20:04:55] MatmaRex: yeah no problem [20:05:32] Got a deployer? [20:05:44] (yes, seems so) [20:06:08] lemme deploy a change to stop gerrit service.. 
on the old host :) [20:07:46] mutante: Ok, let me know when I can proceed with the backport window [20:08:41] jan_drewniak: thank you, a minute.. on it [20:08:48] confirmed noop on gerrit2002.. now gerrit1003 [20:10:07] no problems on prod server [20:10:13] re-enabling puppet on old server [20:11:01] have to make sure it doesn't start gerrit service then all is done [20:12:06] confirmed: Loaded: masked (Reason: Unit gerrit.service is masked.) [20:12:17] it is now masked which is what this change was supposed to do [20:12:26] means it cant be started by accident and replicate or anything. [20:12:28] I am done [20:12:34] jan_drewniak: go ahead please. thank you for patience [20:12:49] No problem, that was quick! [20:13:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920240 (https://phabricator.wikimedia.org/T336640) (owner: 10Jdrewniak) [20:16:35] (03CR) 10Dzahn: [C: 03+2] "confirmed this all works as intended. 
on gerrit1001 the gerrit service is now masked and on gerrit1003 and gerrit2002 there was no change " [puppet] - 10https://gerrit.wikimedia.org/r/919359 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [20:16:41] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (10taavi) [20:16:56] (03CR) 10Dzahn: [C: 03+2] "@hashar masked, not just stopped, as you asked for:)" [puppet] - 10https://gerrit.wikimedia.org/r/919359 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [20:17:50] (03CR) 10Dzahn: [C: 03+2] gerrit: remove gerrit1001 as a source host for migrations [puppet] - 10https://gerrit.wikimedia.org/r/919400 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [20:23:42] (03CR) 10Dzahn: [C: 03+2] "what this did:" [puppet] - 10https://gerrit.wikimedia.org/r/919400 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [20:24:23] (03CR) 10Dzahn: [C: 03+2] gerrit: disable monitoring for gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/919244 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [20:24:52] (03CR) 10Jdrewniak: [C: 03+2] Consolidate watchstar icon updating logic under watchstar.js [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920242 (https://phabricator.wikimedia.org/T336640) (owner: 10Jdrewniak) [20:28:43] (03Merged) 10jenkins-bot: Ensure mw-watchlink is used for the sticky header watchlink [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920240 (https://phabricator.wikimedia.org/T336640) (owner: 10Jdrewniak) [20:28:44] MatmaRex: while we're waiting for those to merge, I'm looking at your patch but I don't actually know how to deploy that... (like, where should that script be run?) 
[20:29:09] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:920240|Ensure mw-watchlink is used for the sticky header watchlink (T336640 T336641)]] [20:29:14] T336640: Vector sticky header watch/unwatch button disappears when clicked - https://phabricator.wikimedia.org/T336640 [20:29:15] T336641: Vector sticky header watch/unwatch icon is always in the "not watched" state - https://phabricator.wikimedia.org/T336641 [20:29:43] (03CR) 10Dzahn: [C: 03+2] "I could see on alert1001 how icinga checks were removed from config but I still see in Icinga web UI.. is it on 2001? running puppet there" [puppet] - 10https://gerrit.wikimedia.org/r/919244 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [20:30:12] jan_drewniak: hm, i'm not sure either but it's documented somewhere, let me see if i can find it [20:30:15] !log Rolling out maglev LVS scheduler in drmrs (for real this time) - T263797 [20:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:19] T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 [20:30:41] !log jdrewniak@deploy1002 jdrewniak: Backport for [[gerrit:920240|Ensure mw-watchlink is used for the sticky header watchlink (T336640 T336641)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [20:31:15] jan_drewniak: there's a deploy commands tool [20:31:50] https://wikitech.wikimedia.org/wiki/Maintenance_server [20:31:50] jan_drewniak: https://deploy-commands.toolforge.org/bacc [20:32:03] jouncebot: now [20:32:03] For the next 0 hour(s) and 27 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T2000) [20:32:37] (03CR) 10Andrew Bogott: [C: 03+2] remove wmcs-backup-instances script, no longer used [puppet] - 10https://gerrit.wikimedia.org/r/919896 (owner: 10Andrew Bogott) [20:33:04] i don't think 
the deploy commands are relevant for running a maintenance script, just for other deployments [20:33:51] (03CR) 10Dzahn: [C: 03+2] "this just removed a few of them, like HTTPS on gerrit1001, but gerrit1001 still has the base checks that are not specific to service and a" [puppet] - 10https://gerrit.wikimedia.org/r/919244 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [20:33:52] MatmaRex: the script is just ran on mwmaint ye [20:35:14] jan_drewniak: summarizing from that page – i think i want you to ssh into mwmaint1002, then run `mwscript MediaWiki.Extension.DiscussionTools.Maintenance.NewTopicOptOutActiveUsers --wiki=fiwiki --dry-run` [20:35:36] i'm not sure if that's exactly the right command for run.php stuff, but we can try and see, nothing terrible will happen if it fails [20:36:03] MatmaRex: ok thanks, I was just reading that :) [20:36:18] PROBLEM - pybal on lvs6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [20:36:53] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:920240|Ensure mw-watchlink is used for the sticky header watchlink (T336640 T336641)]] (duration: 07m 44s) [20:37:00] T336640: Vector sticky header watch/unwatch button disappears when clicked - https://phabricator.wikimedia.org/T336640 [20:37:01] T336641: Vector sticky header watch/unwatch icon is always in the "not watched" state - https://phabricator.wikimedia.org/T336641 [20:37:08] PROBLEM - PyBal connections to etcd on lvs6002 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [20:37:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920242 (https://phabricator.wikimedia.org/T336640) (owner: 10Jdrewniak) [20:39:24] (03Merged) 10jenkins-bot: Consolidate watchstar icon updating logic under 
watchstar.js [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920242 (https://phabricator.wikimedia.org/T336640) (owner: 10Jdrewniak) [20:39:46] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:920242|Consolidate watchstar icon updating logic under watchstar.js (T336640 T336641)]] [20:41:25] !log jdrewniak@deploy1002 jdrewniak: Backport for [[gerrit:920242|Consolidate watchstar icon updating logic under watchstar.js (T336640 T336641)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:41:47] (03PS1) 10Volans: install_server: fix ztp-juniper script [puppet] - 10https://gerrit.wikimedia.org/r/920374 (https://phabricator.wikimedia.org/T336485) [20:45:27] ACKNOWLEDGEMENT - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T336814 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:45:31] 10SRE, 10ops-eqiad: Degraded RAID on analytics1068 - https://phabricator.wikimedia.org/T336814 (10ops-monitoring-bot) [20:46:46] (03CR) 10Jdrewniak: [C: 03+2] Add maint script to opt out active users from the new topic tool [extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920237 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [20:47:44] PROBLEM - PyBal backends health check on lvs6002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [20:49:06] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:920242|Consolidate watchstar icon updating logic under watchstar.js (T336640 T336641)]] (duration: 09m 19s) [20:49:12] T336640: Vector sticky header watch/unwatch button disappears when clicked - https://phabricator.wikimedia.org/T336640 [20:49:13] T336641: Vector sticky header watch/unwatch icon is always in 
the "not watched" state - https://phabricator.wikimedia.org/T336641 [20:49:41] !log volans@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-a8-codfw.mgmt.codfw.wmnet [20:49:50] MatmaRex: ok I'm deploying yours to 1.8 first, then I'll run the script, then I'll do 1.9, does that sound good? [20:50:33] yeah, sounds correct [20:50:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920237 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [20:51:12] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:51:50] (03Merged) 10jenkins-bot: Add maint script to opt out active users from the new topic tool [extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920237 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński) [20:52:19] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:920237|Add maint script to opt out active users from the new topic tool (T317375)]] [20:52:23] T317375: [Config change] Deploy New Topic Tool as opt-out preference at fi.wiki (desktop) - https://phabricator.wikimedia.org/T317375 [20:53:27] MatmaRex: assuming there's nothing to check on mwdebug? 
[20:53:41] nope [20:53:49] !log jdrewniak@deploy1002 jdrewniak and matmarex: Backport for [[gerrit:920237|Add maint script to opt out active users from the new topic tool (T317375)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:59:37] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:920237|Add maint script to opt out active users from the new topic tool (T317375)]] (duration: 07m 18s) [20:59:42] T317375: [Config change] Deploy New Topic Tool as opt-out preference at fi.wiki (desktop) - https://phabricator.wikimedia.org/T317375 [21:00:50] alright this is what I'm gonna run `php maintenance/run.php MediaWiki.Extension.DiscussionTools.Maintenance.NewTopicOptOutActiveUsers --wiki=fiwiki --dry-run` [21:02:09] jan_drewniak: not with mwscript? [21:02:40] yeah the above just failed, lol, I'll do with mwscript [21:02:51] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T336720 (10wiki_willy) a:03Jhancock.wm [21:03:19] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336538 (10wiki_willy) a:03Jhancock.wm [21:04:28] 10SRE, 10ops-ulsfo, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2022/2023-Q4): Decommission prometheus4001 - https://phabricator.wikimedia.org/T335585 (10wiki_willy) a:03RobH [21:04:58] 10SRE, 10ops-eqsin, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2022/2023-Q4): Decommission prometheus5001 - https://phabricator.wikimedia.org/T335587 (10wiki_willy) a:03RobH [21:05:34] 10SRE, 10ops-eqsin, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2022/2023-Q4): Decommission prometheus5001 - https://phabricator.wikimedia.org/T335587 (10wiki_willy) @RobH - this might be something we could add to the recycle pickup [21:06:01] 10ops-drmrs, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2022/2023-Q4): Decommission prometheus6001 - https://phabricator.wikimedia.org/T335588 (10wiki_willy) 
a:03RobH [21:06:22] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:06:30] RECOVERY - pybal on lvs6002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [21:06:57] MatmaRex: running the script with mwscript doesn't work, gives me this error `It does not set $maintClass and does not return a class name.` [21:07:09] so I think I have to run it with run.php [21:07:09] (MXQueueHigh) firing: MX host mx1001:9100 has many queued messages: 4006 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [21:07:10] RECOVERY - PyBal backends health check on lvs6002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:07:39] hmm [21:07:41] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) Result of the testing with Cathal. I first want to thank @cmooney for all the help with JunOS-magics, that was pre... [21:08:11] jan_drewniak: what's your exact command? mwscript should already use run.php internally [21:08:30] got p.aged. acked. [21:09:17] mutante: fyi deployment going on too [21:09:30] jan_drewniak: you don't need run.php if using mwscript [21:09:40] RECOVERY - PyBal connections to etcd on lvs6002 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [21:10:12] it's not the LVS thing [21:10:14] ok so where should I run it from? `mwscript MediaWiki.Extension.DiscussionTools.Maintenance.NewTopicOptOutActiveUsers --wiki=fiwiki --dry-run` not found [21:10:34] the extension root? 
[21:11:05] jan_drewniak: try the full path to the script [21:11:14] I believe both that and class is supported [21:11:23] jan_drewniak: anywhere [21:11:30] i don't think it matters what directory you're in [21:11:43] mutante: are you happy with a confused mediawiki deployment going on? [21:12:28] RhinosF1: no reason to believe its' related to deployment, but I cant pay attention to deployment [21:12:41] Good [21:12:41] jan_drewniak: anyway, if we can't get it to work, i can try again tomorrow. it's not urgent and we're past time [21:13:10] MatmaRex: might be worth switching to the php file format if you're not sure on class but [21:13:26] I'm running it with the full path ` mwscript /srv/mediawiki/php-1\41\0-wmf\8/maintenance/MediaWiki\Extension\DiscussionTools\Maintenance\NewTopicOptOutActiveUsers.php --wiki=fiwiki --dry-run` but still "not found" [21:13:49] That wouldn't be right anyway... [21:13:51] jan_drewniak: why \ instead of . [21:13:55] `mwscript extensions/DiscussionTools/maintenance/NewTopicOptOutActiveUsers.php --wiki fiwiki --dry-run` ? [21:13:58] i don't think that would work. mwscript should figure out the path itself [21:13:59] ^ [21:14:06] But yes what TheresNoTime said [21:15:10] https://www.irccloud.com/pastebin/0ARobnYF/ [21:15:31] MatmaRex: is that an issue with the script? [21:16:11] i don't know, that's weird [21:16:18] That's an issue with the script yes [21:16:20] i am sure that the script *can* be executed using MaintenanceRunner [21:16:24] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:33] because i've been running it with run.php locally [21:16:59] well, actually [21:17:13] the problem is that it's trying to execute it from the file name [21:17:27] it should be executed using the weird namespace path with dots [21:17:34] MatmaRex: is this script only run manually? 
Does it fix anything urgent or if we are unsure, could the running be halted until people confident are around? [21:17:38] which is supposed to be the new hotness in executing maintenance scripts [21:17:45] MatmaRex: run.php should support both types [21:17:54] And that didn't work either for jan_drewniak [21:18:06] RhinosF1: i have already said that we can drop it. but it looks like folks want to figure it out [21:18:18] Reedy, TheresNoTime: ^ [21:18:24] i don't know what commands jan used, although i'd be curious to see [21:18:43] i think the correct command is: mwscript MediaWiki.Extension.DiscussionTools.Maintenance.NewTopicOptOutActiveUsers --wiki=fiwiki --dry-run [21:19:03] anyway. i can try again tomorrow if you want to close the window. i'm completely fine with that [21:19:18] I'm not important enough to make that call [21:19:31] hence i'm asking jan_drewniak [21:19:31] https://www.irccloud.com/pastebin/CGImOadJ/NewTopicOptOutActiveUsers%20test [21:19:32] But unsure people randomly guessing commands doesn't feel safe [21:19:57] jan_drewniak: what was the full error when using the format with .'s [21:19:58] MatmaRex: yeah, I tried both [21:20:04] The same? [21:20:46] ok, that's interesting. this part: "Script '/srv/mediawiki/php-1.41.0-wmf.8/maintenance/MediaWiki.Extension.DiscussionTools.Maintenance.NewTopicOptOutActiveUsers' not found (tried path '/srv/mediawiki/php-1.41.0-wmf.8/maintenance/MediaWiki.Extension.DiscussionTools.Maintenance.NewTopicOptOutActiveUsers.php' and class '/srv/mediawiki/php-1\41\0-wmf\8/maintenance/MediaWiki\Extension\DiscussionTools\Maintenance\NewTopicOptOutActiveUsers') [21:21:06] i don't know where this error comes from, but it should not be building paths like that [21:21:14] That's run.php [21:21:15] anyway. i can look into it later [21:21:17] using the classname gives me a not found error, maybe I'm not executing it from the right path? using the file path says it's not executable with maintenance running.
[21:21:37] (03PS1) 10Ottomata: Declare mediawiki.page_change.v1 stream and produce from Eventbus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920378 (https://phabricator.wikimedia.org/T336817) [21:21:37] jan_drewniak: I think it's best to call it a day and let someone more confident take over [21:21:46] Who don't seem to be around [21:22:07] I think so, in any case it's deployed to wmf.8, is it ok if it stays there? [21:22:17] (03CR) 10CI reject: [V: 04-1] Declare mediawiki.page_change.v1 stream and produce from Eventbus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920378 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [21:22:29] yes. the script does nothing by itself, it can stay deployed [21:23:07] It's probably some autoloader screwy-ness [21:23:22] MatmaRex: ok, sorry I don't know what I'm doing 😅better luck tomorrow [21:24:01] that'll teach me not to try to write things in the modern way [21:24:36] PROBLEM - PyBal backends health check on lvs6001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [21:24:40] PROBLEM - pybal on lvs6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [21:24:52] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:25:07] I'd still have expected it to run when called directly via `mwscript` though.. [21:25:49] https://wikitech.wikimedia.org/wiki/Maintenance_server#Run_a_maintenance_script_on_a_wiki [21:25:50] !bug 1 [21:25:50] https://bugzilla.wikimedia.org/show_bug.cgi?id=1 [21:26:00] RhinosF1: hey, so.. can you tell me more about the deployment and the job [21:26:02] jan_drewniak: never apologise for being unsure, best thing to do is say! 
[21:26:08] RhinosF1: maybe it IS related after all [21:26:13] mutante: deployment aborted anyway [21:26:20] It's a maint script [21:26:22] No one can run it [21:26:23] is it possible this sends email [21:26:23] (03PS1) 10Ottomata: page_content_change - Consume from mediawiki.page_change.v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/920379 (https://phabricator.wikimedia.org/T336817) [21:26:25] TheresNoTime: The error makes sense there though, as the stuff that the old way would "need" is indeed missing [21:26:29] mutante: it's not ran [21:26:29] mutante: No [21:26:30] So no [21:26:30] from wiki@wikimedia.org [21:26:41] ok [21:27:15] Reedy: mhm, sorry yes I meant "things written the new way should be backwards compatible unless we've agreed to phase that out@ [21:27:19] mutante: happy to let this channel focus on the page though [21:27:21] s/@/" [21:27:22] well, nevermind then :) [21:27:31] TheresNoTime: Blame MatmaRex for not adding the boilerplate ;D [21:27:34] RhinosF1: no, it's ok, we are using other [21:27:40] tsk [21:27:54] I guess the new method will work fine if it's there... So that is really the fowards compatible way [21:27:57] it's not supposed to be added, is hwat i heard [21:27:57] mutante: cool [21:27:58] anyway [21:27:59] Until we deprecate/remove the old way, and then remove it [21:27:59] i see the bug [21:28:00] https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/d386f3152201cf26e7e2387094c7321b66d8ff3f/multiversion/MWScript.php#68 [21:28:05] this crap is messing up the class name [21:28:29] you can see it in jan's error message in https://www.irccloud.com/pastebin/CGImOadJ/NewTopicOptOutActiveUsers%20test [21:28:34] PROBLEM - PyBal connections to etcd on lvs6001 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [21:29:37] there are like 5 ways to run scripts now, eh [21:29:43] *URGH* [21:29:52] only 5? 
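Note on the error quoted earlier in the log: the class it reports trying, '/srv/mediawiki/php-1\41\0-wmf\8/maintenance/MediaWiki\Extension\...', is consistent with the dotted script name being joined onto the versioned directory first and having its dots swapped for backslashes only afterwards. The Python below is a rough illustration of that symptom and of one possible guard. It is not the MWScript.php or run.php code, and `naive_class_name`, `resolve_script_arg`, `maintenance_dir` and `version_dir` are hypothetical names introduced for this sketch.

```python
import posixpath
import re

# A dotted class name such as "MediaWiki.Extension.Foo.Maintenance.Bar":
# identifiers separated by dots, with no slashes anywhere.
DOTTED_CLASS = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+")


def naive_class_name(script_arg: str, maintenance_dir: str) -> str:
    """Reproduce the symptom: prefix the directory, then replace every dot.

    Because the directory is already part of the string, the dots in the
    version number ("php-1.41.0-wmf.8") are mangled along with the class name.
    """
    return posixpath.join(maintenance_dir, script_arg).replace(".", "\\")


def resolve_script_arg(script_arg: str, version_dir: str) -> tuple[str, str]:
    """One possible guard (an assumption, not the deployed fix).

    Decide whether the argument is a class name *before* any directory is
    attached, and convert dots only in that case.
    """
    if DOTTED_CLASS.fullmatch(script_arg) and not script_arg.endswith(".php"):
        # Class-name syntax: convert dots to namespace separators, no path prefix.
        return ("class", script_arg.replace(".", "\\"))
    # Anything else is treated as a file path relative to the version directory.
    return ("path", posixpath.join(version_dir, script_arg))


if __name__ == "__main__":
    arg = "MediaWiki.Extension.DiscussionTools.Maintenance.NewTopicOptOutActiveUsers"
    print(naive_class_name(arg, "/srv/mediawiki/php-1.41.0-wmf.8/maintenance"))
    print(resolve_script_arg(arg, "/srv/mediawiki/php-1.41.0-wmf.8"))
```

With the directory and script name taken from the log, `naive_class_name` yields the same mangled string that appears in the quoted error, while `resolve_script_arg` leaves the version directory out of the class name entirely.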
[21:30:00] * bd808 makes a new way [21:30:08] mwscript DiscussionTools:NewTopicOptOutActiveUsers --wiki=fiwiki --dry-run [21:30:12] this will probably work ^ [21:30:15] syntax is hard [21:30:18] * RhinosF1 bowing out for the night, I will dream up new ways [21:30:30] https://wikitech.wikimedia.org/wiki/Maintenance_server#Run_a_maintenance_script_on_a_wiki needs updating [21:30:48] * TheresNoTime only just updated it D: [21:31:04] Reedy: yes it does because I need to write a new mwscript for Miraheze at some point that properly supports this madness [21:31:06] after the *last time* they changed how script ran [21:31:49] MatmaRex: (it didn't fwiw) [21:31:55] heh [21:32:06] I can give it one more shot! I didn't try `mwscript DiscussionTools:NewTopicOptOutActiveUsers --wiki=fiwiki --dry-run` (with the colon) [21:32:09] (MXQueueHigh) resolved: MX host mx1001:9100 has many queued messages: 4004 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [21:32:21] sync the extension dir fully? [21:32:26] I lose track of what was deployed [21:33:27] probably best to hold off entirely now.. there's Stuff(tm) going on, and that's not an ideal time to guess commands in production [21:33:53] unless it was trying to send emails... it's completely unrelated [21:34:06] even if it was... 
it's not getting as far as executing the code for it anyway [21:34:07] alright, I'm as curious as anyone, but I'll leave it with MatmaRex then :) [21:35:05] i'll schedule it for another time [21:39:28] (PS2) Ottomata: Declare mediawiki.page_change.v1 stream and produce from Eventbus [mediawiki-config] - https://gerrit.wikimedia.org/r/920378 (https://phabricator.wikimedia.org/T336817) [21:40:09] (CR) CI reject: [V: -1] Declare mediawiki.page_change.v1 stream and produce from Eventbus [mediawiki-config] - https://gerrit.wikimedia.org/r/920378 (https://phabricator.wikimedia.org/T336817) (owner: Ottomata) [21:43:27] (PS3) Ottomata: Declare mediawiki.page_change.v1 stream and produce from Eventbus [mediawiki-config] - https://gerrit.wikimedia.org/r/920378 (https://phabricator.wikimedia.org/T336817) [21:44:52] RECOVERY - PyBal backends health check on lvs6001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:44:58] RECOVERY - pybal on lvs6001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [21:45:06] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:45:14] RECOVERY - PyBal connections to etcd on lvs6001 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [21:47:18] PROBLEM - Check systemd state on wdqs2022 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:49:58] i filed https://phabricator.wikimedia.org/T336819 "Maintenance script designed for run.php syntax cannot be executed in Wikimedia production" [21:50:23] RhinosF1: the mail incident is resolved, 
fwiw [21:51:14] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:01:22] SRE, SRE-tools, Infrastructure-Foundations, netops, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (cmooney) Awesome work getting it working @volans big thanks to you too :) >>! In T336485#8857232, @Volans wrote: > HTTP i... [22:01:48] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:01:59] (CR) Cathal Mooney: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/920374 (https://phabricator.wikimedia.org/T336485) (owner: Volans) [22:04:06] (CR) MarcoAurelio: "This change is ready for review." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/920244 (https://phabricator.wikimedia.org/T336675) (owner: MarcoAurelio) [22:08:35] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (cmooney) [22:13:44] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:13:56] (PS1) Ottomata: Create mediawiki-page-content-change-enrichment namespaces in staging [deployment-charts] - https://gerrit.wikimedia.org/r/920382 (https://phabricator.wikimedia.org/T330507) [22:14:38] (CR) CI reject: [V: -1] Create mediawiki-page-content-change-enrichment namespaces in staging [deployment-charts] - https://gerrit.wikimedia.org/r/920382 (https://phabricator.wikimedia.org/T330507) (owner: Ottomata) [22:15:18] (PS2) Ottomata: Create mediawiki-page-content-change-enrichment namespaces in staging [deployment-charts] - https://gerrit.wikimedia.org/r/920382 (https://phabricator.wikimedia.org/T330507) [22:21:31] (PS4) MarcoAurelio: dblists: Close akwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/920244 (https://phabricator.wikimedia.org/T336675) [22:24:56] (CR) Dzahn: [C: -1] "this should now happen after we reimaged gerrit2002 and as part of that also moved the data" [puppet] - https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: Hashar) [22:25:22] (CR) Dzahn: [C: +1] gerrit: remove gerrit1001 from .ssh/config [puppet] - https://gerrit.wikimedia.org/r/919403 (https://phabricator.wikimedia.org/T336427) (owner: Dzahn) [22:25:34] (CR) Dzahn: [C: +1] gerrit: add gerrit1003 to hosts using KexAlgo ecdh-sha2-nistp521 for ssh [puppet] - https://gerrit.wikimedia.org/r/919402 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn) [22:25:49] (CR) Dzahn: [C: +1] gerrit: remove 
gerrit1001 from ssh_allowed hosts and acme_chief [puppet] - https://gerrit.wikimedia.org/r/919401 (https://phabricator.wikimedia.org/T336427) (owner: Dzahn) [22:26:38] (CR) Dzahn: [C: -1] "As opposed to other changes that are ready to go we should probably wait here until the host is actually shut down. ?" [puppet] - https://gerrit.wikimedia.org/r/919408 (https://phabricator.wikimedia.org/T336427) (owner: Dzahn) [22:28:26] jouncebot: nowandnext [22:28:26] No deployments scheduled for the next 7 hour(s) and 31 minute(s) [22:28:26] In 7 hour(s) and 31 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T0600) [22:31:26] (PS1) Kimberly Sarabia: Enable zebra ab test in hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335309) [22:41:01] (PS2) Kimberly Sarabia: Enable zebra ab test in hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335309) [22:41:03] (PS3) Jdlrobson: Enable zebra ab test in hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335972) (owner: Kimberly Sarabia) [22:49:53] I think https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/874925 can be abandoned now [22:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:52:40] (CR) Jdlrobson: [C: +1] Enable zebra ab test in hewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335972) (owner: Kimberly Sarabia) [22:54:49] (PS2) Jdlrobson: Launch content separation Zebra AB Test [mediawiki-config] - https://gerrit.wikimedia.org/r/918568 
(https://phabricator.wikimedia.org/T335972) (owner: Kimberly Sarabia) [22:55:29] (CR) Jdlrobson: [C: +1] "Looks ready to go" [mediawiki-config] - https://gerrit.wikimedia.org/r/918568 (https://phabricator.wikimedia.org/T335972) (owner: Kimberly Sarabia) [23:05:42] (PS3) MarcoAurelio: Update pnbwiktionary project namespace and sitename [mediawiki-config] - https://gerrit.wikimedia.org/r/859080 (https://phabricator.wikimedia.org/T323545) (owner: Middle river exports) [23:05:50] (CR) CI reject: [V: -1] Update pnbwiktionary project namespace and sitename [mediawiki-config] - https://gerrit.wikimedia.org/r/859080 (https://phabricator.wikimedia.org/T323545) (owner: Middle river exports) [23:19:07] (CR) MarcoAurelio: Update pnbwiktionary project namespace and sitename (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/859080 (https://phabricator.wikimedia.org/T323545) (owner: Middle river exports) [23:25:21] (CR) MarcoAurelio: [C: -1] "Hello. Since this patch was uploaded the configuration files have changed a bit. It needs to be rebased and modified accordingly, or maybe" [mediawiki-config] - https://gerrit.wikimedia.org/r/859080 (https://phabricator.wikimedia.org/T323545) (owner: Middle river exports) [23:37:54] * Krinkle staging on mwdebug1002 [23:53:19] (CR) Tim Starling: [C: +1] Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle) [23:57:53] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (ssingh)