[00:00:16] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P48813 and previous config saved to /var/cache/conftool/dbconfig/20230606-000330-ladsgroup.json [00:04:48] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:01] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder) [00:08:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T336886)', diff saved to https://phabricator.wikimedia.org/P48814 and previous config saved to /var/cache/conftool/dbconfig/20230606-000818-ladsgroup.json [00:08:21] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [00:15:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:18:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T336886)', diff saved to https://phabricator.wikimedia.org/P48815 and previous config saved to /var/cache/conftool/dbconfig/20230606-001836-ladsgroup.json [00:18:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [00:18:40] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [00:18:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [00:18:54] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [00:19:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [00:19:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T336886)', diff saved to https://phabricator.wikimedia.org/P48816 and previous config saved to /var/cache/conftool/dbconfig/20230606-001914-ladsgroup.json [00:21:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T336886)', diff saved to https://phabricator.wikimedia.org/P48817 and previous config saved to /var/cache/conftool/dbconfig/20230606-002125-ladsgroup.json [00:23:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P48818 and previous config saved to /var/cache/conftool/dbconfig/20230606-002324-ladsgroup.json [00:26:39] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Papaul) @Volans I checked on the dhcp server in the dhcp config file there is an option that is missing ' for configuration we have now is : `... 
[00:27:27] (03PS1) 10Andrew Bogott: nove vendordata: purge systemd-resolved libnss-resolve during first boot [puppet] - 10https://gerrit.wikimedia.org/r/927317 (https://phabricator.wikimedia.org/T338192) [00:28:18] (03CR) 10Andrew Bogott: [C: 03+2] nove vendordata: purge systemd-resolved libnss-resolve during first boot [puppet] - 10https://gerrit.wikimedia.org/r/927317 (https://phabricator.wikimedia.org/T338192) (owner: 10Andrew Bogott) [00:36:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P48819 and previous config saved to /var/cache/conftool/dbconfig/20230606-003631-ladsgroup.json [00:37:16] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [00:38:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P48820 and previous config saved to /var/cache/conftool/dbconfig/20230606-003830-ladsgroup.json [00:39:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/926561 [00:39:36] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/926561 (owner: 10TrainBranchBot) [00:51:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P48821 and previous config saved to /var/cache/conftool/dbconfig/20230606-005137-ladsgroup.json [00:53:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T336886)', diff saved to https://phabricator.wikimedia.org/P48822 and previous config saved to /var/cache/conftool/dbconfig/20230606-005336-ladsgroup.json [00:53:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [00:53:39] T336886: Add user_is_temp field to 
the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [00:53:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [00:53:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T336886)', diff saved to https://phabricator.wikimedia.org/P48823 and previous config saved to /var/cache/conftool/dbconfig/20230606-005357-ladsgroup.json [00:56:49] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/926561 (owner: 10TrainBranchBot) [01:02:35] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T338194 (10phaultfinder) [01:06:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T336886)', diff saved to https://phabricator.wikimedia.org/P48824 and previous config saved to /var/cache/conftool/dbconfig/20230606-010643-ladsgroup.json [01:06:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [01:06:47] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [01:06:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [01:07:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T336886)', diff saved to https://phabricator.wikimedia.org/P48825 and previous config saved to /var/cache/conftool/dbconfig/20230606-010704-ladsgroup.json [01:18:06] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [01:20:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T336886)', diff saved to 
https://phabricator.wikimedia.org/P48826 and previous config saved to /var/cache/conftool/dbconfig/20230606-012058-ladsgroup.json [01:21:02] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [01:29:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T336886)', diff saved to https://phabricator.wikimedia.org/P48827 and previous config saved to /var/cache/conftool/dbconfig/20230606-013104-ladsgroup.json [01:31:09] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [01:36:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:36:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P48828 and previous config saved to /var/cache/conftool/dbconfig/20230606-013604-ladsgroup.json [01:41:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:46:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P48829 and previous config saved to 
/var/cache/conftool/dbconfig/20230606-014610-ladsgroup.json [01:51:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P48830 and previous config saved to /var/cache/conftool/dbconfig/20230606-015110-ladsgroup.json [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T0200) [02:01:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P48831 and previous config saved to /var/cache/conftool/dbconfig/20230606-020116-ladsgroup.json [02:06:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T336886)', diff saved to https://phabricator.wikimedia.org/P48832 and previous config saved to /var/cache/conftool/dbconfig/20230606-020616-ladsgroup.json [02:06:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [02:06:20] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [02:06:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [02:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:05] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.12 [core] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/926562 (https://phabricator.wikimedia.org/T337526) [02:08:15] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.12 [core] 
(wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/926562 (https://phabricator.wikimedia.org/T337526) (owner: 10TrainBranchBot) [02:15:02] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:16:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T336886)', diff saved to https://phabricator.wikimedia.org/P48833 and previous config saved to /var/cache/conftool/dbconfig/20230606-021622-ladsgroup.json [02:16:28] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [02:19:38] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:24:56] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.12 [core] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/926562 (https://phabricator.wikimedia.org/T337526) (owner: 10TrainBranchBot) [02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:54:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [02:55:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [02:55:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T336886)', diff saved to https://phabricator.wikimedia.org/P48834 and previous config saved to /var/cache/conftool/dbconfig/20230606-025507-ladsgroup.json [02:55:10] T336886: Add user_is_temp field to the user table in MediaWiki core - 
https://phabricator.wikimedia.org/T336886 [02:57:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T336886)', diff saved to https://phabricator.wikimedia.org/P48835 and previous config saved to /var/cache/conftool/dbconfig/20230606-025717-ladsgroup.json [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T0300) [03:02:17] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder) [03:03:48] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:12:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P48836 and previous config saved to /var/cache/conftool/dbconfig/20230606-031223-ladsgroup.json [03:16:20] PROBLEM - Check systemd state on doc1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:27:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P48837 and previous config saved to /var/cache/conftool/dbconfig/20230606-032729-ladsgroup.json [03:31:31] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - pt1979@cumin2002" [03:32:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - pt1979@cumin2002" [03:32:36] !log 
pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [03:32:36] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet [03:42:10] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [03:42:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T336886)', diff saved to https://phabricator.wikimedia.org/P48838 and previous config saved to /var/cache/conftool/dbconfig/20230606-034235-ladsgroup.json [03:42:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [03:42:39] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [03:42:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [03:42:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T336886)', diff saved to https://phabricator.wikimedia.org/P48839 and previous config saved to /var/cache/conftool/dbconfig/20230606-034256-ladsgroup.json [03:45:02] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:45:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T336886)', diff saved to https://phabricator.wikimedia.org/P48840 and previous config saved to /var/cache/conftool/dbconfig/20230606-034506-ladsgroup.json [03:49:38] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:00:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1191', diff saved to https://phabricator.wikimedia.org/P48841 and previous config saved to /var/cache/conftool/dbconfig/20230606-040013-ladsgroup.json [04:08:17] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder) [04:13:20] RECOVERY - Check systemd state on doc1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:15:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P48842 and previous config saved to /var/cache/conftool/dbconfig/20230606-041520-ladsgroup.json [04:30:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T336886)', diff saved to https://phabricator.wikimedia.org/P48843 and previous config saved to /var/cache/conftool/dbconfig/20230606-043026-ladsgroup.json [04:30:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [04:30:31] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [04:30:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [04:30:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T336886)', diff saved to https://phabricator.wikimedia.org/P48844 and previous config saved to /var/cache/conftool/dbconfig/20230606-043047-ladsgroup.json [04:33:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T336886)', diff saved to https://phabricator.wikimedia.org/P48845 and previous config saved to /var/cache/conftool/dbconfig/20230606-043358-ladsgroup.json [04:49:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P48846 and previous config 
saved to /var/cache/conftool/dbconfig/20230606-044904-ladsgroup.json [05:04:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P48847 and previous config saved to /var/cache/conftool/dbconfig/20230606-050410-ladsgroup.json [05:05:42] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (gerrit1001, ...), Fresh: 119 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:18:02] (03CR) 10Tchanders: [C: 03+1] checkuser: Disable client hints feature by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926483 (https://phabricator.wikimedia.org/T337944) (owner: 10Kosta Harlan) [05:19:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T336886)', diff saved to https://phabricator.wikimedia.org/P48848 and previous config saved to /var/cache/conftool/dbconfig/20230606-051918-ladsgroup.json [05:19:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [05:19:22] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [05:19:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [05:19:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T336886)', diff saved to https://phabricator.wikimedia.org/P48849 and previous config saved to /var/cache/conftool/dbconfig/20230606-051938-ladsgroup.json [05:22:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T336886)', diff saved to https://phabricator.wikimedia.org/P48850 and previous config saved to /var/cache/conftool/dbconfig/20230606-052249-ladsgroup.json [05:30:32] (03CR) 10Stevemunene: [C: 03+2] Decommission an-worker1058 from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/922841 
(https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [05:34:00] !log ladsgroup@clouddb1021:/srv/sqldata.s1$ sudo rm db1196* (T337961) [05:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:04] T337961: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 [05:37:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P48851 and previous config saved to /var/cache/conftool/dbconfig/20230606-053755-ladsgroup.json [05:40:00] 10Puppet, 10Cloud-VPS, 10cloud-services-team: puppet package versioning on Bookworm for cloud-vps - https://phabricator.wikimedia.org/T338195 (10Dzahn) [05:48:57] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 2518 [05:49:58] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 2518 [05:50:02] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 2518 [05:50:12] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'configure' for AS: 2518 [05:53:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P48852 and previous config saved to /var/cache/conftool/dbconfig/20230606-055301-ladsgroup.json [06:00:02] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:06] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T0600) [06:00:06] kormat, marostegui, and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T0600). 
[06:04:42] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:08:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T336886)', diff saved to https://phabricator.wikimedia.org/P48853 and previous config saved to /var/cache/conftool/dbconfig/20230606-060807-ladsgroup.json [06:08:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [06:08:11] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [06:08:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [06:13:28] (03CR) 10Tchanders: [C: 03+2] ipoid: Update for GitLab migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/926424 (https://phabricator.wikimedia.org/T337714) (owner: 10Kosta Harlan) [06:14:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:14:32] (03Merged) 10jenkins-bot: ipoid: Update for GitLab migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/926424 (https://phabricator.wikimedia.org/T337714) (owner: 10Kosta Harlan) [06:15:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:21:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [06:22:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [06:36:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 
db2100.codfw.wmnet with reason: Maintenance [06:36:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [06:49:26] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:50:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:50:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2108.codfw.wmnet with reason: Maintenance [06:50:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2108.codfw.wmnet with reason: Maintenance [06:50:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T336886)', diff saved to https://phabricator.wikimedia.org/P48854 and previous config saved to /var/cache/conftool/dbconfig/20230606-065057-ladsgroup.json [06:51:00] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [06:51:10] (03PS3) 10ArielGlenn: fix up regex comparisons in dumps nfs share testing script [puppet] - 10https://gerrit.wikimedia.org/r/924874 (https://phabricator.wikimedia.org/T325232) [06:52:04] (03CR) 10ArielGlenn: [C: 03+2] fix up regex comparisons in dumps nfs share testing script [puppet] - 10https://gerrit.wikimedia.org/r/924874 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [06:54:37] (03PS2) 10ArielGlenn: fix up more things in the docs for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/924887 (https://phabricator.wikimedia.org/T325232) [06:55:38] (03CR) 10ArielGlenn: [C: 03+2] fix up more things in the docs for testing new 
dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/924887 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [06:56:15] (03CR) 10Elukey: [C: 03+1] "Looks good to me, I added other folks to the change that may be interested (if not sorry! Feel free to drop your name from the list)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121) (owner: 10Klausman) [06:57:39] (03CR) 10Elukey: [C: 03+1] Add rate limiting class for WME using LiftWing (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121) (owner: 10Klausman) [07:00:07] Amir1, Urbanecm, and taavi: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T0700). [07:00:07] kostajh and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:23] good morning [07:00:25] o/ [07:01:32] Abijeet should be around in ~30m to help deploy the translate patch [07:06:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T336886)', diff saved to https://phabricator.wikimedia.org/P48855 and previous config saved to /var/cache/conftool/dbconfig/20230606-070631-ladsgroup.json [07:06:35] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [07:07:02] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder) [07:11:26] hi, I'll be around in another 10 minutes. 
[07:11:59] hello, I'm here [07:12:22] I'll get started with mine [07:13:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926483 (https://phabricator.wikimedia.org/T337944) (owner: 10Kosta Harlan) [07:13:26] (03PS2) 10Kosta Harlan: checkuser: Disable client hints feature by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926483 (https://phabricator.wikimedia.org/T337944) [07:13:31] (03CR) 10TrainBranchBot: "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926483 (https://phabricator.wikimedia.org/T337944) (owner: 10Kosta Harlan) [07:14:14] (03Merged) 10jenkins-bot: checkuser: Disable client hints feature by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926483 (https://phabricator.wikimedia.org/T337944) (owner: 10Kosta Harlan) [07:14:41] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:926483|checkuser: Disable client hints feature by default (T337944)]] [07:14:44] T337944: Implement support for requesting client hint header - https://phabricator.wikimedia.org/T337944 [07:14:56] (03PS3) 10DCausse: ttm: use new config option to separate readable and writable services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922481 (https://phabricator.wikimedia.org/T322284) [07:16:09] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:926483|checkuser: Disable client hints feature by default (T337944)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [07:18:17] (03CR) 10Slyngshede: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/926520 (owner: 10Muehlenhoff) [07:21:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P48856 and previous config saved to 
/var/cache/conftool/dbconfig/20230606-072137-ladsgroup.json [07:22:55] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:926483|checkuser: Disable client hints feature by default (T337944)]] (duration: 08m 14s) [07:22:59] T337944: Implement support for requesting client hint header - https://phabricator.wikimedia.org/T337944 [07:23:06] ok, I'm done with mine [07:23:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:24:14] I'm around at my desk now. We can go forward with the deployment [07:24:59] abijeet: hi! I'll ship the patch then [07:25:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dcausse@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922481 (https://phabricator.wikimedia.org/T322284) (owner: 10DCausse) [07:26:45] (03Merged) 10jenkins-bot: ttm: use new config option to separate readable and writable services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922481 (https://phabricator.wikimedia.org/T322284) (owner: 10DCausse) [07:26:52] RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:27:03] !log dcausse@deploy1002 Started scap: Backport for [[gerrit:922481|ttm: use new config option to separate readable and writable services (T322284)]] [07:27:05] T322284: Translate should have a way to configure readable and writable ttm services separately - https://phabricator.wikimedia.org/T322284 [07:28:24] !log dcausse@deploy1002 dcausse: Backport for [[gerrit:922481|ttm: use new config option to separate readable and writable services (T322284)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, 
mwdebug1001.eqiad.wmnet [07:28:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:28:56] abijeet: it should be live on mwdebug servers [07:29:18] checking [07:29:45] Special:SearchTranslations do seem to work for me on meta [07:30:31] testing the update code might require a full deploy to hit the jobrunners [07:31:09] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete sre.o11y.roll-restart-reboot-thanos-fe cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/927180 (owner: 10Muehlenhoff) [07:32:45] tested Special:SearchTranslations with mwdebug200* to hit the elastic cluster in codfw and it seems to work too [07:34:51] (03CR) 10Muehlenhoff: [C: 03+2] "Ran https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Renaming/Deleting_a_cookbook post-merge" [cookbooks] - 10https://gerrit.wikimedia.org/r/927180 (owner: 10Muehlenhoff) [07:35:08] translation suggestions also seem to load [07:35:54] great, lemme know when you want to move forward [07:36:18] (03PS1) 10Fabfur: hiera: Swap port 80 from varnish to haproxy on drmrs upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/927580 (https://phabricator.wikimedia.org/T323557) [07:36:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P48857 and previous config saved to /var/cache/conftool/dbconfig/20230606-073643-ladsgroup.json [07:36:48] OK. Lets move forward. We will have to check the jobrunners once deployed. 
[07:36:54] sure [07:37:15] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927580 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [07:37:26] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:39:30] (03CR) 10Vgutierrez: [C: 03+1] "don't forget tot disable puppet in upload@esams before merging this one" [puppet] - 10https://gerrit.wikimedia.org/r/927580 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [07:39:47] (03CR) 10Fabfur: [C: 03+2] hiera: Swap port 80 from varnish to haproxy on drmrs upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/927580 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [07:42:24] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:922481|ttm: use new config option to separate readable and writable services (T322284)]] (duration: 15m 20s) [07:42:28] T322284: Translate should have a way to configure readable and writable ttm services separately - https://phabricator.wikimedia.org/T322284 [07:43:00] abijeet: should be live now [07:43:32] thanks [07:44:15] abijeet: do you want to test an update? [07:44:23] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy: Add draining script and configuration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/926455 (https://phabricator.wikimedia.org/T338014) (owner: 10Clément Goubert) [07:45:41] (03CR) 10Tim Starling: "I wrote this because I was thinking about running moveToExternal.php for T299387. I wrote a little script to make a list of target wikis. 
" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 (owner: 10Tim Starling) [07:47:51] !log fabfur@cumin1001 START - Cookbook sre.cdn.run-puppet-restart-varnish rolling custom on A:cp-upload_esams and A:cp [07:48:00] (03PS2) 10Muehlenhoff: Remove option to manage sources.list [puppet] - 10https://gerrit.wikimedia.org/r/927130 (https://phabricator.wikimedia.org/T158562) [07:48:21] dcausse, I did. Looks good. [07:48:33] (03CR) 10Giuseppe Lavagetto: "LGTM, but please add a big TODO about moving the changes to the mesh module to a new minor version of it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/925776 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [07:49:18] weird, saw a quick burst of "PHP Fatal Error: require(): Failed opening required '/srv/mediawiki/php-1.41.0-wmf.11/includes/libs/rdbms/exception/DBConnectionError.php'" [07:50:47] abijeet: cool, thanks for all the work on this! [07:51:37] dcausse, thanks for your help. [07:51:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927130 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [07:51:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T336886)', diff saved to https://phabricator.wikimedia.org/P48858 and previous config saved to /var/cache/conftool/dbconfig/20230606-075149-ladsgroup.json [07:51:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2120.codfw.wmnet with reason: Maintenance [07:51:53] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [07:52:02] I'm not sure if "PHP Fatal Error: require(): Failed opening required '/srv/mediawiki/php-1.41.0-wmf.11/includes/libs/rdbms/exception/DBConnectionError.php: is related to our work [07:52:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2120.codfw.wmnet with reason: 
Maintenance [07:52:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T336886)', diff saved to https://phabricator.wikimedia.org/P48859 and previous config saved to /var/cache/conftool/dbconfig/20230606-075210-ladsgroup.json [07:52:24] (03PS1) 10Slyngshede: Signup: Email administrator on user creation error. [software/bitu] - 10https://gerrit.wikimedia.org/r/927581 [07:52:28] abijeet: most likely unrelated to translate it was on an unrelated (plwiki) [07:53:07] I think we can close the backport window [07:55:19] (03PS2) 10Slyngshede: Signup: Email administrator on user creation error. [software/bitu] - 10https://gerrit.wikimedia.org/r/927581 (https://phabricator.wikimedia.org/T338008) [07:59:45] (03PS1) 10Slyngshede: C:IDM Send errors to infrastructure foundations. [puppet] - 10https://gerrit.wikimedia.org/r/927582 (https://phabricator.wikimedia.org/T338008) [08:00:01] (03CR) 10Muehlenhoff: [C: 03+2] Remove option to manage sources.list [puppet] - 10https://gerrit.wikimedia.org/r/927130 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [08:04:16] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/927582 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [08:04:38] (03CR) 10Slyngshede: [C: 03+2] C:IDM Send errors to infrastructure foundations. 
[puppet] - 10https://gerrit.wikimedia.org/r/927582 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [08:05:57] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661) (owner: 10Volans) [08:07:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T336886)', diff saved to https://phabricator.wikimedia.org/P48860 and previous config saved to /var/cache/conftool/dbconfig/20230606-080758-ladsgroup.json [08:08:02] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [08:08:07] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I like this approach but please add a TODO somewhere about not monkey patching the mesh module." [deployment-charts] - 10https://gerrit.wikimedia.org/r/925776 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [08:13:02] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder) [08:13:42] 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon [thanks folks, I'll close this access request now] [08:15:32] !log installing openssl security updates on bullseye [08:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P48861 and previous config saved to /var/cache/conftool/dbconfig/20230606-082304-ladsgroup.json [08:28:51] (03PS1) 10Slyngshede: Signup: Information on how to report bugs. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/927583 (https://phabricator.wikimedia.org/T338008) [08:30:12] 10SRE, 10User-Urbanecm: fix-stagging-perms errors out with "find: paths must precede expression: `group'" - https://phabricator.wikimedia.org/T338180 (10hashar) That one might have caused or sounds related to the train blocker {T338205}. [08:30:16] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, see my question on the commit message." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121) (owner: 10Klausman) [08:30:29] (03PS1) 10Stevemunene: Revert "Decommission an-worker1058 from hadoop cluster" [puppet] - 10https://gerrit.wikimedia.org/r/927294 [08:31:54] hashar: i apparently fixed one thing and broke the other. changing ownership to deployment should fix this 🙂. uploading a patch... [08:31:58] (03CR) 10Klausman: Add rate limiting class for WME using LiftWing (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121) (owner: 10Klausman) [08:33:08] (03PS2) 10TheDJ: Remove old origin-with-crossorigin referrer policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927279 (https://phabricator.wikimedia.org/T338183) [08:33:18] (03PS3) 10Klausman: OAuthRateLimiter: Add rate limiting class for WME using LiftWing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121) [08:35:59] (03PS1) 10Urbanecm: fix-staging-perms: Change group owner to deployment [puppet] - 10https://gerrit.wikimedia.org/r/927584 (https://phabricator.wikimedia.org/T338205) [08:37:16] hi, can someone merge ^^ to unbreak train please (T338205)? [08:37:22] T338205: Scap train-presync failed to prepare 1.41.0-wmf.12 - https://phabricator.wikimedia.org/T338205 [08:37:39] maybe claime / jynus ? 
[08:37:53] Hi, let me check it out [08:37:53] urbanecm: ohhh [08:38:08] urbanecm: so that should usually be the person on clinic duty, but we can help [08:38:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P48863 and previous config saved to /var/cache/conftool/dbconfig/20230606-083810-ladsgroup.json [08:38:48] ah, okay. sorry :)) [08:39:00] no issue, if it is blocking you, anyone can help [08:39:05] I don't understand what is going on since Puppet assumes the patches directory to be owned by `deployment` [08:39:20] the fix-staging-perms.sh script got fixed last night to change the group to `wikidev` [08:39:40] but apparently it did not fix the current `/srv/patches/1.41.0-wmf.12` [08:39:48] hashar: your input would be required for https://gerrit.wikimedia.org/r/c/operations/puppet/+/927584 [08:39:59] meanwhile I can run a manual command if you want [08:40:31] (03CR) 10Volans: sre.cdn: move common functions to base class (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [08:40:46] (03CR) 10Clément Goubert: fix-staging-perms: Change group owner to deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927584 (https://phabricator.wikimedia.org/T338205) (owner: 10Urbanecm) [08:41:21] hashar: let me try to explain: /srv/patches is owned by deployment (and scap was able to create the new subdirectory), but it is unable to, e.g., 
write to the .git folder or to the log at /srv/mediawiki-staging/scap/log/history.log, since that's all wikidev-owned as of yesterday's change [08:42:01] (03PS2) 10Urbanecm: fix-staging-perms: Change group owner to deployment [puppet] - 10https://gerrit.wikimedia.org/r/927584 (https://phabricator.wikimedia.org/T338205) [08:42:05] (03CR) 10Urbanecm: fix-staging-perms: Change group owner to deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927584 (https://phabricator.wikimedia.org/T338205) (owner: 10Urbanecm) [08:42:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Transfer Neil Shah-Quinn's production access to new developer account - https://phabricator.wikimedia.org/T337591 (10MatthewVernon) a:03MatthewVernon [08:42:25] (03CR) 10Hashar: "My understanding is that yes /srv/patches should be owned by group `deployment` T338205#8905226 at least that is how it is setup by Puppet" [puppet] - 10https://gerrit.wikimedia.org/r/927584 (https://phabricator.wikimedia.org/T338205) (owner: 10Urbanecm) [08:42:39] OHHHHHH OF COURSE [08:43:11] hashar: is that a +1 for you? [08:43:14] so previously `/srv/patches/.git` was owned by `deployment` [08:43:15] yes [08:43:18] indeed [08:43:26] let me formally +1 [08:43:40] wanted to make sure everybody around was ok with it [08:43:58] (03CR) 10Hashar: [C: 03+1] "Urbanecm explained it to me. The scap train now fails because `/srv/patches/.git` is now owned by wikidev when it should be owned by deplo" [puppet] - 10https://gerrit.wikimedia.org/r/927584 (https://phabricator.wikimedia.org/T338205) (owner: 10Urbanecm) [08:44:30] claime: do you want to deploy, or shall I? [08:45:26] urbanecm: then I am guessing `/srv/patches` should have a set-group-id flag set to ensure all files below it indeed belong to the `deployment` group? 
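[editor's note] The set-group-id idea raised in the 08:45:26 message can be illustrated with a minimal sketch. It runs against a scratch temporary directory, not `/srv/patches`, and the function name and file name are hypothetical, chosen only for illustration. It assumes a POSIX system where the caller is allowed to set the bit on a directory it owns; it does not describe what Puppet or fix-staging-perms actually do.

```python
import os
import stat
import tempfile


def setgid_inheritance_demo() -> dict:
    """Set the setgid bit on a scratch directory and check group inheritance.

    Hypothetical helper for illustration only; it never touches /srv/patches.
    """
    with tempfile.TemporaryDirectory() as parent:
        # Add the set-group-id bit to the directory's existing permission bits.
        mode = stat.S_IMODE(os.stat(parent).st_mode)
        os.chmod(parent, mode | stat.S_ISGID)

        # Create a file inside; with setgid on the parent, the new file
        # takes the parent directory's group rather than the creator's.
        child = os.path.join(parent, "example.patch")
        with open(child, "w") as fh:
            fh.write("placeholder\n")

        parent_st = os.stat(parent)
        child_st = os.stat(child)
        return {
            "setgid_set": bool(parent_st.st_mode & stat.S_ISGID),
            "same_group": parent_st.st_gid == child_st.st_gid,
        }


if __name__ == "__main__":
    print(setgid_inheritance_demo())
```

The bit only affects files created after it is set, so existing files with the wrong group would still need a one-off chgrp, which matches the later remark that the fixing script handles the current state.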
[08:45:44] doing it [08:45:50] (03CR) 10Jcrespo: [C: 03+2] fix-staging-perms: Change group owner to deployment [puppet] - 10https://gerrit.wikimedia.org/r/927584 (https://phabricator.wikimedia.org/T338205) (owner: 10Urbanecm) [08:46:03] hashar: urbanecm, y'all sure you don't want that chgrp to happen to files grp owned by wikidev ? [08:46:19] Ah it was fixed [08:46:21] my bad [08:46:23] :) [08:46:27] I'm waking up, be nice [08:46:28] (03CR) 10Jbond: [C: 03+2] rake_modules: apply early monkey patches earlier [puppet] - 10https://gerrit.wikimedia.org/r/927181 (owner: 10Hashar) [08:46:57] you were quicker to notice than me to write "i thought i fixed it" :) [08:46:58] the script should be an erb template and the name of the group fed by Puppet (it comes from hiera value `deployment_group`) [08:47:08] I got another patch pending [08:47:26] (03CR) 10Jbond: [C: 03+2] rake_modules: early monkey patch URI.unescape [puppet] - 10https://gerrit.wikimedia.org/r/927194 (owner: 10Hashar) [08:47:31] (03CR) 10MVernon: [C: 04-2] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/923631 (https://phabricator.wikimedia.org/T337591) (owner: 10RhinosF1) [08:47:49] hashar: in the future, setGID would be helpful, but for now, it should be possible to fix potential permission changes by the fixing script. [08:48:01] jbond: we have a bit of a traffic jam [08:48:02] urbanecm: yeah I am a step ahead sorry :] [08:48:28] mine are ok to go not sure about Slyngshede [08:48:46] jynus: yes please [08:48:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb1002.eqiad.wmnet [08:48:55] no worries hashar :) [08:49:30] what's the IRC nick for simon? [08:49:41] slyngs maybe? [08:50:14] (I have a contractor coming at home in a minute so gotta go afk for a bit) [08:50:20] yes slyngs [08:50:21] ok to merge the idm-django-settings.erb change? 
[08:50:31] once the perm are fixed I think one has to run the `train-presync` systemd unit [08:51:00] or I will run the underlying command `/usr/bin/scap stage-train -Dfull_image_build:True --yes auto` [08:51:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1002.eqiad.wmnet [08:51:52] moritzm: sorry to bother you, ok to deploy b846353784, you +1ed it [08:52:17] it's been there for 30 minutes, he may be afk [08:52:24] jynus: i just checked its safe to merge [08:52:33] yeah, I also thought so [08:52:37] then merging [08:52:41] cheers [08:52:54] you cannot ever be careful enough [08:53:03] +1 [08:53:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T336886)', diff saved to https://phabricator.wikimedia.org/P48864 and previous config saved to /var/cache/conftool/dbconfig/20230606-085317-ladsgroup.json [08:53:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:53:20] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [08:53:28] hashar: urbanecm going to run puppet on deploy host [08:53:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:53:34] (03PS1) 10MVernon: admin: Add nshahquinn-wmf (new name for extant staff member) [puppet] - 10https://gerrit.wikimedia.org/r/927588 (https://phabricator.wikimedia.org/T337591) [08:53:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T336886)', diff saved to https://phabricator.wikimedia.org/P48865 and previous config saved to /var/cache/conftool/dbconfig/20230606-085337-ladsgroup.json [08:54:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2002.codfw.wmnet [08:58:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
netboxdb2002.codfw.wmnet [08:58:26] jynus: sorry, was distracted [08:58:36] no worries, jbond helped me [08:58:41] ack, thx [08:59:49] !log deploy1002: run /usr/local/sbin/fix-staging-perms (T338205) [08:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:52] T338205: Scap train-presync failed to prepare 1.41.0-wmf.12 - https://phabricator.wikimedia.org/T338205 [08:59:57] (03CR) 10Clément Goubert: [C: 03+1] admin: Add nshahquinn-wmf (new name for extant staff member) [puppet] - 10https://gerrit.wikimedia.org/r/927588 (https://phabricator.wikimedia.org/T337591) (owner: 10MVernon) [09:00:35] hashar: ownership should be correct now; leaving re-running the presync command to you. [09:01:49] (03CR) 10MVernon: [C: 03+2] admin: Add nshahquinn-wmf (new name for extant staff member) [puppet] - 10https://gerrit.wikimedia.org/r/927588 (https://phabricator.wikimedia.org/T337591) (owner: 10MVernon) [09:03:35] urbanecm: I am not seeing that file on the deployment server :-( [09:04:11] what do you mean by that file jynus? [09:04:37] fix-staging-perms.sh is not a deployed file on production? [09:04:47] it is, as ` /usr/local/sbin/fix-staging-perms` [09:05:03] i've already executed it. [09:05:05] ah, I see [09:05:17] I was on the wrong path [09:05:39] did it unblock you ? [09:06:16] it blocked hashar, not me, so that's a question for him. [09:06:53] oh, sorry [09:07:04] so hashar^ anything else? 
[09:07:15] I _think_ so, the ownership seems fixed now, but i'll leave confirmation to him :) [09:07:59] (03CR) 10Clément Goubert: [C: 03+2] envoy: Add draining script and configuration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/926455 (https://phabricator.wikimedia.org/T338014) (owner: 10Clément Goubert) [09:08:12] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] envoy: Add draining script and configuration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/926455 (https://phabricator.wikimedia.org/T338014) (owner: 10Clément Goubert) [09:09:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T336886)', diff saved to https://phabricator.wikimedia.org/P48866 and previous config saved to /var/cache/conftool/dbconfig/20230606-090933-ladsgroup.json [09:09:38] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [09:11:08] !log Building production images - T338014 [09:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:10] T338014: Allow setting draining strategy and time in envoy image - https://phabricator.wikimedia.org/T338014 [09:13:19] urbanecm: jynus: back sorry [09:13:20] jynus: Sorry, I was out, yes, no issues [09:14:08] would it be possible to manually trigger the `train-presync.timer` on deploy1002 ? 
[09:14:15] sure [09:14:21] I am guessing deployers don't have the required permission [09:14:23] that's why I was asking :-D [09:14:35] or maybe I can run it with systemctl --user but well hmm [09:14:36] :D [09:14:38] (03PS4) 10Klausman: Add rate limiting class for WME using LiftWing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121) [09:15:05] then if the underlying service pass, that will result in a SUCCESS email with the log output [09:15:14] (03PS1) 10Fabfur: hiera: Swap port 80 from varnish to haproxy on drmrs text cluster [puppet] - 10https://gerrit.wikimedia.org/r/927591 (https://phabricator.wikimedia.org/T323557) [09:15:30] (03CR) 10Klausman: [C: 03+1] changeprop: allow match_not in match_config for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [09:15:55] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927591 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [09:16:20] !log restarting acme-chief and nginx on acme-chief instances [09:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:22] hashar: I did systemctl start train-presync.timer , returned immediately [09:16:32] yeah [09:16:45] cause I guess it was not meant to run now [09:17:26] I will run the command manually [09:17:42] jynus: i think you'd need to do systemctl start train-presync, without the .timer [09:17:43] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Handle pod termination gracefully (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/925776 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [09:18:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:20] ok, doing [09:18:43] !log running systemctl start 
train-presync [09:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:18] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927593 (https://phabricator.wikimedia.org/T337526) [09:19:20] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927593 (https://phabricator.wikimedia.org/T337526) (owner: 10TrainBranchBot) [09:19:21] looks like something is working [09:19:22] :] [09:19:28] congratulations jynus and urbanecm ! [09:19:30] (03Merged) 10jenkins-bot: mediawiki: Handle pod termination gracefully [deployment-charts] - 10https://gerrit.wikimedia.org/r/925776 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [09:19:32] (03PS1) 10Zabe: Stop writing to revision_comment_temp in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927594 (https://phabricator.wikimedia.org/T299954) [09:19:50] glad it helped :) [09:20:02] _joe_: That development-charts CI is so fast now <3 [09:20:33] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927593 (https://phabricator.wikimedia.org/T337526) (owner: 10TrainBranchBot) [09:20:36] <_joe_> claime: we'll make it slow again, don't worry [09:20:44] lmao [09:20:49] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) @Papaul that's set already using option 66 that is a standard option, see `tftp-server-name` in `automation.conf`. And the previous comm... 
[09:21:13] (03Abandoned) 10Slyngshede: LDAP property editor [software/bitu] - 10https://gerrit.wikimedia.org/r/883111 (owner: 10Slyngshede) [09:21:20] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.12 refs T337526 [09:21:23] T337526: 1.41.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T337526 [09:22:03] (03CR) 10Vgutierrez: [C: 03+1] hiera: Swap port 80 from varnish to haproxy on drmrs text cluster [puppet] - 10https://gerrit.wikimedia.org/r/927591 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [09:22:18] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Transfer Neil Shah-Quinn's production access to new developer account - https://phabricator.wikimedia.org/T337591 (10MatthewVernon) Hi @nshahquinn-wmf. I've created the new account (and added it to the relevant groups, and made a new krb principal). If you n... [09:22:47] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41558/console" [puppet] - 10https://gerrit.wikimedia.org/r/927139 (https://phabricator.wikimedia.org/T316935) (owner: 10Jelto) [09:24:11] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: run four backups per day [puppet] - 10https://gerrit.wikimedia.org/r/927139 (https://phabricator.wikimedia.org/T316935) (owner: 10Jelto) [09:24:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P48867 and previous config saved to /var/cache/conftool/dbconfig/20230606-092439-ladsgroup.json [09:25:09] (03CR) 10Fabfur: [C: 03+2] hiera: Swap port 80 from varnish to haproxy on drmrs text cluster [puppet] - 10https://gerrit.wikimedia.org/r/927591 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [09:25:55] I have added a few follow up action on the task https://phabricator.wikimedia.org/T338205 [09:26:02] I will look at implementing them this afternoon [09:26:14] then I guess get those reviewed by others 
from releng :] [09:26:18] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:27:03] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:27:08] !log fabfur@cumin1001 START - Cookbook sre.cdn.run-puppet-restart-varnish rolling custom on A:cp-text_esams and A:cp [09:29:12] (03CR) 10Muehlenhoff: Signup: Information on how to report bugs. (033 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/927583 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [09:30:54] jynus: if I get it right you end up running the train for us as a result :D [09:30:58] (03CR) 10Muehlenhoff: Signup: Email administrator on user creation error. (033 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/927581 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [09:31:12] the docker image is quite slow to upload for whatever reason but that is known ( network traffic https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=deploy1002&var-datasource=thanos&var-cluster=misc&viewPanel=8&from=now-1h&to=now ) [09:31:34] hashar: :-O [09:31:49] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.cdn.run-puppet-restart-varnish (exit_code=1) rolling custom on A:cp-text_esams and A:cp [09:31:52] I have journald on a console [09:32:03] 10SRE, 10SRE-Access-Requests: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (10MatthewVernon) @nskaggs you're the listed approver for the `wmcs-roots` group, are you OK to approve access to that? @lmata are you happy to approve the request from this volun... 
[09:32:36] it looks network capped on the deploy host, but most probably the bottleneck is with the docker registry [09:32:58] the whole process should take something like 40/50 minutes but is otherwise fully automated [09:33:28] so I guess there is not much to do and at some point we will get an email (and maybe a !log here) [09:34:22] !log fabfur@cumin1001 START - Cookbook sre.cdn.run-puppet-restart-varnish rolling custom on A:cp-text_esams and A:cp [09:34:40] (03PS1) 10Clément Goubert: mw-debug: Bump envoy version for drain tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/927599 (https://phabricator.wikimedia.org/T331609) [09:35:29] I am off for contractor [09:38:08] (03PS2) 10Slyngshede: Signup: Information on how to report bugs. [software/bitu] - 10https://gerrit.wikimedia.org/r/927583 (https://phabricator.wikimedia.org/T338008) [09:38:14] (03CR) 10Slyngshede: Signup: Information on how to report bugs. (033 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/927583 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [09:38:48] (03PS1) 10Jbond: idp-test: add gitlab to idp test [puppet] - 10https://gerrit.wikimedia.org/r/927600 (https://phabricator.wikimedia.org/T320390) [09:39:12] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-debug: Bump envoy version for drain tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/927599 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [09:39:42] (03CR) 10Clément Goubert: [C: 03+2] mw-debug: Bump envoy version for drain tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/927599 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [09:39:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P48869 and previous config saved to /var/cache/conftool/dbconfig/20230606-093945-ladsgroup.json [09:40:37] (03Merged) 10jenkins-bot: mw-debug: Bump envoy version for drain 
tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/927599 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [09:41:22] (03CR) 10Jbond: [C: 03+2] idp-test: add gitlab to idp test [puppet] - 10https://gerrit.wikimedia.org/r/927600 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [09:41:35] (03CR) 10Volans: "Is this still worth given all the recent changes to this cookbook? If so I'll rebase and resolve conflicts, if not I'll abandon it." [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661) (owner: 10Volans) [09:41:45] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:41:48] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:42:50] (03PS3) 10Slyngshede: Signup: Email administrator on user creation error. [software/bitu] - 10https://gerrit.wikimedia.org/r/927581 (https://phabricator.wikimedia.org/T338008) [09:42:57] (03CR) 10Slyngshede: Signup: Email administrator on user creation error. 
(033 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/927581 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [09:44:39] (03PS1) 10Jbond: gitlab: move gitlab to test idp [puppet] - 10https://gerrit.wikimedia.org/r/927602 (https://phabricator.wikimedia.org/T320390) [09:45:35] (03PS1) 10EoghanGaffney: apt: Update gitlab package [puppet] - 10https://gerrit.wikimedia.org/r/927603 (https://phabricator.wikimedia.org/T338202) [09:45:56] (03CR) 10Slyngshede: [C: 03+1] sre.ganeti.makevm: refactor to simplify expansion (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661) (owner: 10Volans) [09:46:41] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/927603 (https://phabricator.wikimedia.org/T338202) (owner: 10EoghanGaffney) [09:47:12] (03CR) 10EoghanGaffney: [C: 03+2] apt: Update gitlab package [puppet] - 10https://gerrit.wikimedia.org/r/927603 (https://phabricator.wikimedia.org/T338202) (owner: 10EoghanGaffney) [09:47:33] (03CR) 10Muehlenhoff: Signup: Information on how to report bugs. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/927583 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [09:48:13] jbond: You ok for me to puppetmerge your change? [09:48:41] (03CR) 10Muehlenhoff: Signup: Email administrator on user creation error. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/927581 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [09:50:44] (03PS3) 10Slyngshede: Signup: Information on how to report bugs. [software/bitu] - 10https://gerrit.wikimedia.org/r/927583 (https://phabricator.wikimedia.org/T338008) [09:50:55] (03CR) 10Slyngshede: Signup: Information on how to report bugs. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/927583 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [09:51:23] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Signup: Information on how to report bugs. [software/bitu] - 10https://gerrit.wikimedia.org/r/927583 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [09:54:06] (03PS4) 10Slyngshede: Signup: Email administrator on user creation error. [software/bitu] - 10https://gerrit.wikimedia.org/r/927581 (https://phabricator.wikimedia.org/T338008) [09:54:30] (03CR) 10Slyngshede: Signup: Email administrator on user creation error. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/927581 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [09:54:34] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Signup: Email administrator on user creation error. [software/bitu] - 10https://gerrit.wikimedia.org/r/927581 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [09:54:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T336886)', diff saved to https://phabricator.wikimedia.org/P48870 and previous config saved to /var/cache/conftool/dbconfig/20230606-095451-ladsgroup.json [09:54:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance [09:54:55] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [09:55:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance [09:55:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T336886)', diff saved to https://phabricator.wikimedia.org/P48871 and previous config saved to /var/cache/conftool/dbconfig/20230606-095512-ladsgroup.json [09:55:34] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from 
the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) > Here are logs from my login to the admin page: https://logstash.wikimedia.org/goto/61b6d364170814f8682de1275d89d767 Perhaps i... [09:56:17] (03PS1) 10Urbanecm: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/927605 (https://phabricator.wikimedia.org/T338094) [09:56:32] (03CR) 10Urbanecm: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/927605 (https://phabricator.wikimedia.org/T338094) (owner: 10Urbanecm) [09:56:34] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [09:57:23] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/927605 (https://phabricator.wikimedia.org/T338094) (owner: 10Urbanecm) [09:58:09] !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [09:58:55] !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [09:59:33] !log urbanecm@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1000) [10:00:46] !log urbanecm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [10:01:01] !log urbanecm@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [10:02:13] !log urbanecm@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [10:06:37] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:07:09] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:07:16] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:07:39] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:10:34] (03CR) 10Effie Mouzeli: [C: 03+1] "Cool! Please ping me on IRC when merging this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927236 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [10:12:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T336886)', diff saved to https://phabricator.wikimedia.org/P48872 and previous config saved to /var/cache/conftool/dbconfig/20230606-101205-ladsgroup.json [10:12:09] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [10:12:20] (03CR) 10Muehlenhoff: "Did some smoke tests on bookworm hosts and works great, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [10:13:39] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [10:13:57] (03CR) 10Jbond: [C: 03+2] trafficserver::backend: Add a cache config for puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/927172 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [10:14:32] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/927131 (https://phabricator.wikimedia.org/T338125) (owner: 10Arturo Borrero Gonzalez) [10:16:14] 10SRE, 10SRE-Access-Requests: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (10cmooney) p:05Triage→03Medium a:03cmooney [10:16:30] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [10:16:34] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for nskaggs - https://phabricator.wikimedia.org/T337571 (10cmooney) a:03cmooney [10:16:42] (03CR) 10Filippo Giunchedi: [C: 03+1] add 0.6.2 ui/package*.json [debs/pyrra] - 10https://gerrit.wikimedia.org/r/927240 (owner: 10Herron) [10:16:59] 10SRE, 10LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (10cmooney) a:03cmooney [10:17:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Zabe - https://phabricator.wikimedia.org/T337703 (10cmooney) p:05Triage→03Medium a:03cmooney [10:17:45] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.12 refs T337526 (duration: 56m 25s) [10:17:49] T337526: 1.41.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T337526 [10:17:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt: codfw1dev: add cloud_private_subnet [puppet] - 10https://gerrit.wikimedia.org/r/927131 (https://phabricator.wikimedia.org/T338125) (owner: 10Arturo Borrero Gonzalez) [10:18:17] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:18:36] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:18:47] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10cmooney) p:05Triage→03Medium a:03cmooney 
[10:18:52] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:19:04] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:19:46] (03PS4) 10Cparle: Alert if there's a big change in image-suggestions compared to yesterday [alerts] - 10https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010) [10:20:01] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:20:06] !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.10 (duration: 02m 18s) [10:20:18] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:20:25] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [10:20:49] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [10:21:26] (03CR) 10Cparle: Alert if there's a big change in image-suggestions compared to yesterday (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010) (owner: 10Cparle) [10:22:58] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10MatthewVernon) @KFrancis is the NDA still a work-in-progress here? 
[I'm on clinic duty this week and just want to make sure this ticket isn't waiting on SRE action] [10:24:39] (03PS40) 10Slyngshede: P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) [10:27:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P48873 and previous config saved to /var/cache/conftool/dbconfig/20230606-102712-ladsgroup.json [10:28:12] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:28:23] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:30:39] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [10:30:56] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [10:33:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/927198 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [10:37:24] (03CR) 10Clément Goubert: [C: 03+1] Add rate limiting class for WME using LiftWing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121) (owner: 10Klausman) [10:38:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add rate limiting class for WME using LiftWing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121) (owner: 10Klausman) [10:38:43] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy LLM model falcon-7b-instruct [deployment-charts] - 10https://gerrit.wikimedia.org/r/927611 (https://phabricator.wikimedia.org/T333861) [10:41:22] jouncebot: nowandnext [10:41:22] For the next 0 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1000) [10:41:22] In 2 hour(s) and 18 minute(s): UTC afternoon backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1300) [10:41:22] In 2 hour(s) and 18 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1300) [10:41:47] (03CR) 10Jbond: [C: 03+2] ":) thanks" [puppet] - 10https://gerrit.wikimedia.org/r/926464 (owner: 10Jbond) [10:42:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P48874 and previous config saved to /var/cache/conftool/dbconfig/20230606-104218-ladsgroup.json [10:43:15] (03CR) 10Zabe: [C: 03+2] Stop writing to revision_comment_temp in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927594 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [10:44:13] (03Merged) 10jenkins-bot: Stop writing to revision_comment_temp in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927594 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [10:44:50] !log zabe@deploy1002 Started scap: Backport for [[gerrit:927594|Stop writing to revision_comment_temp in group1 wikis (T299954)]] [10:44:54] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [10:46:23] !log zabe@deploy1002 zabe: Backport for [[gerrit:927594|Stop writing to revision_comment_temp in group1 wikis (T299954)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [10:48:32] (03PS1) 10Urbanecm: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/927612 (https://phabricator.wikimedia.org/T338094) [10:48:50] (03CR) 10Urbanecm: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/927612 (https://phabricator.wikimedia.org/T338094) (owner: 10Urbanecm) [10:49:39] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/927612 (https://phabricator.wikimedia.org/T338094) (owner: 10Urbanecm) [10:50:08] !log urbanecm@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [10:50:11] !log urbanecm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [10:50:18] !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [10:50:37] !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [10:51:20] !log urbanecm@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [10:51:54] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:927594|Stop writing to revision_comment_temp in group1 wikis (T299954)]] (duration: 07m 03s) [10:51:57] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [10:52:33] !log urbanecm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [10:52:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:52:56] (03CR) 10Ilias Sarantopoulos: [C: 03+1] changeprop: allow match_not in match_config for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [10:53:16] !log urbanecm@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [10:53:58] !log urbanecm@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [10:55:11] (03PS1) 10Jbond: wmflib::dump_params: update signature to use optional_repeated_param [puppet] - 10https://gerrit.wikimedia.org/r/927613 [10:56:41] (03CR) 10Jbond: puppetserver: add additional config options (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:57:12] (03CR) 10CI reject: [V: 04-1] wmflib::dump_params: update signature to use optional_repeated_param [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [10:57:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T336886)', diff saved to https://phabricator.wikimedia.org/P48875 and previous config saved to /var/cache/conftool/dbconfig/20230606-105724-ladsgroup.json [10:57:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [10:57:27] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [10:57:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:57:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [10:57:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T336886)', diff saved to https://phabricator.wikimedia.org/P48876 and previous config saved to /var/cache/conftool/dbconfig/20230606-105756-ladsgroup.json [10:59:57] (03CR) 10Effie Mouzeli: ipoid: add helmfile.d config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/921707 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [11:01:51] (03PS5) 10Klausman: OAuthRateLimiter: Add rate limiting class for WME using LiftWing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121) [11:03:01] PROBLEM - Checks that the local airflow scheduler for airflow @platform_eng is working properly on an-airflow1004 
is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-platform_eng /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1004.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [11:03:01] (03PS1) 10Zabe: Stop writing to revision_comment_temp everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927615 (https://phabricator.wikimedia.org/T299954) [11:03:05] !log eoghan@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrading Gitlab to 15.10.8 [11:03:21] (03PS16) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [11:06:15] RECOVERY - Checks that the local airflow scheduler for airflow @platform_eng is working properly on an-airflow1004 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-platform_eng /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1004.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [11:07:17] (03PS17) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [11:07:17] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder) [11:07:56] (03CR) 10Jbond: "Thanks updated, I have also removed all references to nft, I think we can add this in separate changes" [puppet] - 10https://gerrit.wikimedia.org/r/919061 (owner: 10Jbond) [11:09:16] (03PS18) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [11:11:07] (03CR) 10Majavah: firewall: add basic firewall class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919061 (owner: 10Jbond) [11:11:42] (03CR) 10CI reject: [V: 04-1] firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) 
(owner: 10Jbond) [11:13:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T336886)', diff saved to https://phabricator.wikimedia.org/P48877 and previous config saved to /var/cache/conftool/dbconfig/20230606-111313-ladsgroup.json [11:13:16] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [11:13:17] (03CR) 10Jbond: [V: 03+1 C: 04-1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41559/console" [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond) [11:15:03] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:32] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy bloom-3b with AMD GPU support [deployment-charts] - 10https://gerrit.wikimedia.org/r/927620 (https://phabricator.wikimedia.org/T334583) [11:17:31] (03CR) 10Kosta Harlan: ipoid: add helmfile.d config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/921707 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [11:18:05] (03PS1) 10Slyngshede: C:IDM ADMINS must be a list of tuples. 
[puppet] - 10https://gerrit.wikimedia.org/r/927621 [11:18:31] (03CR) 10Jbond: [C: 04-1] "from pcc: https://puppet-compiler.wmflabs.org/output/927204/41559/" [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond) [11:18:57] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, but this PCC needs to be validated by David:" [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond) [11:23:10] (03CR) 10Stevemunene: [C: 03+2] Revert "Decommission an-worker1058 from hadoop cluster" [puppet] - 10https://gerrit.wikimedia.org/r/927294 (owner: 10Stevemunene) [11:23:36] (03PS2) 10KartikMistry: Update MinT to 2023-06-06-111852-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/927160 (https://phabricator.wikimedia.org/T337686) [11:24:49] (03CR) 10Jbond: [C: 04-1] "Thanks arturo, ill take your +1 to related to all but the ceph hosts. this is still a -1. With the current change the ceph hosts would be" [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond) [11:25:00] (03PS2) 10Slyngshede: C:IDM ADMINS must be a list of tuples. 
[puppet] - 10https://gerrit.wikimedia.org/r/927621 [11:26:20] (03CR) 10Filippo Giunchedi: Alert if there's a big change in image-suggestions compared to yesterday (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010) (owner: 10Cparle) [11:27:33] (03CR) 10Filippo Giunchedi: [V: 03+1] profile: exclude kubelet hosts from cadvisor rollout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927198 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [11:28:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P48878 and previous config saved to /var/cache/conftool/dbconfig/20230606-112819-ladsgroup.json [11:31:20] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:31:40] (03PS5) 10Jbond: interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204 [11:31:42] (03PS1) 10Jbond: P:cloudceph::osd: explicitly set the interface and make route persist [puppet] - 10https://gerrit.wikimedia.org/r/927622 [11:31:46] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:33:19] (03CR) 10Muehlenhoff: firewall: add basic firewall class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919061 (owner: 10Jbond) [11:34:14] (03PS2) 10Filippo Giunchedi: profile: exclude kubelet production hosts from cadvisor rollout [puppet] - 10https://gerrit.wikimedia.org/r/927198 (https://phabricator.wikimedia.org/T108027) [11:36:12] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41560/console" [puppet] - 10https://gerrit.wikimedia.org/r/927198 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [11:37:52] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:37:59] (03CR) 10Muehlenhoff: [C: 
03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/919061 (owner: 10Jbond) [11:38:12] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:38:14] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:38:51] (03CR) 10Filippo Giunchedi: profile: exclude kubelet production hosts from cadvisor rollout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927198 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [11:39:22] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:42:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:43:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P48879 and previous config saved to /var/cache/conftool/dbconfig/20230606-114327-ladsgroup.json [11:45:06] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:26] jouncebot: nowandnext [11:47:26] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [11:47:27] In 1 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1300) [11:47:27] In 1 hour(s) and 12 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1300) [11:47:37] (LogstashKafkaConsumerLag) 
resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:48:00] (03CR) 10Kamila Součková: [C: 03+2] OAuthRateLimiter: Add rate limiting class for WME using LiftWing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121) (owner: 10Klausman) [11:48:47] (03Merged) 10jenkins-bot: OAuthRateLimiter: Add rate limiting class for WME using LiftWing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121) (owner: 10Klausman) [11:49:06] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:48] !log kamila@deploy1002 Started scap: Backport for [[gerrit:927218|OAuthRateLimiter: Add rate limiting class for WME using LiftWing (T338121)]] [11:51:52] T338121: Investigate ad-hoc traffic class for API GW rate limits applied to Inference services as used by WME - https://phabricator.wikimedia.org/T338121 [11:52:48] jynus: urbanecm: looks like the scap preparation worked fine thank you! [11:53:14] I didn't do anything but running 1 command you told me to [11:53:15] Awesome! [11:53:24] !log kamila@deploy1002 kamila and klausman: Backport for [[gerrit:927218|OAuthRateLimiter: Add rate limiting class for WME using LiftWing (T338121)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [11:54:04] train-presync.service: Succeeded. [11:54:10] Started Perform beginning-of-week train operations. [11:54:18] 10:20:06 Pruned MediaWiki: 1.41.0-wmf.10 (duration: 02m 18s) [11:54:22] 10:20:06 DONE! 
[11:55:28] (03PS1) 10Filippo Giunchedi: o11y: bump logstash kafka lag threshold [alerts] - 10https://gerrit.wikimedia.org/r/927626 [11:58:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T336886)', diff saved to https://phabricator.wikimedia.org/P48880 and previous config saved to /var/cache/conftool/dbconfig/20230606-115833-ladsgroup.json [11:58:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [11:58:37] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [11:58:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [11:58:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:59:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:59:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T336886)', diff saved to https://phabricator.wikimedia.org/P48881 and previous config saved to /var/cache/conftool/dbconfig/20230606-115911-ladsgroup.json [12:00:43] !log kamila@deploy1002 Finished scap: Backport for [[gerrit:927218|OAuthRateLimiter: Add rate limiting class for WME using LiftWing (T338121)]] (duration: 08m 54s) [12:00:46] T338121: Investigate ad-hoc traffic class for API GW rate limits applied to Inference services as used by WME - https://phabricator.wikimedia.org/T338121 [12:01:47] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/927621 (owner: 10Slyngshede) [12:02:20] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. 
Your choice on how tight the rules need to be, but it doesn't seem unreasonable to do it this way, as you say all those IPs are und" [puppet] - 10https://gerrit.wikimedia.org/r/927140 (https://phabricator.wikimedia.org/T338125) (owner: 10Arturo Borrero Gonzalez) [12:02:58] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openstack: rabbitmq: simplify cloud-private-subnet firewalling support [puppet] - 10https://gerrit.wikimedia.org/r/927140 (https://phabricator.wikimedia.org/T338125) (owner: 10Arturo Borrero Gonzalez) [12:05:25] (03CR) 10Slyngshede: [C: 03+2] C:IDM ADMINS must be a list of tuples. [puppet] - 10https://gerrit.wikimedia.org/r/927621 (owner: 10Slyngshede) [12:09:36] !log eoghan@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrading Gitlab to 15.10.8 [12:11:52] (03PS1) 10Jbond: P:cloudceph::osd: drop the profile::cloudceph::osd::hosts [puppet] - 10https://gerrit.wikimedia.org/r/927628 [12:12:19] (03CR) 10CI reject: [V: 04-1] P:cloudceph::osd: drop the profile::cloudceph::osd::hosts [puppet] - 10https://gerrit.wikimedia.org/r/927628 (owner: 10Jbond) [12:13:17] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder) [12:13:34] (03CR) 10David Caro: "run experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927622 (owner: 10Jbond) [12:14:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T336886)', diff saved to https://phabricator.wikimedia.org/P48884 and previous config saved to /var/cache/conftool/dbconfig/20230606-121405-ladsgroup.json [12:14:08] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [12:14:41] (03PS3) 10KartikMistry: Update MinT to 2023-06-06-120533-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/927160 (https://phabricator.wikimedia.org/T337910) [12:15:36] !log eoghan@cumin1001 START - Cookbook 
sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrading Gitlab to 15.10.8 [12:17:52] (03PS1) 10Esanders: Remove wgDiscussionToolsEnable config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927632 (https://phabricator.wikimedia.org/T322497) [12:18:16] jouncebot: nowandnext [12:18:16] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [12:18:16] In 0 hour(s) and 41 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1300) [12:18:16] In 0 hour(s) and 41 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1300) [12:19:06] !log redeploying 927218 to mw-on-k8s - T338121 [12:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:10] T338121: Investigate ad-hoc traffic class for API GW rate limits applied to Inference services as used by WME - https://phabricator.wikimedia.org/T338121 [12:19:45] !log cgoubert@deploy1002 Started scap: (no justification provided) [12:20:14] (03CR) 10David Caro: P:cloudceph::osd: explicitly set the interface and make route persist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927622 (owner: 10Jbond) [12:21:54] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41564/console" [puppet] - 10https://gerrit.wikimedia.org/r/927622 (owner: 10Jbond) [12:21:56] !log cgoubert@deploy1002 Finished scap: (no justification provided) (duration: 02m 10s) [12:24:41] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/927622 (owner: 10Jbond) [12:24:51] (03CR) 10David Caro: [V: 03+1 C: 03+1] P:cloudceph::osd: explicitly set the interface and make route persist [puppet] - 10https://gerrit.wikimedia.org/r/927622 (owner: 10Jbond) [12:26:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:29:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P48885 and previous config saved to /var/cache/conftool/dbconfig/20230606-122911-ladsgroup.json [12:29:51] (03CR) 10David Caro: "This would save a lot of yaml :)" [puppet] - 10https://gerrit.wikimedia.org/r/927628 (owner: 10Jbond) [12:31:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:33:36] (03CR) 10David Caro: P:cloudceph::osd: drop the profile::cloudceph::osd::hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927628 (owner: 10Jbond) [12:40:13] (03PS4) 10Jkieserman: Deploy GDI safety survey to JA and RU wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan) [12:41:45] (03PS1) 10Ottomata: EventStreamConfig - development.network.probe- disable canary events and hadoop ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927647 (https://phabricator.wikimedia.org/T332024) [12:43:32] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Users management on SONiC - https://phabricator.wikimedia.org/T338028 (10MoritzMuehlenhoff) It would appear to me that given that we have the full freedom of a Debian-based OS here (after all next to the FLOSS aspect the major perk of SONiC... 
[12:44:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P48886 and previous config saved to /var/cache/conftool/dbconfig/20230606-124417-ladsgroup.json [12:46:54] (03PS1) 10Esanders: Remove most DiscussionTools feature configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927653 (https://phabricator.wikimedia.org/T322497) [12:48:53] 10Puppet, 10Cloud-VPS, 10cloud-services-team: puppet package versioning on Bookworm for cloud-vps - https://phabricator.wikimedia.org/T338195 (10Andrew) Thanks @Dzahn ! The challenge is to encode that in cloud-init yaml (which may or may not be possible) [12:51:35] (03CR) 10Ottomata: [C: 03+2] EventStreamConfig - development.network.probe- disable canary events and hadoop ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927647 (https://phabricator.wikimedia.org/T332024) (owner: 10Ottomata) [12:51:53] sneaking in a config change before the backport window ^ [12:52:11] (03PS2) 10Arturo Borrero Gonzalez: cloud_private_subnet: persist static routes [puppet] - 10https://gerrit.wikimedia.org/r/927210 (https://phabricator.wikimedia.org/T337758) [12:52:55] (03Merged) 10jenkins-bot: EventStreamConfig - development.network.probe- disable canary events and hadoop ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927647 (https://phabricator.wikimedia.org/T332024) (owner: 10Ottomata) [12:53:32] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bullseye [12:53:38] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye [12:54:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private_subnet: persist static routes [puppet] - 10https://gerrit.wikimedia.org/r/927210 (https://phabricator.wikimedia.org/T337758) (owner: 10Arturo Borrero Gonzalez) [12:55:11] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bullseye [12:55:18] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye [12:56:49] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.run-puppet-restart-varnish (exit_code=0) rolling custom on A:cp-upload_esams and A:cp [12:59:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T336886)', diff saved to https://phabricator.wikimedia.org/P48887 and previous config saved to /var/cache/conftool/dbconfig/20230606-125923-ladsgroup.json [12:59:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [12:59:27] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [12:59:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [12:59:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T336886)', diff saved to https://phabricator.wikimedia.org/P48888 and previous config saved to /var/cache/conftool/dbconfig/20230606-125944-ladsgroup.json [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: That opportune time is upon us again. 
Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1300). [13:00:04] _joe_, duesen, and klausman: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1300) [13:00:15] i can deploy [13:00:19] I’m in a meeting; can everyone self-serve? [13:00:32] i'm here Lucas :) [13:00:33] <_joe_> Lucas_WMDE: I think so yes [13:00:37] !log eoghan@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: Upgrading Gitlab to 15.10.8 [13:00:43] urbanecm: also good :) [13:00:46] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10Ottomata) Temporarily disabling hadoop ingestion and canary events. The `ctx` field needs a schema. [13:00:47] <_joe_> so first let me see if my patches were reviewed [13:00:53] but I checked earlier and it looks like they’re all in deployment groups at least ^^ [13:01:09] doesn't seem so _joe_ [13:01:20] <_joe_> *deep sigh* [13:01:29] <_joe_> ok let me proceed with duesen if he's around [13:01:34] <_joe_> and Amir1 as well [13:01:47] <_joe_> claime: are we in the clear re: k8s deployments? [13:01:47] hello [13:01:50] I'm around [13:01:50] (03PS1) 10Slyngshede: Error message: Add custom error messages for 403 and 500. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/927659 [13:01:58] _joe_: yep [13:02:36] it seems 927218 was already deployed, so _joe_ / duesen have the window all for themselves :)) [13:02:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:03:03] <_joe_> urbanecm: cool, I can handle it [13:03:08] go for it. [13:03:16] (03PS1) 10Muehlenhoff: Point IDP login page to IDM for signup [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/927661 (https://phabricator.wikimedia.org/T338008) [13:03:30] _joe_, urbanecm : I have a conflict, but I'm available at the half hour [13:03:40] is that ok? [13:03:57] <_joe_> duesen: sure, let's do the deployment then [13:04:59] (03PS1) 10Slyngshede: C:IDM switch to read-write LDAP server. [puppet] - 10https://gerrit.wikimedia.org/r/927662 [13:05:16] <_joe_> Amir1: then can I ask you to double-check https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/927115 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/927116/1 so that I can deploy those? 
[13:05:41] sure [13:06:09] !log otto@deploy1002 Synchronized wmf-config/ext-EventStreamConfig.php: EventStreamConfig - Disable canary events and hadoop ingestion for development.network.probe - T332024 (duration: 07m 17s) [13:06:10] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:13] T332024: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 [13:06:32] <_joe_> Amir1: <3 [13:06:41] _joe_: do you want to set $wgParsoidEnableREST = false; [13:06:55] it's being removed from api_apperserver [13:07:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/927662 (owner: 10Slyngshede) [13:07:06] <_joe_> Amir1: yes it's intentional [13:07:21] <_joe_> we actually want to serve parsoid requests from the api servers if requested [13:07:27] Awesome [13:07:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:07:46] (03CR) 10Ladsgroup: [C: 03+1] Load and enable parsoid everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927116 (https://phabricator.wikimedia.org/T334980) (owner: 10Giuseppe Lavagetto) [13:07:50] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41566/console" [puppet] - 10https://gerrit.wikimedia.org/r/927662 (owner: 10Slyngshede) [13:08:18] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10Krinkle) [13:08:32] ugh, -2 to the other patch, removing the memory I introduced :( [13:08:39] *memory limit [13:09:00] <_joe_> Amir1: lol [13:09:21] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:IDM switch to read-write LDAP server. 
[puppet] - 10https://gerrit.wikimedia.org/r/927662 (owner: 10Slyngshede) [13:09:36] (03CR) 10Ladsgroup: [C: 03+1] "Looks good while removing the satanic memory limit is making me sad." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927115 (https://phabricator.wikimedia.org/T334980) (owner: 10Giuseppe Lavagetto) [13:09:52] <_joe_> Amir1: I had a doubt about sync file order [13:10:22] What about it? [13:10:27] <_joe_> we need to sync commonsettings first and initalizesettings later? [13:10:50] (03PS1) 10Andrew Bogott: vendordata: pin puppet packages to wikimedia repo [puppet] - 10https://gerrit.wikimedia.org/r/927664 (https://phabricator.wikimedia.org/T338195) [13:10:55] order isn't really a concern anymore with scap backport; all changes in the commit take effect at the same time. [13:11:06] <_joe_> urbanecm: that's not 100% accurate [13:11:21] <_joe_> on jobrunners, we still do revalidate opcache [13:11:30] !log eoghan@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrading Gitlab to 15.10.8 [13:11:47] (03CR) 10Andrew Bogott: [C: 03+2] vendordata: pin puppet packages to wikimedia repo [puppet] - 10https://gerrit.wikimedia.org/r/927664 (https://phabricator.wikimedia.org/T338195) (owner: 10Andrew Bogott) [13:11:48] <_joe_> but it's ok, the only place where it could be an issue is on parsoid servers, which indeed do revalidate opcache [13:11:56] <_joe_> err, don't [13:12:03] _joe_: yeah, first CS then IS [13:12:26] okay, in that case, thanks for correcting me. 
[13:12:26] if IS arrives first, it removes the variable definition leading to CS being confused about the memory limit [13:12:44] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:12:48] <_joe_> Amir1: yeah and given we also have jobrunners [13:12:58] <_joe_> uhh claime can you check ^^ [13:13:02] (03CR) 10Herron: [C: 03+1] "SGTM" [alerts] - 10https://gerrit.wikimedia.org/r/927626 (owner: 10Filippo Giunchedi) [13:13:11] yeah [13:13:12] <_joe_> Amir1: so yeah I need to do this the old way [13:13:16] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Use the parsoid memory limit everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927115 (https://phabricator.wikimedia.org/T334980) (owner: 10Giuseppe Lavagetto) [13:14:08] Getting 503s [13:14:10] Rerunning to see [13:14:33] (03PS2) 10Jbond: P:cloudceph::osd: drop the profile::cloudceph::osd::hosts [puppet] - 10https://gerrit.wikimedia.org/r/927628 [13:14:50] <_joe_> ah there is a merge conflict, sigh [13:15:00] httpbb check all good now [13:15:00] (03CR) 10CI reject: [V: 04-1] P:cloudceph::osd: drop the profile::cloudceph::osd::hosts [puppet] - 10https://gerrit.wikimedia.org/r/927628 (owner: 10Jbond) [13:15:08] <_joe_> ok then maybe I can separate this into two patches, actually [13:15:12] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T336886)', diff saved to https://phabricator.wikimedia.org/P48889 and previous config saved to /var/cache/conftool/dbconfig/20230606-131512-ladsgroup.json [13:15:15] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [13:15:19] (03CR) 10Jbond: [C: 03+2] 
firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 (owner: 10Jbond) [13:15:24] (03PS2) 10Giuseppe Lavagetto: Use the parsoid memory limit everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927115 (https://phabricator.wikimedia.org/T334980) [13:15:24] _joe_: going to lunch btw [13:15:29] it's getting late-ish [13:15:31] <_joe_> claime: merci [13:15:35] <_joe_> yes indeed [13:15:50] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Papaul) @Volans ok i didn't look in the automation.conf file for the option 66 thanks [13:17:29] is that an actual merge conflict? Almost all of merge conflicts in mw-config repo are lies [13:17:38] (03PS1) 10Stevemunene: Decommission analytics1058 from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/927667 (https://phabricator.wikimedia.org/T338227) [13:18:32] <_joe_> Amir1: ofc it's a lie [13:18:40] <_joe_> should I re-do it as 2 patches? 
[13:18:45] <_joe_> it's probably safer [13:19:04] sure, at least scap backport would be easier than the old way [13:19:12] (03Abandoned) 10Giuseppe Lavagetto: Use the parsoid memory limit everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927115 (https://phabricator.wikimedia.org/T334980) (owner: 10Giuseppe Lavagetto) [13:21:08] (03PS1) 10Majavah: idm: fix WMCS spelling [puppet] - 10https://gerrit.wikimedia.org/r/927668 [13:23:14] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:23:59] (03PS1) 10Giuseppe Lavagetto: Raise memory limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927670 (https://phabricator.wikimedia.org/T334980) [13:24:01] (03PS1) 10Giuseppe Lavagetto: Drop wmgMemoryLimitParsoid from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927671 [13:24:06] (03PS5) 10Cparle: Alert if there's a big change in image-suggestions compared to yesterday [alerts] - 10https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010) [13:24:45] <_joe_> Amir1: there they are, there is ofc a small moment when the first patch might cause some requests to have the wrong memory limit [13:24:47] <_joe_> but that is ok [13:25:13] (03CR) 10CI reject: [V: 04-1] Raise memory limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927670 (https://phabricator.wikimedia.org/T334980) (owner: 10Giuseppe Lavagetto) [13:25:16] awesome [13:25:20] jerkins bot [13:25:21] <_joe_> uhm what did I do wrong [13:25:30] (03PS1) 10Slyngshede: C:IDM Allow Bitu library to write to LDAP [puppet] - 10https://gerrit.wikimedia.org/r/927672 [13:25:57] (03CR) 10CI reject: [V: 04-1] Alert if there's a big change in image-suggestions compared to yesterday [alerts] - 10https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010) (owner: 10Cparle) 
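Note on the sync-order exchange above: the concern (CommonSettings must land before InitialiseSettings, because the new IS no longer defines wmgMemoryLimitParsoid) can be sketched as a toy model. This is only an illustration under assumptions, not how scap works: the "old"/"new" version labels, the function names, and the assumption that the new CommonSettings no longer reads the variable are all hypothetical.

```python
from itertools import permutations

# Toy model (hypothetical): each config file on a server is either "old" or "new".
# Assumptions for illustration only:
#   - old CommonSettings (CS) reads wmgMemoryLimitParsoid, new CS does not
#   - old InitialiseSettings (IS) defines it, new IS does not


def is_consistent(cs_version: str, is_version: str) -> bool:
    """A state is consistent unless CS reads a variable that IS no longer defines."""
    cs_reads_var = cs_version == "old"
    is_defines_var = is_version == "old"
    return (not cs_reads_var) or is_defines_var


def intermediate_states(order):
    """Yield the server state after each file in `order` has been synced."""
    state = {"CS": "old", "IS": "old"}
    for name in order:
        state[name] = "new"
        yield dict(state)


def safe_order(order) -> bool:
    """True if every state reached while syncing in `order` is consistent."""
    return all(is_consistent(s["CS"], s["IS"]) for s in intermediate_states(order))


if __name__ == "__main__":
    for order in permutations(["CS", "IS"]):
        print(" then ".join(order), "->", "safe" if safe_order(order) else "unsafe")
```

In this model only "CS then IS" avoids a window where the old CommonSettings looks for a variable that is already gone, which matches the reasoning given in the channel. Splitting the change into two patches deployed one after the other enforces that same order.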
[13:26:55] (03PS6) 10Cparle: Alert if there's a big change in image-suggestions compared to yesterday [alerts] - 10https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010) [13:28:12] (03PS2) 10Giuseppe Lavagetto: Raise memory limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927670 (https://phabricator.wikimedia.org/T334980) [13:28:14] (03PS2) 10Giuseppe Lavagetto: Drop wmgMemoryLimitParsoid from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927671 [13:29:10] <_joe_> Amir1: we should be there finally. [13:29:11] (03PS1) 10Hashar: fix-staging-perms: set group name from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) [13:29:13] (03PS1) 10Hashar: scap3: stop defaulting deployment_group to 'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) [13:29:15] (03PS1) 10Hashar: fix-staging-perms: set set-group-id on /srv/patches subdirs [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) [13:29:16] coool [13:30:14] (03CR) 10Kamila Součková: [C: 03+1] changeprop: allow match_not in match_config for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [13:30:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P48890 and previous config saved to /var/cache/conftool/dbconfig/20230606-133018-ladsgroup.json [13:30:21] (03CR) 10Ladsgroup: [C: 03+1] Raise memory limit to match parsoid (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927670 (https://phabricator.wikimedia.org/T334980) (owner: 10Giuseppe Lavagetto) [13:30:43] (03PS1) 10Fabfur: icinga: Add fabfur to permession lists [puppet] - 10https://gerrit.wikimedia.org/r/927677 [13:31:27] _joe_: back now [13:31:36] (03CR) 
10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927670 (https://phabricator.wikimedia.org/T334980) (owner: 10Giuseppe Lavagetto) [13:32:01] <_joe_> duesen: I'm merging a couple other patches, then we'll get to yours [13:32:08] ok [13:32:17] (03PS1) 10Ssingh: lvs2013: commission new LVS host (codfw hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/927678 (https://phabricator.wikimedia.org/T326767) [13:32:31] (03CR) 10Ladsgroup: [C: 03+1] Drop wmgMemoryLimitParsoid from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927671 (owner: 10Giuseppe Lavagetto) [13:32:57] (03CR) 10Herron: [C: 03+2] add 0.6.2 ui/package*.json [debs/pyrra] - 10https://gerrit.wikimedia.org/r/927240 (owner: 10Herron) [13:33:00] (03CR) 10Herron: [V: 03+2 C: 03+2] add 0.6.2 ui/package*.json [debs/pyrra] - 10https://gerrit.wikimedia.org/r/927240 (owner: 10Herron) [13:33:03] (03Merged) 10jenkins-bot: Raise memory limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927670 (https://phabricator.wikimedia.org/T334980) (owner: 10Giuseppe Lavagetto) [13:33:12] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[alerts] - 10https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010) (owner: 10Cparle) [13:33:17] (03CR) 10Herron: [V: 03+2 C: 03+2] "cheers thanks for the help on this" [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [13:33:31] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:927670|Raise memory limit to match parsoid (T334980)]] [13:33:34] T334980: Run visual diff testing without RL and other hacks to compare Parsoid rendering against legacy parser rendering - https://phabricator.wikimedia.org/T334980 [13:34:55] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-e1-eqiad.mgmt,lsw1-f[1-2]-eqiad.mgmt with reason: Migrate lsw1-f2-eqiad uplinks to spine [13:35:06] !log oblivian@deploy1002 oblivian: Backport for [[gerrit:927670|Raise memory limit to match parsoid (T334980)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:35:10] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-e1-eqiad.mgmt,lsw1-f[1-2]-eqiad.mgmt with reason: Migrate lsw1-f2-eqiad uplinks to spine [13:35:14] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cc1782a9-2e71-42dd-977e-e3f886320173) set by cmooney@cumin1001 for 0:30:00 on 3 host(s... 
[13:35:41] (03CR) 10Elukey: [C: 03+2] changeprop: allow match_not in match_config for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [13:37:07] (03CR) 10Cparle: [C: 03+2] Alert if there's a big change in image-suggestions compared to yesterday [alerts] - 10https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010) (owner: 10Cparle) [13:38:59] (03Merged) 10jenkins-bot: Alert if there's a big change in image-suggestions compared to yesterday [alerts] - 10https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010) (owner: 10Cparle) [13:39:01] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power supply [13:39:05] (03CR) 10JHathaway: [C: 03+2] bookworm: Change to deb822 format for sources.list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [13:39:31] (03CR) 10Ladsgroup: Fix some mwscript bugs and clean up the style (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 (owner: 10Tim Starling) [13:39:41] (03PS2) 10Ssingh: ntp: do not restart the ntp service on conf change [puppet] - 10https://gerrit.wikimedia.org/r/926598 [13:40:29] <_joe_> duesen: I should be with you in ~ 10 minutes [13:40:41] (03PS3) 10Jbond: P:cloudceph::osd: drop the profile::cloudceph::osd::hosts [puppet] - 10https://gerrit.wikimedia.org/r/927628 [13:40:43] (03CR) 10Effie Mouzeli: ipoid: add helmfile.d config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/921707 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [13:40:52] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41568/console" [puppet] - 10https://gerrit.wikimedia.org/r/926598 (owner: 10Ssingh) [13:40:54] (03CR) 
10Jbond: "thanks for the feedback see inline" [puppet] - 10https://gerrit.wikimedia.org/r/927628 (owner: 10Jbond) [13:41:00] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [13:41:05] (03CR) 10CI reject: [V: 04-1] P:cloudceph::osd: drop the profile::cloudceph::osd::hosts [puppet] - 10https://gerrit.wikimedia.org/r/927628 (owner: 10Jbond) [13:41:13] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [13:41:24] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:927670|Raise memory limit to match parsoid (T334980)]] (duration: 07m 53s) [13:41:27] T334980: Run visual diff testing without RL and other hacks to compare Parsoid rendering against legacy parser rendering - https://phabricator.wikimedia.org/T334980 [13:41:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:42:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927671 (owner: 10Giuseppe Lavagetto) [13:42:39] _joe_: I have a meeting coming up at the top of the hour. I can still monitor, but I can't do the deployment then. 
[13:43:06] (03Merged) 10jenkins-bot: Drop wmgMemoryLimitParsoid from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927671 (owner: 10Giuseppe Lavagetto) [13:43:15] <_joe_> I can do the deployment, I just wanted you to be around to check what happens as soon as we deploy [13:43:32] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:927671|Drop wmgMemoryLimitParsoid from IS.php]] [13:45:08] !log oblivian@deploy1002 oblivian: Backport for [[gerrit:927671|Drop wmgMemoryLimitParsoid from IS.php]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:45:25] _joe_: ok. I will be keeping an eye on the graphs at the bottom of https://grafana-rw.wikimedia.org/d/OxxOv5K4k/ve-backend-dashboard [13:45:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P48891 and previous config saved to /var/cache/conftool/dbconfig/20230606-134524-ladsgroup.json [13:45:54] <_joe_> duesen: yeah hopefully we can get to the deployment before you're in the meeting [13:46:21] (03CR) 10Andrew Bogott: "This breaks rabbit clustering in eqiad, where the nodes need to talk to each other (and other hosts) via public IPs. Reverting." 
[puppet] - 10https://gerrit.wikimedia.org/r/927140 (https://phabricator.wikimedia.org/T338125) (owner: 10Arturo Borrero Gonzalez) [13:46:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:46:39] (03PS1) 10Andrew Bogott: Revert "openstack: rabbitmq: simplify cloud-private-subnet firewalling support" [puppet] - 10https://gerrit.wikimedia.org/r/927690 [13:47:55] (03PS1) 10David Caro: toolforge: add the common config file for clis [puppet] - 10https://gerrit.wikimedia.org/r/927680 [13:48:10] (03PS2) 10Andrew Bogott: Revert "openstack: rabbitmq: simplify cloud-private-subnet firewalling support" [puppet] - 10https://gerrit.wikimedia.org/r/927690 [13:48:11] !Log migrating lsw1-f2-eqiad uplinks to spine switches T322937 [13:48:12] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [13:49:13] <_joe_> jouncebot: now [13:49:13] For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1300) [13:49:13] For the next 0 hour(s) and 10 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1300) [13:49:17] zuul's a little busy atm 👀 [13:49:23] yup ._. 
[13:49:35] * TheresNoTime says, +2ing another core patch [13:49:50] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1023.eqiad.wmnet with OS bullseye [13:49:56] (03PS2) 10Giuseppe Lavagetto: Enable parser cache warming jobs for parsoid on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927236 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [13:49:57] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors: - db... [13:50:20] (03PS1) 10Muehlenhoff: Remove access for andyrussg [puppet] - 10https://gerrit.wikimedia.org/r/927681 [13:50:26] <_joe_> TheresNoTime: behave, I have to merge deployments :D [13:50:53] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:927671|Drop wmgMemoryLimitParsoid from IS.php]] (duration: 07m 21s) [13:51:05] (03PS1) 10Jbond: puppetboard::bookworm: use correct vhost [puppet] - 10https://gerrit.wikimedia.org/r/927682 [13:51:13] (03CR) 10Andrew Bogott: [C: 03+2] Revert "openstack: rabbitmq: simplify cloud-private-subnet firewalling support" [puppet] - 10https://gerrit.wikimedia.org/r/927690 (owner: 10Andrew Bogott) [13:51:29] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bullseye [13:51:32] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for andyrussg [puppet] - 10https://gerrit.wikimedia.org/r/927681 (owner: 10Muehlenhoff) [13:51:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:51:35] 10SRE, 10ops-eqiad, 10DBA, 
10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors: - db... [13:52:03] (03CR) 10Jbond: [C: 03+2] firewall: add basic firewall class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919061 (owner: 10Jbond) [13:52:12] (03CR) 10Jbond: [C: 03+2] puppetboard::bookworm: use correct vhost [puppet] - 10https://gerrit.wikimedia.org/r/927682 (owner: 10Jbond) [13:52:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927236 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [13:52:34] <_joe_> duesen: ^^ [13:53:14] (03Merged) 10jenkins-bot: Enable parser cache warming jobs for parsoid on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927236 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [13:53:40] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:927236|Enable parser cache warming jobs for parsoid on enwiki (T329366)]] [13:53:43] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [13:55:05] !log oblivian@deploy1002 oblivian and daniel: Backport for [[gerrit:927236|Enable parser cache warming jobs for parsoid on enwiki (T329366)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:55:56] _joe_: I'm waiting for the enqueue rate to go up 400%... 
[13:56:07] <_joe_> duesen: a few mins [13:56:14] yea [13:56:33] <_joe_> right now it's being deployed [13:56:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:56:39] PROBLEM - puppetboard.wikimedia.org requires authentication on puppetboard1003 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://puppetboard.wikimedia.org:443/ - 580 bytes in 1.064 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:57:15] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging AndyRussG out of all services on: 1259 hosts [13:57:35] (03PS1) 10JHathaway: lists: Use stock mailman3 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/927684 (https://phabricator.wikimedia.org/T331706) [13:57:55] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927684 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [13:58:24] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/927668 (owner: 10Majavah) [13:58:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AndyRussG out of all services on: 1259 hosts [13:58:35] <_joe_> duesen: numbers should start going up soon-ish [13:58:44] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging AndyRussG out of all services on: 780 hosts [13:59:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AndyRussG out of all services on: 780 hosts [14:00:30] _joe_: going up... 
[14:00:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T336886)', diff saved to https://phabricator.wikimedia.org/P48893 and previous config saved to /var/cache/conftool/dbconfig/20230606-140030-ladsgroup.json [14:00:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:00:34] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [14:00:36] (03PS1) 10Clément Goubert: mediawiki: Test sleeping before draining in envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/927685 (https://phabricator.wikimedia.org/T331609) [14:00:38] (03CR) 10BBlack: "After confirming with jbond about puppet5 ordering stuff, I think I'm changing my mind about what we should be doing in this patch for the" [puppet] - 10https://gerrit.wikimedia.org/r/921095 (https://phabricator.wikimedia.org/T336973) (owner: 10BCornwall) [14:00:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:00:46] <_joe_> duesen: indeed [14:00:50] _joe_: are you keeping an eye on jobrunner load? 
[14:00:51] <_joe_> jouncebot: nowandnext [14:00:51] No deployments scheduled for the next 1 hour(s) and 59 minute(s) [14:00:51] In 1 hour(s) and 59 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1600) [14:00:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T336886)', diff saved to https://phabricator.wikimedia.org/P48894 and previous config saved to /var/cache/conftool/dbconfig/20230606-140051-ladsgroup.json [14:01:01] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:06] (03CR) 10Muehlenhoff: lists: Use stock mailman3 on bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927684 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [14:01:07] <_joe_> duesen: as soon as numbers become ludicrous, yes [14:01:37] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:927236|Enable parser cache warming jobs for parsoid on enwiki (T329366)]] (duration: 07m 57s) [14:01:40] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [14:01:56] (03PS1) 10Effie Mouzeli: thumbor: make POOLCOUNTER_CONFIG_EXPENSIVE configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/927708 (https://phabricator.wikimedia.org/T337649) [14:02:00] <_joe_> duesen: but basically https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?orgId=1&from=now-30m&to=now&viewPanel=54&refresh=1m [14:02:08] (03CR) 10CI reject: [V: 04-1] thumbor: make POOLCOUNTER_CONFIG_EXPENSIVE configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/927708 (https://phabricator.wikimedia.org/T337649) (owner: 10Effie Mouzeli) [14:03:02] <_joe_> duesen: job concurrency is going up [14:03:05] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:32] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1027.eqiad.wmnet with OS bullseye [14:03:42] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye [14:03:46] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1026.eqiad.wmnet with OS bullseye [14:03:47] (03CR) 10Ssingh: dnsbox: bind hc to pdns-recursor and gdnsd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921095 (https://phabricator.wikimedia.org/T336973) (owner: 10BCornwall) [14:03:52] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1026.eqiad.wmnet with OS bullseye [14:04:55] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:18] <_joe_> ok I'm going to merge the last patch and give up for today [14:05:29] (03PS2) 10Giuseppe Lavagetto: Load and enable parsoid everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927116 (https://phabricator.wikimedia.org/T334980) [14:05:52] _joe_: enqueue rate is leveling off at ~120 jobs/sec, up from 50. Much less than I expected. concurrency seems to be stabilizing around 9, up from 4. [14:05:58] _joe_: does that sound acceptable? 
[14:06:03] <_joe_> duesen: totally [14:06:11] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-e1-eqiad.mgmt,lsw1-f[1,3]-eqiad.mgmt with reason: Migrate lsw1-f2-eqiad uplinks to spine [14:06:13] <_joe_> that's why I moved on :D [14:06:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-e1-eqiad.mgmt,lsw1-f[1,3]-eqiad.mgmt with reason: Migrate lsw1-f2-eqiad uplinks to spine [14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:33] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ff119eb0-a98a-450e-a695-debe4cc1b997) set by cmooney@cumin1001 for 0:30:00 on 3 host(s... [14:06:35] <_joe_> duesen: you can probably expect a similar jump when we add dewiki [14:07:11] _joe_: excellent, thank you! I would expect the number of jobs to double again when we add the rest of the large wikis. Does that sound like a decent estimate? enwiki+frwiki = half of "large"? [14:07:38] <_joe_> something less probably [14:07:46] (03CR) 10BBlack: [C: 03+1] "Thanks! This will make things betterer!" 
[puppet] - 10https://gerrit.wikimedia.org/r/926598 (owner: 10Ssingh) [14:07:51] <_joe_> but if we exclude wikidata, I would be ok with moving everything at this point [14:07:51] (03CR) 10JHathaway: "I like this version better, however I do not think it is intuitive that wmflib::dump_params('foo', 'bar') is filtering out the passed in p" [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [14:08:19] (03CR) 10Kamila Součková: [C: 03+1] mediawiki: Test sleeping before draining in envoy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/927685 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [14:08:22] _joe_: yes, we will exclude wikidata and commons. [14:08:28] (03PS1) 10Ssingh: P:wikidough: update location of nrpe::plugin for check [puppet] - 10https://gerrit.wikimedia.org/r/927711 [14:08:31] _joe_: ...so I schedule the rest for tomorrow? [14:08:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927116 (https://phabricator.wikimedia.org/T334980) (owner: 10Giuseppe Lavagetto) [14:08:41] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Test sleeping before draining in envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/927685 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [14:08:46] !log eoghan@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: Upgrading Gitlab to 15.10.8 [14:08:49] <_joe_> duesen: sure [14:09:35] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: Add fabfur to permession lists [puppet] - 10https://gerrit.wikimedia.org/r/927677 (owner: 10Fabfur) [14:09:37] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:57] (03Merged) 10jenkins-bot: Load and enable parsoid 
everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927116 (https://phabricator.wikimedia.org/T334980) (owner: 10Giuseppe Lavagetto) [14:10:02] (03Merged) 10jenkins-bot: mediawiki: Test sleeping before draining in envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/927685 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [14:10:07] (03PS1) 10Effie Mouzeli: thumbor: make POOLCOUNTER_CONFIG_EXPENSIVE configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/927712 (https://phabricator.wikimedia.org/T337649) [14:10:10] !Log migrating lsw1-f3-eqiad uplinks to spine switches T322937 [14:10:13] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [14:10:24] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:927116|Load and enable parsoid everywhere (T334980)]] [14:10:26] T334980: Run visual diff testing without RL and other hacks to compare Parsoid rendering against legacy parser rendering - https://phabricator.wikimedia.org/T334980 [14:12:00] !log oblivian@deploy1002 oblivian: Backport for [[gerrit:927116|Load and enable parsoid everywhere (T334980)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:12:08] (03PS1) 10Hubaishan: Replace underscores with spaces in 4 Arabic sitenames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927713 (https://phabricator.wikimedia.org/T337725) [14:12:10] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927713 (https://phabricator.wikimedia.org/T337725) (owner: 10Hubaishan) [14:12:23] duesen, _joe_: is https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&var-dc=eqiad%20prometheus%2Fk8s&viewPanel=27 related to the above? 
graph looks scary to me, wondering if it's okay or at least known [14:12:24] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41569/console" [puppet] - 10https://gerrit.wikimedia.org/r/927711 (owner: 10Ssingh) [14:13:20] <_joe_> kamila_: that is change-propagation, not changeprop-jobqueue, correct? [14:13:25] kamila_: wow, that does look scary! But that explosion started way before the cache warming job got enabled.... [14:13:56] _joe_: I think so [14:14:14] duesen: right, it's earlier... I don't see anything obviously related in SAL [14:14:16] (03CR) 10JHathaway: [C: 03+1] "look good!" [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: 10Jbond) [14:14:17] <_joe_> kamila_: yes that doesn't look good, can you ask akosiaris to help you? [14:14:29] (03CR) 10Ssingh: [C: 03+2] lvs2013: commission new LVS host (codfw hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/927678 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [14:14:39] <_joe_> I have a patch mid-deployment right now [14:14:50] _joe_: will do, thank you [14:15:12] _joe_: queue wait time is still kreeping upwards [14:15:20] creeping even [14:15:40] !log eoghan@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrading Gitlab to 15.10.8 [14:16:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2013.codfw.wmnet with OS bullseye [14:16:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T336886)', diff saved to https://phabricator.wikimedia.org/P48895 and previous config saved to /var/cache/conftool/dbconfig/20230606-141601-ladsgroup.json [14:16:07] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [14:16:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install 
lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2013.codfw.wmnet with OS bullseye [14:16:22] <_joe_> duesen: it's below 1 second [14:16:31] <_joe_> duesen: when it's over 5 minutes, ping me :D [14:16:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:30] ok [14:19:10] (03CR) 10Jbond: "i think now the preceading patch is merged this one should be good as well" [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond) [14:19:16] (03CR) 10Jbond: interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond) [14:21:40] (03CR) 10Fabfur: [C: 03+2] icinga: Add fabfur to permession lists [puppet] - 10https://gerrit.wikimedia.org/r/927677 (owner: 10Fabfur) [14:25:25] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:927116|Load and enable parsoid everywhere (T334980)]] (duration: 15m 00s) [14:25:28] T334980: Run visual diff testing without RL and other hacks to compare Parsoid rendering against legacy parser rendering - https://phabricator.wikimedia.org/T334980 [14:25:35] PROBLEM - puppetboard.wikimedia.org requires authentication on puppetboard2003 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... 
not found on https://puppetboard.wikimedia.org:443/ - 580 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:27:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/927661 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [14:27:11] (03Abandoned) 10Effie Mouzeli: thumbor: make POOLCOUNTER_CONFIG_EXPENSIVE configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/927708 (https://phabricator.wikimedia.org/T337649) (owner: 10Effie Mouzeli) [14:28:05] (03PS1) 10Jbond: puppetboard::bookworm: add puppetboard-next to sni [puppet] - 10https://gerrit.wikimedia.org/r/927714 [14:28:15] (03CR) 10Jbond: [C: 03+2] puppet-merge: implement Lock out, tag out [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: 10Jbond) [14:29:13] (03PS1) 10Fabfur: hiera: Swap port 80 from varnish to haproxy in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/927715 (https://phabricator.wikimedia.org/T323557) [14:29:49] kamila_: looks like there have been similar spikes over the last month or so [14:31:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P48896 and previous config saved to /var/cache/conftool/dbconfig/20230606-143107-ladsgroup.json [14:31:27] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927715 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [14:31:43] (03CR) 10Jbond: [C: 03+2] puppetboard::bookworm: add puppetboard-next to sni [puppet] - 10https://gerrit.wikimedia.org/r/927714 (owner: 10Jbond) [14:31:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2013.codfw.wmnet with reason: host reimage [14:35:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2013.codfw.wmnet with reason: host reimage 
[14:37:22] (03CR) 10Fabfur: [C: 04-2] "Please do not merge until 07-06-2023" [puppet] - 10https://gerrit.wikimedia.org/r/927715 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [14:37:32] (03CR) 10Vgutierrez: [C: 03+1] varnishkafka: add catch all systemd unit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [14:41:36] duesen: good point, thanks [14:42:48] (03PS1) 10Jbond: idp: add netbox-next to list of services [puppet] - 10https://gerrit.wikimedia.org/r/927719 [14:43:26] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Remove single quotes around values." [deployment-charts] - 10https://gerrit.wikimedia.org/r/927712 (https://phabricator.wikimedia.org/T337649) (owner: 10Effie Mouzeli) [14:44:19] (03CR) 10Effie Mouzeli: thumbor: make POOLCOUNTER_CONFIG_EXPENSIVE configurable (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/927712 (https://phabricator.wikimedia.org/T337649) (owner: 10Effie Mouzeli) [14:45:17] (03PS2) 10Effie Mouzeli: thumbor: make POOLCOUNTER_CONFIG_EXPENSIVE configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/927712 (https://phabricator.wikimedia.org/T337649) [14:45:52] (03PS1) 10JHathaway: apt::repository: fix ensurable [puppet] - 10https://gerrit.wikimedia.org/r/927720 (https://phabricator.wikimedia.org/T330495) [14:46:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P48897 and previous config saved to /var/cache/conftool/dbconfig/20230606-144614-ladsgroup.json [14:46:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/927719 (owner: 10Jbond) [14:46:42] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927720 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [14:47:02] (03CR) 10Muehlenhoff: [C: 03+1] "Oh, good catch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/927720 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [14:49:21] (03CR) 10David Caro: [C: 03+2] toolforge: add the common config file for clis [puppet] - 10https://gerrit.wikimedia.org/r/927680 (owner: 10David Caro) [14:49:43] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:51:17] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: make POOLCOUNTER_CONFIG_EXPENSIVE configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/927712 (https://phabricator.wikimedia.org/T337649) (owner: 10Effie Mouzeli) [14:51:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2013.codfw.wmnet with OS bullseye [14:51:21] (03CR) 10JHathaway: [C: 03+2] apt::repository: fix ensurable [puppet] - 10https://gerrit.wikimedia.org/r/927720 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [14:51:28] (03PS1) 10Clément Goubert: mediawiki: restore sleep after envoy drain [deployment-charts] - 10https://gerrit.wikimedia.org/r/927722 (https://phabricator.wikimedia.org/T331609) [14:51:29] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2013.codfw.wmnet with OS bullseye completed:... 
[14:51:56] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change entries for moved links eqiad row e f switches - cmooney@cumin1001" [14:52:16] (03Merged) 10jenkins-bot: thumbor: make POOLCOUNTER_CONFIG_EXPENSIVE configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/927712 (https://phabricator.wikimedia.org/T337649) (owner: 10Effie Mouzeli) [14:52:19] (03PS1) 10Btullis: Update the maintain-views script to improve the table selection option [puppet] - 10https://gerrit.wikimedia.org/r/927723 (https://phabricator.wikimedia.org/T315426) [14:53:00] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change entries for moved links eqiad row e f switches - cmooney@cumin1001" [14:53:00] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:53:06] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:53:12] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:53:25] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:53:31] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:53:45] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:53:49] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [14:55:40] jouncebot: nowandnext [14:55:40] No deployments scheduled for the next 1 hour(s) and 4 minute(s) [14:55:40] In 1 hour(s) and 4 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1600) [14:56:43] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:57:19] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbproxy1027.eqiad.wmnet with OS bullseye [14:57:28] 
(03PS2) 10Cathal Mooney: Add KCVelaga to analytics-product-users users group [puppet] - 10https://gerrit.wikimedia.org/r/926543 (https://phabricator.wikimedia.org/T337766) [14:57:32] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbproxy1026.eqiad.wmnet with OS bullseye [14:58:02] (03CR) 10Majavah: [C: 04-1] Update the maintain-views script to improve the table selection option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927723 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis) [14:59:09] (03PS1) 10Ssingh: sites.yaml: add new LVS host lvs2013 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/927725 (https://phabricator.wikimedia.org/T326767) [14:59:17] (03CR) 10Cathal Mooney: [C: 03+2] Add KCVelaga to analytics-product-users users group [puppet] - 10https://gerrit.wikimedia.org/r/926543 (https://phabricator.wikimedia.org/T337766) (owner: 10Cathal Mooney) [14:59:31] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: restore sleep after envoy drain [deployment-charts] - 10https://gerrit.wikimedia.org/r/927722 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [15:00:44] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet [15:01:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T336886)', diff saved to https://phabricator.wikimedia.org/P48898 and previous config saved to /var/cache/conftool/dbconfig/20230606-150120-ladsgroup.json [15:01:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [15:01:24] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [15:01:27] (03Merged) 10jenkins-bot: mediawiki: restore sleep after envoy drain [deployment-charts] - 10https://gerrit.wikimedia.org/r/927722 (https://phabricator.wikimedia.org/T331609) (owner: 
10Clément Goubert) [15:01:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [15:01:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T336886)', diff saved to https://phabricator.wikimedia.org/P48899 and previous config saved to /var/cache/conftool/dbconfig/20230606-150141-ladsgroup.json [15:01:55] (03CR) 10Alexandros Kosiaris: [C: 03+1] profile: exclude kubelet production hosts from cadvisor rollout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927198 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [15:02:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:02:19] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:02:22] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:03:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:03:20] !log eoghan@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrading Gitlab to 15.10.8 [15:03:29] effie ? [15:03:58] thumbor is having a 
hard time moving forward in life because of global mem limits [15:04:24] ack'ing page [15:04:38] (03PS2) 10Jbond: wmflib::dump_params: update signature to use optional_repeated_param [puppet] - 10https://gerrit.wikimedia.org/r/927613 [15:04:57] hello [15:05:04] In one hour, gitlab will be unavailable for around 5 minutes to allow for a minor upgrade. [15:05:59] (03PS8) 10Elukey: varnishkafka: add catch all systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825) [15:06:32] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@72d9b87]: (no justification provided) [15:06:34] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:06:42] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@72d9b87]: (no justification provided) (duration: 00m 10s) [15:06:44] (03CR) 10CI reject: [V: 04-1] wmflib::dump_params: update signature to use optional_repeated_param [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [15:06:48] (03CR) 10Jbond: wmflib::dump_params: update signature to use optional_repeated_param (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [15:06:57] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:07:00] (03CR) 10Elukey: varnishkafka: add catch all systemd unit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [15:07:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:07:33] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:07:41] (03CR) 10Jbond: [C: 03+2] idp: add netbox-next to list of services [puppet] - 
10https://gerrit.wikimedia.org/r/927719 (owner: 10Jbond) [15:08:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:08:27] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 121 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [15:08:50] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet [15:10:47] (03CR) 10Ahmon Dancy: "Seems like fix-staging-perms.sh should be converted into a puppet template." [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [15:12:03] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338236 (10phaultfinder) [15:12:13] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.run-puppet-restart-varnish (exit_code=0) rolling custom on A:cp-text_esams and A:cp [15:12:38] (03CR) 10Ahmon Dancy: scap3: stop defaulting deployment_group to 'wikidev' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [15:13:14] (03PS1) 10Effie Mouzeli: admin: bump thumbor namespace's limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/927729 [15:14:24] (03CR) 10Ahmon Dancy: fix-staging-perms: set set-group-id on /srv/patches subdirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [15:16:17] (03PS1) 10Effie Mouzeli: thumbor: add more expensive workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/927730 (https://phabricator.wikimedia.org/T337649) [15:16:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T336886)', diff saved to 
https://phabricator.wikimedia.org/P48900 and previous config saved to /var/cache/conftool/dbconfig/20230606-151633-ladsgroup.json [15:16:37] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [15:18:18] (03PS1) 10Ottomata: Revert "EventStreamConfig - development.network.probe- disable canary events and hadoop ingestion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927692 [15:19:37] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:19:46] jouncebot: nowandnext [15:19:47] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [15:19:47] In 0 hour(s) and 40 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1600) [15:19:52] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:20:02] !log eoghan@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrading Gitlab to 15.10.8 [15:20:05] (03PS7) 10Zabe: Change project logo for Wikimania to Wikimania 2023 version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921610 (https://phabricator.wikimedia.org/T337044) (owner: 10Robertsky) [15:20:36] (03CR) 10Zabe: [C: 03+2] Change project logo for Wikimania to Wikimania 2023 version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921610 (https://phabricator.wikimedia.org/T337044) (owner: 10Robertsky) [15:20:42] (03PS1) 10Jbond: puppetmaster: remove the min submission [puppet] - 10https://gerrit.wikimedia.org/r/927731 (https://phabricator.wikimedia.org/T330490) [15:21:06] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs2013 [15:21:15] !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs2013 [15:21:44] (03PS2) 10Jbond: puppetmaster: remove the min submission [puppet] - 
10https://gerrit.wikimedia.org/r/927731 (https://phabricator.wikimedia.org/T330490) [15:21:58] (03Merged) 10jenkins-bot: Change project logo for Wikimania to Wikimania 2023 version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921610 (https://phabricator.wikimedia.org/T337044) (owner: 10Robertsky) [15:22:15] jouncebot: nowandnext [15:22:16] No deployments scheduled for the next 0 hour(s) and 37 minute(s) [15:22:16] In 0 hour(s) and 37 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1600) [15:22:28] !log zabe@deploy1002 Started scap: Backport for [[gerrit:921610|Change project logo for Wikimania to Wikimania 2023 version (T337044)]] [15:22:31] T337044: Change project logo for Wikimania to Wikimania 2023 version - https://phabricator.wikimedia.org/T337044 [15:22:40] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add new LVS host lvs2013 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/927725 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [15:23:22] <_joe_> uh there is a scap deployment ongoing [15:23:39] <_joe_> sukhe: can you wait before puppet-merging? [15:24:07] sorry, is there a something happening currently? there is no other window. 
[15:24:19] !log zabe@deploy1002 robertsky and zabe: Backport for [[gerrit:921610|Change project logo for Wikimania to Wikimania 2023 version (T337044)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [15:24:29] <_joe_> zabe: I *think* we shouldn't be at risk [15:24:34] _joe_: this is just for the homer change, the host is already up [15:24:45] <_joe_> sukhe: oh ok [15:24:56] <_joe_> sukhe: and in any case, it would be a good test of our fixes [15:25:00] indeed :) [15:25:10] btw, the provisioning was happening during your previous deploy [15:25:21] and I think everything looks(ed) OK [15:25:24] (03CR) 10Jbond: [C: 04-1] "-1: need to update the script name" [puppet] - 10https://gerrit.wikimedia.org/r/927711 (owner: 10Ssingh) [15:26:04] (03CR) 10Jbond: [C: 03+2] puppetmaster: remove the min submission [puppet] - 10https://gerrit.wikimedia.org/r/927731 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:26:34] (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin: bump thumbor namespace's limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/927729 (owner: 10Effie Mouzeli) [15:26:56] !log homer "cr*-codfw*" commit "Gerrit: 927725 add new LVS host lvs2013" : T326767 [15:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:01] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [15:28:27] (03CR) 10Krinkle: eventlogging: remove CentralNoticeTiming (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [15:28:49] (03CR) 10Ssingh: [V: 03+1] "Thank you, nice catch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/927711 (owner: 10Ssingh) [15:28:55] (03PS5) 10Krinkle: eventlogging: remove CentralNoticeTiming [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [15:29:16] (03CR) 10Krinkle: [C: 03+1] eventlogging: remove CentralNoticeTiming (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [15:29:55] (03PS2) 10JHathaway: lists: Use stock mailman3 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/927684 (https://phabricator.wikimedia.org/T331706) [15:29:57] (03CR) 10Krinkle: [C: 03+1] eventlogging: remove CentralNoticeTiming (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [15:30:17] (03CR) 10Effie Mouzeli: [C: 03+2] admin: bump thumbor namespace's limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/927729 (owner: 10Effie Mouzeli) [15:30:20] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927684 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [15:30:31] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:921610|Change project logo for Wikimania to Wikimania 2023 version (T337044)]] (duration: 08m 02s) [15:30:35] T337044: Change project logo for Wikimania to Wikimania 2023 version - https://phabricator.wikimedia.org/T337044 [15:30:41] zabe: all done? [15:30:47] yup [15:30:52] thanks! 
[15:30:54] (03CR) 10CDanis: [C: 03+2] Revert "EventStreamConfig - development.network.probe- disable canary events and hadoop ingestion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927692 (owner: 10Ottomata) [15:31:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cdanis@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927692 (owner: 10Ottomata) [15:31:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P48901 and previous config saved to /var/cache/conftool/dbconfig/20230606-153139-ladsgroup.json [15:31:47] (03Merged) 10jenkins-bot: Revert "EventStreamConfig - development.network.probe- disable canary events and hadoop ingestion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927692 (owner: 10Ottomata) [15:32:13] !log cdanis@deploy1002 Started scap: Backport for [[gerrit:927692|Revert "EventStreamConfig - development.network.probe- disable canary events and hadoop ingestion"]] [15:32:46] !log purge wikimaniawiki logos # T337044 [15:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:19] (03Merged) 10jenkins-bot: admin: bump thumbor namespace's limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/927729 (owner: 10Effie Mouzeli) [15:34:03] !log cdanis@deploy1002 cdanis and otto: Backport for [[gerrit:927692|Revert "EventStreamConfig - development.network.probe- disable canary events and hadoop ingestion"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [15:34:12] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[15:34:14] (03CR) 10JHathaway: [C: 03+1] wmflib::dump_params: update signature to use optional_repeated_param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [15:34:17] (03PS3) 10Ssingh: ntp: do not restart the ntp service on conf change [puppet] - 10https://gerrit.wikimedia.org/r/926598 [15:34:19] (03PS2) 10Ssingh: P:wikidough: update location of nrpe::plugin for check [puppet] - 10https://gerrit.wikimedia.org/r/927711 [15:35:16] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41571/console" [puppet] - 10https://gerrit.wikimedia.org/r/927711 (owner: 10Ssingh) [15:35:22] (03CR) 10JHathaway: lists: Use stock mailman3 on bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927684 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [15:35:31] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:35:56] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:36:08] (03CR) 10Jbond: wmflib::dump_params: update signature to use optional_repeated_param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [15:37:14] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:37:18] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41572/console" [puppet] - 10https://gerrit.wikimedia.org/r/926598 (owner: 10Ssingh) [15:37:53] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10cmooney) @KCVelaga_WMF my apologies for the delay getting this set up for you. You've been added to the requested group now. Please test the acce... 
[15:38:09] (03CR) 10Jbond: vendordata: pin puppet packages to wikimedia repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927664 (https://phabricator.wikimedia.org/T338195) (owner: 10Andrew Bogott) [15:38:43] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:38:48] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/927711 (owner: 10Ssingh) [15:39:28] (03CR) 10JHathaway: [C: 03+1] wmflib::dump_params: update signature to use optional_repeated_param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [15:40:26] !log cdanis@deploy1002 Finished scap: Backport for [[gerrit:927692|Revert "EventStreamConfig - development.network.probe- disable canary events and hadoop ingestion"]] (duration: 08m 13s) [15:45:01] (03PS1) 10Ssingh: hiera: remove lvs2013's bgp-med override [puppet] - 10https://gerrit.wikimedia.org/r/927735 (https://phabricator.wikimedia.org/T326767) [15:46:09] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:46:17] (03CR) 10Dzahn: [C: 03+2] delete gerrit-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/927267 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [15:46:21] (03PS2) 10Dzahn: delete gerrit-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/927267 (https://phabricator.wikimedia.org/T336427) [15:46:23] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[15:46:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P48902 and previous config saved to /var/cache/conftool/dbconfig/20230606-154645-ladsgroup.json [15:47:24] (03PS2) 10Effie Mouzeli: thumbor: add more expensive workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/927730 (https://phabricator.wikimedia.org/T337649) [15:52:34] !log jbond@cumin1001 START - Cookbook sre.postgresql.postgres-init [15:53:17] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:54:57] !log jbond@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [15:56:33] (03PS1) 10Volans: WIP: first scaffolding fo gNMI support [software/homer] (gnmi) - 10https://gerrit.wikimedia.org/r/927736 [16:00:04] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:07] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T336886)', diff saved to https://phabricator.wikimedia.org/P48904 and previous config saved to /var/cache/conftool/dbconfig/20230606-160151-ladsgroup.json [16:01:59] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [16:03:24] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [16:04:14] Reminder: Brief gitlab outage in the next few minutes to complete a minor upgrade. 
[16:04:45] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:06:00] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [16:06:47] (03PS2) 10Btullis: Update the maintain-views script to improve the table selection option [puppet] - 10https://gerrit.wikimedia.org/r/927723 (https://phabricator.wikimedia.org/T315426) [16:07:26] 10SRE, 10SRE-Access-Requests: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (10nskaggs) Yes, approved for wmcs-roots. [16:08:40] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: add more expensive workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/927730 (https://phabricator.wikimedia.org/T337649) (owner: 10Effie Mouzeli) [16:09:31] (03Merged) 10jenkins-bot: thumbor: add more expensive workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/927730 (https://phabricator.wikimedia.org/T337649) (owner: 10Effie Mouzeli) [16:10:24] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/926598 (owner: 10Ssingh) [16:11:53] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 3326 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [16:12:04] 10SRE, 10SRE-Access-Requests: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (10nskaggs) >>! In T337829#8899615, @taavi wrote: > From experience: `wmcs-roots` is basically useless for wiki replica work. This matches my experience and needs to change. I wan... 
[16:12:26] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [16:12:28] !log eoghan@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrading Gitlab to 15.10.8 [16:13:25] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 150419 bytes in 0.779 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [16:14:07] (ProbeDown) firing: (2) Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:14:38] The gitlab alerts are known from the upgrade, it should be complete now. [16:22:00] (03CR) 10Cwhite: [C: 03+2] team-sre: add openapi/swagger alerts [alerts] - 10https://gerrit.wikimedia.org/r/918547 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite) [16:23:01] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:23:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:27] (03Merged) 10jenkins-bot: team-sre: add openapi/swagger alerts [alerts] - 10https://gerrit.wikimedia.org/r/918547 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite) [16:23:52] (03CR) 10Ssingh: [C: 03+2] hiera: remove lvs2013's bgp-med override [puppet] - 10https://gerrit.wikimedia.org/r/927735 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [16:27:23] PROBLEM - Check systemd state on clouddb1013 is CRITICAL: CRITICAL - degraded: The following units failed: 
mariadb.service,prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:47] (03PS1) 10Effie Mouzeli: admin: bump thumbor namespace's limits (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/927739 [16:27:56] !log restart pybal on lvs2013 to remove bgp-med override [16:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:30:26] !log low-traffic/codfw: set routing-options static route 10.2.1.0/24 next-hop 10.192.32.14 [16:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:47] (03CR) 10Effie Mouzeli: [C: 03+2] admin: bump thumbor namespace's limits (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/927739 (owner: 10Effie Mouzeli) [16:33:24] (03Merged) 10jenkins-bot: admin: bump thumbor namespace's limits (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/927739 (owner: 10Effie Mouzeli) [16:34:51] (03PS1) 10Andrew Bogott: vendordata: remove malicious quote marks [puppet] - 10https://gerrit.wikimedia.org/r/927740 (https://phabricator.wikimedia.org/T338195) [16:35:31] (03CR) 10Andrew Bogott: [C: 03+2] vendordata: remove malicious quote marks [puppet] - 10https://gerrit.wikimedia.org/r/927740 (https://phabricator.wikimedia.org/T338195) (owner: 10Andrew Bogott) [16:36:33] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[16:36:59] (03PS1) 10ArielGlenn: for testing of dumps nfs shares, add conf files for other types of dumps [puppet] - 10https://gerrit.wikimedia.org/r/927741 (https://phabricator.wikimedia.org/T325232) [16:36:59] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:37:27] (03CR) 10CI reject: [V: 04-1] for testing of dumps nfs shares, add conf files for other types of dumps [puppet] - 10https://gerrit.wikimedia.org/r/927741 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [16:37:42] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:39:11] (03PS2) 10ArielGlenn: for testing of dumps nfs shares, add conf files for other types of dumps [puppet] - 10https://gerrit.wikimedia.org/r/927741 (https://phabricator.wikimedia.org/T325232) [16:39:24] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:40:05] (03CR) 10jenkins-bot: for testing of dumps nfs shares, add conf files for other types of dumps [puppet] - 10https://gerrit.wikimedia.org/r/927741 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [16:40:45] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:40:51] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:41:04] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [16:42:37] RECOVERY - Check systemd state on clouddb1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:39] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10mpopov) @KCVelaga_WMF: To test, SSH to any of the stat boxes and run the following command: ` $ sudo -u analytics-product kerberos-run-command ana... 
[16:48:22] (03CR) 10Btullis: Decommission analytics1058 from hadoop cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/927667 (https://phabricator.wikimedia.org/T338227) (owner: 10Stevemunene) [16:51:48] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:52:18] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10KFrancis) I'm still waiting for the following info from the user: volunteer's full name, mailing address, and email to process the NDA. [16:57:01] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338236 (10wiki_willy) a:03Jclark-ctr [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1700) [17:01:21] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [17:04:15] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [17:04:49] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [17:05:30] (03PS3) 10ArielGlenn: for testing of dumps nfs shares, add conf files for other types of dumps [puppet] - 10https://gerrit.wikimedia.org/r/927741 (https://phabricator.wikimedia.org/T325232) [17:05:37] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [17:05:49] (03PS1) 10Urbanecm: PersonalizedPraiseLogger: Only include mentee_id if not null [extensions/GrowthExperiments] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/927694 (https://phabricator.wikimedia.org/T338078) [17:06:03] (03PS1) 10Urbanecm: PersonalizedPraiseLogger: Only include mentee_id if not null [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/927695 (https://phabricator.wikimedia.org/T338078) [17:06:25] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog-Deprecated, 10Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - 
https://phabricator.wikimedia.org/T261694 (10Galessandroni) Hi. In Vikidia (an European Wikipedia for kids) we have several problem to enroll it in Wikimedi... [17:13:02] (03CR) 10Hashar: [C: 03+1] "This can be merged for the sole reason that all target hosts had the remote url fixed manually and thus it is going to work." [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy) [17:15:03] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:08] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/927743 [17:15:51] (03PS3) 10Jbond: wmflib: update dump_params and add filter_params [puppet] - 10https://gerrit.wikimedia.org/r/927613 [17:18:12] (03CR) 10CI reject: [V: 04-1] wmflib: update dump_params and add filter_params [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [17:18:24] (03PS4) 10Jbond: wmflib: update dump_params and add filter_params [puppet] - 10https://gerrit.wikimedia.org/r/927613 [17:18:39] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:26] (03PS5) 10Jbond: wmflib: update dump_params and add filter_params [puppet] - 10https://gerrit.wikimedia.org/r/927613 [17:20:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41578/console" [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [17:23:07] (03CR) 10Jbond: "-1: not sure why just yet but from pcc i can see that filter_params is not working" [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [17:25:28] (03CR) 10Jbond: [V: 03+1] wmflib: update dump_params and add filter_params (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 
10Jbond) [17:26:35] (03CR) 10Ssingh: [V: 03+1 C: 03+2] ntp: do not restart the ntp service on conf change [puppet] - 10https://gerrit.wikimedia.org/r/926598 (owner: 10Ssingh) [17:27:28] !log sudo cumin 'P:ntp' 'disable-puppet "testing CR 926598"' [17:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:59] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:20] 10Puppet, 10Cloud-VPS, 10cloud-services-team: puppet package versioning on Bookworm for cloud-vps - https://phabricator.wikimedia.org/T338195 (10Andrew) I now have the proper version installing via cloud-init () but now when puppet is invoked it says: ` root@buildvm-c88281bc-7bb0-4a46-9f97-c9b59ba3b845:~#... [17:33:11] (03CR) 10JHathaway: wmflib: update dump_params and add filter_params (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [17:33:23] (03PS1) 10Ssingh: P:ntp: add python3-pystemd package [puppet] - 10https://gerrit.wikimedia.org/r/927744 [17:33:27] (03PS1) 10Herron: aptrepo: add logrotate bullseye component [puppet] - 10https://gerrit.wikimedia.org/r/927745 (https://phabricator.wikimedia.org/T338127) [17:33:29] (03PS1) 10Herron: mwlog: upgrade logrotate and use ignoreduplicates [puppet] - 10https://gerrit.wikimedia.org/r/927746 (https://phabricator.wikimedia.org/T338127) [17:34:00] 10Puppet, 10Cloud-VPS, 10cloud-services-team: puppet package versioning on Bookworm for cloud-vps - https://phabricator.wikimedia.org/T338195 (10Andrew) > > I think this is a missing dependency in the package. Indeed, installing 'ruby-sorted-set' fixes things. So the puppet package should have a dependenc... 
[17:34:28] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41579/console" [puppet] - 10https://gerrit.wikimedia.org/r/927744 (owner: 10Ssingh) [17:36:12] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:ntp: add python3-pystemd package [puppet] - 10https://gerrit.wikimedia.org/r/927744 (owner: 10Ssingh) [17:36:14] (03CR) 10Herron: "this is meant to contain a bullseye backport of the logrotate version from bookworm for selective installation. Initially targeting the m" [puppet] - 10https://gerrit.wikimedia.org/r/927745 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [17:36:38] (03PS1) 10Andrew Bogott: cloud-vps vendordata: install ruby-sorted-set [puppet] - 10https://gerrit.wikimedia.org/r/927747 (https://phabricator.wikimedia.org/T338195) [17:37:07] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps vendordata: install ruby-sorted-set [puppet] - 10https://gerrit.wikimedia.org/r/927747 (https://phabricator.wikimedia.org/T338195) (owner: 10Andrew Bogott) [17:39:24] !log sudo cumin 'P:ntp' 'enable-puppet "testing CR 926598" && run-puppet-agent' [17:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:05] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:41:23] hmm [17:43:21] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:45:53] (03PS1) 10Ssingh: P:durum: s/Wikidough/Wikimedia DNS [puppet] - 10https://gerrit.wikimedia.org/r/927748 [17:46:49] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41580/console" [puppet] - 10https://gerrit.wikimedia.org/r/927748 (owner: 10Ssingh) [17:46:54] (03PS1) 10AOkoth: vrts: separate install & ugprade vrts scripts [puppet] - 10https://gerrit.wikimedia.org/r/927749 
(https://phabricator.wikimedia.org/T330920) [17:49:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/927684 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [17:50:46] !log disable puppet on A:cp-text to roll out CR 926611 [17:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:35] (03PS2) 10AOkoth: vrts: separate install & ugprade vrts scripts [puppet] - 10https://gerrit.wikimedia.org/r/927749 (https://phabricator.wikimedia.org/T330920) [17:52:01] (03CR) 10Ssingh: [C: 03+2] varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [17:52:50] (03PS6) 10Dzahn: varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) [17:53:57] (03PS6) 10Ottomata: eventlogging: remove CentralNoticeTiming [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [17:54:28] (03CR) 10Ottomata: "Ah" [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [17:54:32] (03CR) 10Ottomata: [C: 03+2] eventlogging: remove CentralNoticeTiming [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [17:54:41] !log enable puppet on cp4037 to test CR 926611 [17:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:45] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/927749/41582/" [puppet] - 10https://gerrit.wikimedia.org/r/927749 (https://phabricator.wikimedia.org/T330920) (owner: 10AOkoth) [17:55:03] (03PS1) 10Ahmon Dancy: git::clone: Ensure that the URL for origin is always up-to-date [puppet] - 10https://gerrit.wikimedia.org/r/927750 
(https://phabricator.wikimedia.org/T290260) [17:55:26] (03CR) 10CI reject: [V: 04-1] git::clone: Ensure that the URL for origin is always up-to-date [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [17:55:41] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:56:14] (03PS2) 10Ahmon Dancy: git::clone: Ensure that the URL for origin is always up-to-date [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) [17:56:34] (03CR) 10Dzahn: "Is the goal though to make puppet change a git remote without deleting existing files?" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [17:57:05] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:58:15] (03PS3) 10AOkoth: vrts: separate install & ugprade vrts scripts [puppet] - 10https://gerrit.wikimedia.org/r/927749 (https://phabricator.wikimedia.org/T330920) [17:59:46] (03CR) 10BCornwall: [C: 04-1] "I do not agree with adding another class abstraction. On the surface de-duplicating code sounds reasonable but cookbooks are already over-" [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [18:00:05] jeena and dduvall: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T1800). 
nyaa~ [18:00:41] (03CR) 10Ahmon Dancy: git::clone: Ensure that the URL for origin is always up-to-date (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [18:01:21] !log re-enable puppet on A:cp-text and force puppet run: T338064 [18:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:25] T338064: decom sitemaps.wikimedia.org - https://phabricator.wikimedia.org/T338064 [18:01:51] !log cumin 'A:cp-text' 'enable-puppet "CR 926611" && run-puppet-agent -q' [18:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:24] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:durum: s/Wikidough/Wikimedia DNS [puppet] - 10https://gerrit.wikimedia.org/r/927748 (owner: 10Ssingh) [18:02:41] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927751 (https://phabricator.wikimedia.org/T337526) [18:02:43] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927751 (https://phabricator.wikimedia.org/T337526) (owner: 10TrainBranchBot) [18:03:03] (03PS7) 10BCornwall: sre.cdn: move common functions to base class [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [18:03:26] (03CR) 10BCornwall: sre.cdn: move common functions to base class (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [18:03:29] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927751 (https://phabricator.wikimedia.org/T337526) (owner: 10TrainBranchBot) [18:04:17] (03PS8) 10BCornwall: sre.cdn: move common functions to base class [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [18:04:41] (03CR) 10BCornwall: sre.cdn: move common functions to base class (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 
10Jbond) [18:05:13] (03CR) 10BCornwall: [C: 04-1] sre.cdn: move common functions to base class [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [18:06:30] (03CR) 10Dzahn: "Thank you Sukhbir for deploying! I confirm URLs like https://id.wikipedia.org/sitemap are now 404s, as predicted by Brandon." [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [18:08:21] (03PS1) 10Slyngshede: C:idm switch to read/write user for LDAP access. [puppet] - 10https://gerrit.wikimedia.org/r/927752 [18:09:34] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41586/console" [puppet] - 10https://gerrit.wikimedia.org/r/927752 (owner: 10Slyngshede) [18:09:56] (03CR) 10Ssingh: [C: 03+1] trafficserver: remove map for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926605 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [18:10:16] (03CR) 10Ahmon Dancy: "Some PCC results: https://puppet-compiler.wmflabs.org/output/927750/41585/" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [18:10:19] (03CR) 10Dzahn: [C: 03+2] trafficserver: remove map for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926605 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [18:10:52] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.12 refs T337526 [18:10:53] !log disabling https://sitemaps.wikimedia.org - T338064 T332101 [18:10:55] T337526: 1.41.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T337526 [18:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:58] T338064: decom sitemaps.wikimedia.org - https://phabricator.wikimedia.org/T338064 [18:10:59] T332101: determine whether https://sitemaps.wikimedia.org still serves a purpose - 
https://phabricator.wikimedia.org/T332101 [18:11:41] (03CR) 10JHathaway: [C: 03+2] lists: Use stock mailman3 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/927684 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [18:11:45] (03CR) 10Ahmon Dancy: "For the record, dealing with changes to the branch will be handled separately." [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [18:12:08] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (gerrit1001, ...), Fresh: 119 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [18:12:24] (03CR) 10Dzahn: "won't that result in undefined state of the files on disk?" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [18:13:30] (Device rebooted) firing: Alert for device ps1-c3-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [18:18:30] (Device rebooted) resolved: Device ps1-c3-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [18:19:30] (Device rebooted) firing: Alert for device ps1-c2-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [18:19:52] (03CR) 10Dzahn: [C: 03+2] "The index page is still cached for now but for example https://sitemaps.wikimedia.org/it.wikipedia.org/ is also a 404 now." 
[puppet] - 10https://gerrit.wikimedia.org/r/926605 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [18:24:29] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:wikidough: update location of nrpe::plugin for check [puppet] - 10https://gerrit.wikimedia.org/r/927711 (owner: 10Ssingh) [18:24:30] (Device rebooted) resolved: Device ps1-c2-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [18:29:40] PROBLEM - PHP opcache health on mw1461 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:29:44] (03CR) 10Ahmon Dancy: git::clone: Ensure that the URL for origin is always up-to-date (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [18:37:06] PROBLEM - puppet last run on planet2002 is CRITICAL: CRITICAL: Puppet has been disabled for 604969 seconds, message: dz - dzahn, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:37:21] mutante: ^ [18:37:28] (03PS13) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [18:37:54] RhinosF1: oh, weird.. I completey forgot. thanks [18:38:42] fixed [18:39:03] Notice: /Stage[main]/Ferm/File[/etc/ferm/conf.d/10_bastion-ssh]/ensure: removed [18:39:06] really... [18:39:47] (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [18:40:10] PROBLEM - PHP opcache health on mw1467 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.4. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:40:30] (03CR) 10Ahmon Dancy: git::clone: Ensure that the URL for origin is always up-to-date (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [18:42:36] RECOVERY - puppet last run on planet2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:43:30] (Device rebooted) firing: Alert for device ps1-d5-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [18:45:18] (03PS4) 10AOkoth: vrts: separate install & ugprade vrts scripts [puppet] - 10https://gerrit.wikimedia.org/r/927749 (https://phabricator.wikimedia.org/T330920) [18:45:22] (03CR) 10Dzahn: "Generally makes sense to me to separate "first time install" from "upgrade"." [puppet] - 10https://gerrit.wikimedia.org/r/927749 (https://phabricator.wikimedia.org/T330920) (owner: 10AOkoth) [18:45:45] (03PS15) 10Krinkle: webperf: Fix /excimer/ POST restriction [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) [18:47:13] (03PS16) 10Krinkle: webperf: Fix /excimer/ POST restriction [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) [18:47:43] (03CR) 10Dzahn: "your other option is to keep one script for both actions but add a parameter, so it's like "install_vrts.sh" vs "install_vrts.sh --upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/927749 (https://phabricator.wikimedia.org/T330920) (owner: 10AOkoth) [18:48:02] (03CR) 10Krinkle: webperf: Fix /excimer/ POST restriction (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [18:48:28] (03PS1) 10Daniel Kinzler: Enable cache warming jobs for parsoid per default. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/927758 (https://phabricator.wikimedia.org/T329366) [18:48:30] (Device rebooted) firing: (2) Alert for device ps1-b3-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [18:49:56] (03PS17) 10Krinkle: webperf: Fix /excimer/ POST restriction [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) [18:51:26] (03CR) 10Krinkle: [C: 03+1] "Ready to go. Verified on Beta." [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [18:53:30] (Device rebooted) firing: (2) Alert for device ps1-a8-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [18:58:30] (Device rebooted) resolved: (2) Device ps1-a8-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [19:00:57] (03PS1) 10Ladsgroup: poolcounter: Make it release before closing connection [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927767 (https://phabricator.wikimedia.org/T337649) [19:02:52] (03PS5) 10AOkoth: vrts: separate install & ugprade vrts scripts [puppet] - 10https://gerrit.wikimedia.org/r/927749 (https://phabricator.wikimedia.org/T330920) [19:03:28] (03PS4) 10Jforrester: Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913018 (https://phabricator.wikimedia.org/T319064) (owner: 10Catrope) [19:03:30] (Device rebooted) firing: Alert for device ps1-a1-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [19:03:36] (03CR) 10AOkoth: vrts: separate install & ugprade vrts scripts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/927749 (https://phabricator.wikimedia.org/T330920) (owner: 10AOkoth) [19:03:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 
on db1144.eqiad.wmnet with reason: Maintenance [19:03:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance [19:04:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T336886)', diff saved to https://phabricator.wikimedia.org/P48906 and previous config saved to /var/cache/conftool/dbconfig/20230606-190402-ladsgroup.json [19:04:05] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [19:04:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [19:04:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [19:04:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:04:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:04:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T336886)', diff saved to https://phabricator.wikimedia.org/P48907 and previous config saved to /var/cache/conftool/dbconfig/20230606-190420-ladsgroup.json [19:06:50] (03CR) 10CI reject: [V: 04-1] poolcounter: Make it release before closing connection [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927767 (https://phabricator.wikimedia.org/T337649) (owner: 10Ladsgroup) [19:08:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T336886)', diff saved to https://phabricator.wikimedia.org/P48908 and previous config saved to /var/cache/conftool/dbconfig/20230606-190802-ladsgroup.json [19:08:30] (Device rebooted) resolved: Device 
ps1-a1-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [19:12:36] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 121 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [19:14:25] jeena: I think we have a train blocker [19:15:26] (03PS2) 10Ladsgroup: poolcounter: Make it release before closing connection [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927767 (https://phabricator.wikimedia.org/T337649) [19:15:36] hmm what have I missed? I didn't see any glaring errors [19:15:50] `Utils:94 PHP Notice: Undefined offset: 0` ? [19:16:31] oh, does that warrant rolling back or just blocking? I didn't think the error count was that high [19:16:37] T338264 [19:16:38] T338264: Caught exception of type Flow\Exception\DataModelException when trying to submit on MediaWiki.org - https://phabricator.wikimedia.org/T338264 [19:17:27] thanks reedy [19:18:09] I think the undefined offset error may be linked (similar histogram in logspam-watch). [19:18:20] (both .12) [19:23:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P48909 and previous config saved to /var/cache/conftool/dbconfig/20230606-192308-ladsgroup.json [19:24:48] (03CR) 10Dzahn: ""This all assumes that the old and new URLs refer to the same repo (i.e., they have shared ancestry), not a weird situation where the new " [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [19:26:27] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10MatthewVernon) >>! In T337121#8872212, @Nux wrote: > Sent an e-mail signed with my PGP, fingerprint: `86C84A9B865FDD51FCFB12D2EE3F8013A0DD3792`. and >>! In T337121#8906818, @KFrancis wrote: > I'm sti... 
[19:26:59] (03CR) 10Dzahn: [C: 03+2] remove sitemaps.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/926613 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [19:27:05] (03PS2) 10Dzahn: remove sitemaps.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/926613 (https://phabricator.wikimedia.org/T338064) [19:30:08] (03PS3) 10Dzahn: miscweb: remove sitemaps profile from role [puppet] - 10https://gerrit.wikimedia.org/r/926606 (https://phabricator.wikimedia.org/T338064) [19:32:07] (03PS6) 10AOkoth: vrts: separate install & ugprade vrts scripts [puppet] - 10https://gerrit.wikimedia.org/r/927749 (https://phabricator.wikimedia.org/T330920) [19:36:44] (03CR) 10Dzahn: [C: 03+1] "seems ok to me. nitpick: technically "sudo", "cp", "ln", "systemctl" etc all could also have full path to be consistent. But that doesn't " [puppet] - 10https://gerrit.wikimedia.org/r/927749 (https://phabricator.wikimedia.org/T330920) (owner: 10AOkoth) [19:36:54] (03CR) 10AOkoth: [C: 03+2] vrts: separate install & ugprade vrts scripts [puppet] - 10https://gerrit.wikimedia.org/r/927749 (https://phabricator.wikimedia.org/T330920) (owner: 10AOkoth) [19:38:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P48910 and previous config saved to /var/cache/conftool/dbconfig/20230606-193814-ladsgroup.json [19:41:03] (03CR) 10Dzahn: [C: 03+2] "deleted from DNS" [puppet] - 10https://gerrit.wikimedia.org/r/926606 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [19:53:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T336886)', diff saved to https://phabricator.wikimedia.org/P48911 and previous config saved to /var/cache/conftool/dbconfig/20230606-195320-ladsgroup.json [19:53:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [19:53:24] T336886: Add user_is_temp field to 
the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [19:53:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [19:55:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [19:55:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [19:55:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:55:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:55:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T336886)', diff saved to https://phabricator.wikimedia.org/P48912 and previous config saved to /var/cache/conftool/dbconfig/20230606-195557-ladsgroup.json [19:59:30] (Device rebooted) firing: Alert for device ps1-c8-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [19:59:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T336886)', diff saved to https://phabricator.wikimedia.org/P48913 and previous config saved to /var/cache/conftool/dbconfig/20230606-195948-ladsgroup.json [19:59:52] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T2000). 
[20:00:05] Urbanecm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:04:30] (Device rebooted) resolved: Device ps1-c8-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [20:04:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T336886)', diff saved to https://phabricator.wikimedia.org/P48914 and previous config saved to /var/cache/conftool/dbconfig/20230606-200444-ladsgroup.json [20:08:06] Oh, it's UTC late window time already. Looks like I'm the only one in queue. I'll proceed. [20:08:19] (03CR) 10Urbanecm: [C: 03+2] PersonalizedPraiseLogger: Only include mentee_id if not null [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/927695 (https://phabricator.wikimedia.org/T338078) (owner: 10Urbanecm) [20:08:24] (03CR) 10Urbanecm: [C: 03+2] PersonalizedPraiseLogger: Only include mentee_id if not null [extensions/GrowthExperiments] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/927694 (https://phabricator.wikimedia.org/T338078) (owner: 10Urbanecm) [20:08:40] (03PS1) 10Cwhite: opensearch: clean up hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/927769 (https://phabricator.wikimedia.org/T333732) [20:09:15] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1026.eqiad.wmnet with OS bullseye [20:09:22] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1026.eqiad.wmnet with OS bullseye [20:09:25] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1027.eqiad.wmnet with OS bullseye [20:09:31] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye [20:09:40] 10SRE, 10Traffic, 10Wikibase Product Platform: Beta wikidata rejects PATCH requests - https://phabricator.wikimedia.org/T336659 (10WMDE-leszek) 05Open→03Resolved a:03WMDE-leszek This seems to be resolved. [20:13:59] (03CR) 10Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/925120/41588/" [puppet] - 10https://gerrit.wikimedia.org/r/925120 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite) [20:14:07] (ProbeDown) firing: (2) Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:14:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P48915 and previous config saved to /var/cache/conftool/dbconfig/20230606-201454-ladsgroup.json [20:16:53] !log miscweb1003, miscweb2003 - rm -rf /srv/org/wikimedia/sitemaps after removing httpd virtual host T338064 [20:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:57] T338064: decom sitemaps.wikimedia.org - https://phabricator.wikimedia.org/T338064 [20:19:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved 
to https://phabricator.wikimedia.org/P48916 and previous config saved to /var/cache/conftool/dbconfig/20230606-201950-ladsgroup.json [20:28:17] (03CR) 10Hashar: [C: 03+1] "I think that one is good to go and it does address the issue I have described at https://gerrit.wikimedia.org/r/c/operations/puppet/+/9250" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [20:28:48] 10Puppet, 10Cloud-VPS, 10cloud-services-team: puppet package versioning on Bookworm for cloud-vps - https://phabricator.wikimedia.org/T338195 (10bd808) > 3. Upgrade puppetmasters to version 7, which should be backwards-compatible with existing clients. That's part of {T330490} [20:29:50] (03PS1) 10EoghanGaffney: gitlab: Change regex match for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/927764 (https://phabricator.wikimedia.org/T338240) [20:30:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P48917 and previous config saved to /var/cache/conftool/dbconfig/20230606-203000-ladsgroup.json [20:32:04] (03Merged) 10jenkins-bot: PersonalizedPraiseLogger: Only include mentee_id if not null [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/927695 (https://phabricator.wikimedia.org/T338078) (owner: 10Urbanecm) [20:33:14] (03Merged) 10jenkins-bot: PersonalizedPraiseLogger: Only include mentee_id if not null [extensions/GrowthExperiments] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/927694 (https://phabricator.wikimedia.org/T338078) (owner: 10Urbanecm) [20:34:02] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:927695|PersonalizedPraiseLogger: Only include mentee_id if not null (T338078)]], [[gerrit:927694|PersonalizedPraiseLogger: Only include mentee_id if not null (T338078)]] [20:34:05] T338078: eventgate_validation_error - 'mentee_id' should be integer - 
https://phabricator.wikimedia.org/T338078 [20:34:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P48919 and previous config saved to /var/cache/conftool/dbconfig/20230606-203456-ladsgroup.json [20:35:37] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:927695|PersonalizedPraiseLogger: Only include mentee_id if not null (T338078)]], [[gerrit:927694|PersonalizedPraiseLogger: Only include mentee_id if not null (T338078)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:39:50] (03CR) 10Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/927769/41591/" [puppet] - 10https://gerrit.wikimedia.org/r/927769 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [20:41:26] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:927695|PersonalizedPraiseLogger: Only include mentee_id if not null (T338078)]], [[gerrit:927694|PersonalizedPraiseLogger: Only include mentee_id if not null (T338078)]] (duration: 07m 23s) [20:41:30] T338078: eventgate_validation_error - 'mentee_id' should be integer - https://phabricator.wikimedia.org/T338078 [20:41:31] * urbanecm done [20:41:40] (03CR) 10Cwhite: [C: 03+1] "Let's try it!" 
[alerts] - 10https://gerrit.wikimedia.org/r/927626 (owner: 10Filippo Giunchedi) [20:45:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T336886)', diff saved to https://phabricator.wikimedia.org/P48920 and previous config saved to /var/cache/conftool/dbconfig/20230606-204506-ladsgroup.json [20:45:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1183.eqiad.wmnet with reason: Maintenance [20:45:10] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [20:45:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1183.eqiad.wmnet with reason: Maintenance [20:45:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1183 (T336886)', diff saved to https://phabricator.wikimedia.org/P48921 and previous config saved to /var/cache/conftool/dbconfig/20230606-204527-ladsgroup.json [20:46:00] (03PS1) 10Cwhite: opensearch: disable security plugin on codfw [puppet] - 10https://gerrit.wikimedia.org/r/927771 (https://phabricator.wikimedia.org/T333732) [20:46:02] (03PS1) 10Cwhite: opensearch: disable security plugin for both clusters [puppet] - 10https://gerrit.wikimedia.org/r/927772 (https://phabricator.wikimedia.org/T333732) [20:49:02] 10Puppet, 10Release-Engineering-Team: Puppet git::clone probably does not need `umask` parameter - https://phabricator.wikimedia.org/T338277 (10hashar) [20:50:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T336886)', diff saved to https://phabricator.wikimedia.org/P48922 and previous config saved to /var/cache/conftool/dbconfig/20230606-205002-ladsgroup.json [20:50:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [20:50:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with 
reason: Maintenance [20:51:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T336886)', diff saved to https://phabricator.wikimedia.org/P48923 and previous config saved to /var/cache/conftool/dbconfig/20230606-205123-ladsgroup.json [20:51:27] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [20:51:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [20:52:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [20:52:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T336886)', diff saved to https://phabricator.wikimedia.org/P48924 and previous config saved to /var/cache/conftool/dbconfig/20230606-205206-ladsgroup.json [20:55:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T336886)', diff saved to https://phabricator.wikimedia.org/P48925 and previous config saved to /var/cache/conftool/dbconfig/20230606-205530-ladsgroup.json [20:57:01] (03PS1) 10JHathaway: DO NOT MERGE: Ensure profile::apt is applied first [puppet] - 10https://gerrit.wikimedia.org/r/927788 (https://phabricator.wikimedia.org/T338279) [20:59:38] (03PS1) 10JHathaway: DO NOT MERGE: apply profile::apt in separate stage [puppet] - 10https://gerrit.wikimedia.org/r/927789 (https://phabricator.wikimedia.org/T338279) [21:01:40] (03CR) 10CI reject: [V: 04-1] DO NOT MERGE: apply profile::apt in separate stage [puppet] - 10https://gerrit.wikimedia.org/r/927789 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [21:02:27] (03PS1) 10Andrew Bogott: cloud-vps VMs: don't install py2 mwopenstackclients on py3 distros [puppet] - 10https://gerrit.wikimedia.org/r/927791 (https://phabricator.wikimedia.org/T338188) [21:03:00] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=93) for host dbproxy1026.eqiad.wmnet with OS bullseye [21:03:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbproxy1027.eqiad.wmnet with OS bullseye [21:06:24] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps VMs: don't install py2 mwopenstackclients on py3 distros [puppet] - 10https://gerrit.wikimedia.org/r/927791 (https://phabricator.wikimedia.org/T338188) (owner: 10Andrew Bogott) [21:06:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P48926 and previous config saved to /var/cache/conftool/dbconfig/20230606-210629-ladsgroup.json [21:07:08] (03PS1) 10Andrew Bogott: apt::repository: remove conflicting .list files from bookworm /etc/apt [puppet] - 10https://gerrit.wikimedia.org/r/927795 (https://phabricator.wikimedia.org/T338188) [21:08:48] (03PS2) 10Andrew Bogott: apt::repository: remove conflicting .list files from bookworm /etc/apt [puppet] - 10https://gerrit.wikimedia.org/r/927795 (https://phabricator.wikimedia.org/T338188) [21:09:15] (03PS1) 10JHathaway: puppet7: re-add mailalias core [puppet] - 10https://gerrit.wikimedia.org/r/927796 (https://phabricator.wikimedia.org/T330490) [21:10:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P48927 and previous config saved to /var/cache/conftool/dbconfig/20230606-211036-ladsgroup.json [21:13:11] (03CR) 10JHathaway: [C: 03+1] "The other option would be to purge unmanaged files in /etc/apt/sources.list.d, but if that is not an option for the cloud then this seems " [puppet] - 10https://gerrit.wikimedia.org/r/927795 (https://phabricator.wikimedia.org/T338188) (owner: 10Andrew Bogott) [21:14:42] 10SRE, 10DNS: Additional DNS entry for WikiLearn - https://phabricator.wikimedia.org/T338280 (10Ijon) [21:17:03] (03CR) 10Andrew Bogott: apt::repository: remove conflicting .list files from bookworm /etc/apt 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927795 (https://phabricator.wikimedia.org/T338188) (owner: 10Andrew Bogott) [21:20:48] (03PS1) 10Dzahn: add app.dev.learn.wiki pointing to AWS [dns] - 10https://gerrit.wikimedia.org/r/927798 (https://phabricator.wikimedia.org/T335080) [21:21:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P48928 and previous config saved to /var/cache/conftool/dbconfig/20230606-212135-ladsgroup.json [21:25:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P48929 and previous config saved to /var/cache/conftool/dbconfig/20230606-212542-ladsgroup.json [21:26:18] (03CR) 10Dzahn: [C: 03+1] gitlab: Change regex match for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/927764 (https://phabricator.wikimedia.org/T338240) (owner: 10EoghanGaffney) [21:31:22] (03CR) 10EoghanGaffney: [C: 03+2] gitlab: Change regex match for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/927764 (https://phabricator.wikimedia.org/T338240) (owner: 10EoghanGaffney) [21:33:57] (03PS2) 10Dzahn: add app.dev.learn.wiki pointing to AWS [dns] - 10https://gerrit.wikimedia.org/r/927798 (https://phabricator.wikimedia.org/T338280) [21:36:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T336886)', diff saved to https://phabricator.wikimedia.org/P48930 and previous config saved to /var/cache/conftool/dbconfig/20230606-213641-ladsgroup.json [21:36:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [21:36:45] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [21:36:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: 
Maintenance [21:37:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T336886)', diff saved to https://phabricator.wikimedia.org/P48931 and previous config saved to /var/cache/conftool/dbconfig/20230606-213702-ladsgroup.json [21:39:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T336886)', diff saved to https://phabricator.wikimedia.org/P48932 and previous config saved to /var/cache/conftool/dbconfig/20230606-213954-ladsgroup.json [21:40:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T336886)', diff saved to https://phabricator.wikimedia.org/P48933 and previous config saved to /var/cache/conftool/dbconfig/20230606-214048-ladsgroup.json [21:40:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [21:41:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [21:41:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T336886)', diff saved to https://phabricator.wikimedia.org/P48934 and previous config saved to /var/cache/conftool/dbconfig/20230606-214109-ladsgroup.json [21:44:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T336886)', diff saved to https://phabricator.wikimedia.org/P48935 and previous config saved to /var/cache/conftool/dbconfig/20230606-214432-ladsgroup.json [21:44:36] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [21:44:43] (03CR) 10Muehlenhoff: apt::repository: remove conflicting .list files from bookworm /etc/apt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927795 (https://phabricator.wikimedia.org/T338188) (owner: 10Andrew Bogott) [21:48:07] (03PS1) 10David Martin: Add wikifunctions.ui to wgEventLoggingStreamNames in beta config [mediawiki-config] - 
https://gerrit.wikimedia.org/r/927801 (https://phabricator.wikimedia.org/T336722) [21:49:07] (ProbeDown) resolved: (2) Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:50:24] (CR) Volans: [C: +1] "No objections" [cookbooks] - https://gerrit.wikimedia.org/r/926493 (owner: Ayounsi) [21:51:48] (CR) David Martin: "So far I didn't see anything from our new stream in https://stream-beta.wmflabs.org/v2/ui, and it seems very likely this declaration is ne" [mediawiki-config] - https://gerrit.wikimedia.org/r/927801 (https://phabricator.wikimedia.org/T336722) (owner: David Martin) [21:52:40] (CR) Volans: "I'm not sure if this is needed when using the deploy python code cookbook, and if it is it should be done at the uwsgi module level instea" [puppet] - https://gerrit.wikimedia.org/r/926454 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff) [21:55:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P48936 and previous config saved to /var/cache/conftool/dbconfig/20230606-215501-ladsgroup.json [21:59:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P48937 and previous config saved to /var/cache/conftool/dbconfig/20230606-215938-ladsgroup.json [22:10:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P48938 and previous config saved to /var/cache/conftool/dbconfig/20230606-221007-ladsgroup.json [22:14:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to 
https://phabricator.wikimedia.org/P48939 and previous config saved to /var/cache/conftool/dbconfig/20230606-221444-ladsgroup.json [22:17:05] jouncebot: nowandnext [22:17:05] No deployments scheduled for the next 7 hour(s) and 42 minute(s) [22:17:05] In 7 hour(s) and 42 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T0600) [22:17:56] (03PS2) 10Zabe: Stop writing to revision_comment_temp everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927615 (https://phabricator.wikimedia.org/T299954) [22:18:11] (03CR) 10Zabe: [C: 03+2] Stop writing to revision_comment_temp everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927615 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [22:18:58] (03Merged) 10jenkins-bot: Stop writing to revision_comment_temp everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927615 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [22:19:41] !log zabe@deploy1002 Started scap: Backport for [[gerrit:927615|Stop writing to revision_comment_temp everywhere (T299954)]] [22:19:44] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [22:21:32] !log zabe@deploy1002 zabe: Backport for [[gerrit:927615|Stop writing to revision_comment_temp everywhere (T299954)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [22:23:40] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:24:26] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:25:14] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T336886)', diff saved to https://phabricator.wikimedia.org/P48940 and previous config saved to /var/cache/conftool/dbconfig/20230606-222513-ladsgroup.json [22:25:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [22:25:17] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [22:25:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [22:25:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T336886)', diff saved to https://phabricator.wikimedia.org/P48941 and previous config saved to /var/cache/conftool/dbconfig/20230606-222534-ladsgroup.json [22:27:14] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:927615|Stop writing to revision_comment_temp everywhere (T299954)]] (duration: 07m 33s) [22:27:17] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [22:28:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T336886)', diff saved to https://phabricator.wikimedia.org/P48942 and previous config saved to /var/cache/conftool/dbconfig/20230606-222828-ladsgroup.json [22:29:47] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bullseye [22:29:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T336886)', diff saved to https://phabricator.wikimedia.org/P48943 and previous config saved to /var/cache/conftool/dbconfig/20230606-222950-ladsgroup.json [22:29:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [22:29:53] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install 
dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye [22:30:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [22:30:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T336886)', diff saved to https://phabricator.wikimedia.org/P48944 and previous config saved to /var/cache/conftool/dbconfig/20230606-223011-ladsgroup.json [22:33:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T336886)', diff saved to https://phabricator.wikimedia.org/P48945 and previous config saved to /var/cache/conftool/dbconfig/20230606-223335-ladsgroup.json [22:33:39] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [22:35:12] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:36:00] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:41:39] (03PS1) 10Volans: cookbooks: improve test-cookbook binary [puppet] - 10https://gerrit.wikimedia.org/r/927803 [22:43:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P48946 and previous config saved to /var/cache/conftool/dbconfig/20230606-224334-ladsgroup.json [22:47:34] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:48:18] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:48:41] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [22:48:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P48947 and previous config saved to /var/cache/conftool/dbconfig/20230606-224841-ladsgroup.json [22:48:43] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:50:56] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - pt1979@cumin2002" [22:51:18] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:51:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - pt1979@cumin2002" [22:51:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:52:06] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:58:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P48948 and previous config saved to /var/cache/conftool/dbconfig/20230606-225841-ladsgroup.json [22:59:24] (03CR) 10Cwhite: "PCC OK: 
https://puppet-compiler.wmflabs.org/output/927771/41592/" [puppet] - 10https://gerrit.wikimedia.org/r/927771 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [23:03:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P48949 and previous config saved to /var/cache/conftool/dbconfig/20230606-230347-ladsgroup.json [23:04:45] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10Nux) @KFrancis I sent you an email. Please let me know if it didn't arrive or something else would be needed. [23:07:52] (03PS14) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [23:10:01] (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [23:13:13] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:13:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T336886)', diff saved to https://phabricator.wikimedia.org/P48950 and previous config saved to /var/cache/conftool/dbconfig/20230606-231347-ladsgroup.json [23:13:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance [23:13:50] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [23:14:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance [23:14:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1210 (T336886)', diff saved to https://phabricator.wikimedia.org/P48951 and previous config saved to /var/cache/conftool/dbconfig/20230606-231408-ladsgroup.json [23:15:23] 
!log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - pt1979@cumin2002" [23:16:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - pt1979@cumin2002" [23:16:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:16:22] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet [23:16:56] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-a1-codfw.mgmt.codfw.wmnet [23:16:57] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:17:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T336886)', diff saved to https://phabricator.wikimedia.org/P48952 and previous config saved to /var/cache/conftool/dbconfig/20230606-231758-ladsgroup.json [23:18:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T336886)', diff saved to https://phabricator.wikimedia.org/P48953 and previous config saved to /var/cache/conftool/dbconfig/20230606-231853-ladsgroup.json [23:18:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance [23:18:56] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [23:19:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance [23:19:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T336886)', diff saved to https://phabricator.wikimedia.org/P48954 and previous config saved to /var/cache/conftool/dbconfig/20230606-231913-ladsgroup.json [23:19:15] !log pt1979@cumin2002 
START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a1-codfw - pt1979@cumin2002" [23:19:48] (03CR) 10BCornwall: Create cookbook to upgrade Apache Traffic Server (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [23:20:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a1-codfw - pt1979@cumin2002" [23:20:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:22:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T336886)', diff saved to https://phabricator.wikimedia.org/P48955 and previous config saved to /var/cache/conftool/dbconfig/20230606-232235-ladsgroup.json [23:25:38] (03PS15) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [23:26:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bullseye [23:26:11] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors: - db... 
[23:27:50] (CR) CI reject: [V: -1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: BCornwall) [23:29:59] (PS16) BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [23:33:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P48958 and previous config saved to /var/cache/conftool/dbconfig/20230606-233304-ladsgroup.json [23:34:53] (PS3) Ahmon Dancy: git::clone: Ensure that the URL for origin is always up-to-date [puppet] - https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) [23:35:50] (CR) Ahmon Dancy: git::clone: Ensure that the URL for origin is always up-to-date (2 comments) [puppet] - https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: Ahmon Dancy) [23:36:15] SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (Jclark-ctr) @Papaul having issues imaging servers dbproxy1022,dbproxy1023,dbproxy1026,dbproxy1027 The above exception was the direct cause of the following e... 
[23:37:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P48959 and previous config saved to /var/cache/conftool/dbconfig/20230606-233742-ladsgroup.json [23:40:06] (CR) Tim Starling: [C: +1] webperf: Fix /excimer/ POST restriction [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle) [23:42:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a1-codfw.mgmt.codfw.wmnet [23:45:37] (CR) Tim Starling: [C: +1] Profiler: Replace copy of ExcimerClient.php with git submodule [mediawiki-config] - https://gerrit.wikimedia.org/r/926574 (https://phabricator.wikimedia.org/T337873) (owner: Krinkle) [23:48:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P48960 and previous config saved to /var/cache/conftool/dbconfig/20230606-234810-ladsgroup.json [23:51:21] (PS4) Catrope: Link to translations of CC BY-SA 4.0 where possible [mediawiki-config] - https://gerrit.wikimedia.org/r/913019 (https://phabricator.wikimedia.org/T319064) [23:52:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P48961 and previous config saved to /var/cache/conftool/dbconfig/20230606-235248-ladsgroup.json