[00:04:29] (PS19) BCornwall: Add module for ncmonitor [puppet] - https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) [00:05:40] (CR) CI reject: [V: -1] Add module for ncmonitor [puppet] - https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: BCornwall) [00:38:42] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/998280 [00:38:46] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/998280 (owner: TrainBranchBot) [00:45:02] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:03:44] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/998280 (owner: TrainBranchBot) [01:14:30] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:26:04] "Error: 502, Broken pipe at 2024-02-08 01:25:46 GMT" [01:26:17] "Our servers are currently under maintenance or experiencing a technical problem" [01:26:20] :O [01:26:37] "Request from [IP] via cp1108.eqiad.wmnet, ATS/9.1.4" [01:27:15] SRE, ops-codfw, Cassandra, decommission-hardware: decommission restbase20[13-20] - https://phabricator.wikimedia.org/T356695 (Jhancock.wm) Open→Resolved [01:27:57] (ProbeDown) firing: (26) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:28:05] ^^^ [01:28:22] nvm [01:28:32] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2313.codfw.wmnet, mw2271.codfw.wmnet, mw2316.codfw.wmnet, mw2409.codfw.wmnet, mw2438.codfw.wmnet, mw2392.codfw.wmnet, mw2371.codfw.wmnet, mw2274.codfw.wmnet, mw2415.codfw.wmnet, mw2333.codfw.wmnet, mw2393.codfw.wmnet, mw2311.codfw.wmnet, mw2312.codfw.wmnet, mw2375.codfw.wmnet, mw2335.codfw.wmnet, mw2413.codfw.wmnet, mw [01:28:32] fw.wmnet, mw2329.codfw.wmnet, mw2325.codfw.wmnet, mw2314.codfw.wmnet, mw2386.codfw.wmnet, mw2275.codfw.wmnet, mw2408.codfw.wmnet, mw2387.codfw.wmnet, mw2269.codfw.wmnet, mw2365.codfw.wmnet, mw2361.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2433.codfw.wmnet, mw2270.codfw.wmnet, mw2441.codfw.wmnet, mw2339.codfw.wmnet, mw2272.codfw.wmnet, mw2385.codfw.wmnet, mw2331.codfw.wmnet, mw2277.codfw.wmnet, mw2384.codfw.wmnet, mw2305.codfw [01:28:32] mw2337.codfw.wmnet, mw2307.codfw.wmnet, mw2379.codfw.wmnet, mw2407.codfw.wmnet, mw2268.codfw.wmnet, mw2336.codfw.wmnet, mw2276.codfw.wmnet, mw2363.codfw.wmnet, mw2432.codfw.wmnet, mw230 https://wikitech.wikimedia.org/wiki/PyBal [01:29:04] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2313.codfw.wmnet, mw2409.codfw.wmnet, mw2438.codfw.wmnet, mw2331.codfw.wmnet, mw2392.codfw.wmnet, mw2414.codfw.wmnet, mw2312.codfw.wmnet, mw2375.codfw.wmnet, mw2338.codfw.wmnet, mw2325.codfw.wmnet, mw2393.codfw.wmnet, mw2314.codfw.wmnet, mw2386.codfw.wmnet, mw2408.codfw.wmnet, mw2387.codfw.wmnet, mw2269.codfw.wmnet, mw [01:29:04] fw.wmnet, mw2327.codfw.wmnet, mw2373.codfw.wmnet, mw2433.codfw.wmnet, mw2270.codfw.wmnet, mw2335.codfw.wmnet, mw2339.codfw.wmnet, mw2385.codfw.wmnet, mw2274.codfw.wmnet, mw2441.codfw.wmnet, 
mw2384.codfw.wmnet, mw2305.codfw.wmnet, mw2388.codfw.wmnet, mw2337.codfw.wmnet, mw2307.codfw.wmnet, mw2379.codfw.wmnet, mw2407.codfw.wmnet, mw2336.codfw.wmnet, mw2303.codfw.wmnet, mw2391.codfw.wmnet, mw2309.codfw.wmnet, mw2380.codfw.wmnet, mw2311.codfw [01:29:04] mw2367.codfw.wmnet, mw2390.codfw.wmnet, mw2371.codfw.wmnet are marked down but pooled: mw-web_4450: Servers kubernetes2046.codfw.wmnet, kubernetes2007.codfw.wmnet, mw2350.codfw.wmnet, k https://wikitech.wikimedia.org/wiki/PyBal [01:29:10] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers kubernetes1009.eqiad.wmnet, mw1380.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1473.eqiad.wmnet, mw1475.eqiad.wmnet, kubernetes1062.eqiad.wmnet, kubernetes1043.eqiad.wmnet, kubernetes1061.eqiad.wmnet, mw1439.eqiad.wmnet, mw1464.eqiad.wmnet, mw1459.eqiad.wmnet, kubernetes1042.eqiad.wmnet, kubernetes1047.eqiad.wmnet, kubernetes1030. [01:29:10] net, mw1382.eqiad.wmnet, mw1461.eqiad.wmnet, mw1472.eqiad.wmnet, kubernetes1052.eqiad.wmnet, mw1408.eqiad.wmnet, mw1495.eqiad.wmnet, mw1363.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1050.eqiad.wmnet, kubernetes1035.eqiad.wmnet, mw1379.eqiad.wmnet, kubernetes1026.eqiad.wmnet, kubernetes1033.eqiad.wmnet, kubernetes1018.eqiad.wmnet, mw1383.eqiad.wmnet, mw1460.eqiad.wmnet, kubernetes1057.eqiad.wmnet, kubernetes1053.eqiad.wmnet, kuber [01:29:10] 5.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1376.eqiad.wmnet, kubernetes1054.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:29:28] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers mw1469.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1039.eqiad.wmnet, kubernetes1041.eqiad.wmnet, mw1471.eqiad.wmnet, mw1459.eqiad.wmnet, kubernetes1028.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1043.eqiad.wmnet, kubernetes1031.eqiad.wmnet, kubernetes1007.eqiad.wmnet, mw1470.eqiad.wmnet, 
mw1396.eqiad.wmnet, kuberne [01:29:28] eqiad.wmnet, mw1381.eqiad.wmnet, kubernetes1047.eqiad.wmnet, mw1377.eqiad.wmnet, kubernetes1038.eqiad.wmnet, kubernetes1029.eqiad.wmnet, mw1488.eqiad.wmnet, kubernetes1060.eqiad.wmnet, kubernetes1052.eqiad.wmnet, mw1408.eqiad.wmnet, mw1440.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1035.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1036.eqiad.wmnet, kubernetes1059.eqiad.wmnet, mw1375.eqiad.wmnet, k [01:29:28] s1057.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1045.eqiad.wmnet, kubernetes1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:29:29] (PHPFPMTooBusy) firing: (2) Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:29:32] (ProbeDown) firing: (15) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:29:44] (HaproxyUnavailable) firing: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [01:30:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:30:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:30:46] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [01:30:50] (ProbeDown) firing: (19) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:31:15] (MediaWikiLatencyExceeded) firing: (2) Average latency high: codfw appserver GET/200: 31.89470647072931s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:31:15] (MediaWikiLatencyExceeded) firing: Average latency high: codfw api_appserver GET/200: 0.4803601000494778s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyE [01:31:21] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-web (k8s) 57m 15s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:32:57] (ProbeDown) firing: (26) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:34:00] 10SRE: PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - https://phabricator.wikimedia.org/T356951 (10lmata) [01:34:27] (PHPFPMTooBusy) resolved: (2) Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:34:31] (ProbeDown) firing: 
(29) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:34:44] (HaproxyUnavailable) resolved: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [01:35:02] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:35:15] (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:35:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 32.77% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:35:32] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:36:15] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw api_appserver GET/200: ... 
[01:36:15] 0.21419962203357562s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:36:15] (MediaWikiLatencyExceeded) resolved: (2) Average latency high: codfw appserver GET/200: 1.8584347800480276s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:36:20] (MediaWikiLatencyExceeded) resolved: (2) p75 latency high: codfw mw-web (k8s) 1h 5m 19s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:37:27] SRE: PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - https://phabricator.wikimedia.org/T356951 (lmata) p:Triage→High [01:37:58] (ProbeDown) resolved: (24) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:38:19] what is/was that?
[01:40:15] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 35.67% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:42:12] 10SRE: PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - https://phabricator.wikimedia.org/T356951 (10lmata) [01:54:23] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [01:54:28] 10SRE: PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - https://phabricator.wikimedia.org/T356951 (10Scott_French) Took a quick look at this - seems like a large latency excursion on the appserver side (https://grafana.wikimedia.org/goto/_0N8q7hIk?orgId=1), which correlates with a large spike... [01:57:03] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2196 to codfw - jhancock@cumin2002" [01:57:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2196 to codfw - jhancock@cumin2002" [01:57:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:58:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2196.mgmt.codfw.wmnet with reboot policy FORCED [02:00:06] RECOVERY - Check systemd state on stat1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:21] (03PS20) 10BCornwall: Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) [02:01:34] (03CR) 10CI reject: [V: 04-1] Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [02:04:02] PROBLEM - Check systemd state on stat1010 is CRITICAL: CRITICAL - degraded: The following units failed: 
rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:10:39] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [02:12:40] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2197 to codfw - jhancock@cumin2002" [02:13:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2197 to codfw - jhancock@cumin2002" [02:13:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:13:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2197.mgmt.codfw.wmnet with reboot policy FORCED [02:15:58] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [02:17:43] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:17:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2198 to codfw - jhancock@cumin2002" [02:18:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2198 to codfw - jhancock@cumin2002" [02:18:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:20:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2198.mgmt.codfw.wmnet with reboot policy FORCED [02:30:06] RECOVERY - Check systemd state on stat1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:33:22] PROBLEM - Check systemd state on gitlab1003 is 
CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:34:00] PROBLEM - Check systemd state on stat1010 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:35:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2197.mgmt.codfw.wmnet with reboot policy FORCED [02:37:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2196.mgmt.codfw.wmnet with reboot policy FORCED [02:39:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2198.mgmt.codfw.wmnet with reboot policy FORCED [03:00:04] RECOVERY - Check systemd state on stat1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:25] (03PS1) 10Tim Starling: beta: Switch block schema to read-old/write-both mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998624 (https://phabricator.wikimedia.org/T355034) [03:00:27] (03PS1) 10Tim Starling: beta: Switch block schema to read-new/write-both mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998625 (https://phabricator.wikimedia.org/T355034) [03:00:30] (03PS1) 10Tim Starling: beta: Switch block schema to read-new/write-new mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998626 (https://phabricator.wikimedia.org/T355034) [03:03:56] PROBLEM - Check systemd state on stat1011 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:09:33] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:12:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:13:24] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:04:29] (PS21) BCornwall: Add module for ncmonitor [puppet] - https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) [04:12:44] (CR) BCornwall: Add module for ncmonitor (4 comments) [puppet] - https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: BCornwall) [05:09:22] RECOVERY - BFD status on cr2-eqiad is OK: UP: 17 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:09:38] RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:11:54] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:12:44] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:35:51] (ProbeDown) firing: Service build2001:873 has failed probes
(tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:37:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [05:38:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [05:49:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance [05:49:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance [05:53:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2103 es2020 T355862', diff saved to https://phabricator.wikimedia.org/P56487 and previous config saved to /var/cache/conftool/dbconfig/20240208-055316-root.json [05:53:21] T355862: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 [05:53:35] SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (Marostegui) The databases are ready to be moved any time.
[05:54:49] (PS1) Marostegui: db2114: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/998656 [05:58:03] (CR) Marostegui: [C: +2] db2114: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/998656 (owner: Marostegui) [05:59:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance [05:59:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance [05:59:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2108 (T355609)', diff saved to https://phabricator.wikimedia.org/P56488 and previous config saved to /var/cache/conftool/dbconfig/20240208-055944-marostegui.json [05:59:59] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [06:02:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2020 to es1 primary T351916', diff saved to https://phabricator.wikimedia.org/P56489 and previous config saved to /var/cache/conftool/dbconfig/20240208-060204-root.json [06:02:21] T351916: Migrate es1 to Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351916 [06:02:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2032 T351916', diff saved to https://phabricator.wikimedia.org/P56490 and previous config saved to /var/cache/conftool/dbconfig/20240208-060226-root.json [06:03:12] (PS1) Marostegui: es2032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/998658 (https://phabricator.wikimedia.org/T351916) [06:03:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2032.codfw.wmnet with OS bookworm [06:04:37] (CR) Marostegui: [C: +2] es2032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/998658 (https://phabricator.wikimedia.org/T351916) (owner: Marostegui) [06:12:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance
db2108 (T355609)', diff saved to https://phabricator.wikimedia.org/P56491 and previous config saved to /var/cache/conftool/dbconfig/20240208-061200-marostegui.json [06:12:05] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [06:17:43] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:21:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2032.codfw.wmnet with reason: host reimage [06:22:09] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for JTanner - https://phabricator.wikimedia.org/T356917 (10MMiller_WMF) I am @JTannerWMF's manager and I approve. [06:23:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2032.codfw.wmnet with reason: host reimage [06:27:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P56492 and previous config saved to /var/cache/conftool/dbconfig/20240208-062706-marostegui.json [06:36:21] (03PS1) 10Marostegui: Revert "es2032: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/998675 [06:41:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2032.codfw.wmnet with OS bookworm [06:41:49] (03CR) 10Marostegui: [C: 03+2] Revert "es2032: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/998675 (owner: 10Marostegui) [06:42:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P56493 and previous config saved to /var/cache/conftool/dbconfig/20240208-064213-marostegui.json [06:45:05] RECOVERY - Check systemd state on stat1011 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 38 hosts with reason: Primary switchover s4 T355658 [06:47:01] T355658: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T355658 [06:47:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 38 hosts with reason: Primary switchover s4 T355658 [06:48:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2179 with weight 0 T355658', diff saved to https://phabricator.wikimedia.org/P56494 and previous config saved to /var/cache/conftool/dbconfig/20240208-064802-arnaudb.json [06:48:13] (03CR) 10Vgutierrez: Add module for ncmonitor (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [06:48:25] PROBLEM - Check systemd state on stat1011 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:50:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:50:25] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:54:33] (03PS1) 10Kosta Harlan: Use real anonymous user in ComputedUserImpactLookup [extensions/GrowthExperiments] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998676 (https://phabricator.wikimedia.org/T356895) [06:55:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P56495 and previous config saved to 
/var/cache/conftool/dbconfig/20240208-065513-root.json [06:56:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2032 back to es1 primary T351916', diff saved to https://phabricator.wikimedia.org/P56496 and previous config saved to /var/cache/conftool/dbconfig/20240208-065607-root.json [06:56:12] T351916: Migrate es1 to Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351916 [06:57:11] (03PS1) 10Raymond Ndibe: [domainproxy]: increase client_max_body_size [puppet] - 10https://gerrit.wikimedia.org/r/998659 (https://phabricator.wikimedia.org/T351178) [06:57:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T355609)', diff saved to https://phabricator.wikimedia.org/P56497 and previous config saved to /var/cache/conftool/dbconfig/20240208-065720-marostegui.json [06:57:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance [06:57:24] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [06:57:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance [06:57:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2120 (T355609)', diff saved to https://phabricator.wikimedia.org/P56498 and previous config saved to /var/cache/conftool/dbconfig/20240208-065742-marostegui.json [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T0700) [07:00:04] kormat, marostegui, Amir1, and arnaudb: May I have your attention please! Primary database switchover. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T0700) [07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:10:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T355609)', diff saved to https://phabricator.wikimedia.org/P56499 and previous config saved to /var/cache/conftool/dbconfig/20240208-071006-marostegui.json [07:10:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P56500 and previous config saved to /var/cache/conftool/dbconfig/20240208-071018-root.json [07:10:28] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [07:12:34] (03CR) 10Arnaudb: [C: 03+2] mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/992181 (https://phabricator.wikimedia.org/T355658) (owner: 10Gerrit maintenance bot) [07:12:53] !log Starting s4 codfw failover from db2140 to db2179 - T355658 [07:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:07] T355658: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T355658 [07:14:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set s4 codfw as read-only for maintenance - T355658', diff saved to 
https://phabricator.wikimedia.org/P56501 and previous config saved to /var/cache/conftool/dbconfig/20240208-071414-arnaudb.json [07:16:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2179 to s4 primary and set section read-write T355658', diff saved to https://phabricator.wikimedia.org/P56502 and previous config saved to /var/cache/conftool/dbconfig/20240208-071559-arnaudb.json [07:16:15] (03CR) 10Arnaudb: [C: 03+2] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/992182 (https://phabricator.wikimedia.org/T355658) (owner: 10Gerrit maintenance bot) [07:16:37] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:18:53] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:19:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2140 T355658', diff saved to https://phabricator.wikimedia.org/P56503 and previous config saved to /var/cache/conftool/dbconfig/20240208-071916-arnaudb.json [07:19:20] T355658: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T355658 [07:24:34] (03CR) 10Dom Walden: [C: 03+1] beta: Switch block schema to read-old/write-both mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998624 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [07:25:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P56504 and previous config saved to /var/cache/conftool/dbconfig/20240208-072512-marostegui.json [07:25:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P56505 and 
previous config saved to /var/cache/conftool/dbconfig/20240208-072523-root.json [07:28:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2140 as able to serve API', diff saved to https://phabricator.wikimedia.org/P56506 and previous config saved to /var/cache/conftool/dbconfig/20240208-072808-arnaudb.json [07:34:18] (PS1) Arnaudb: dns: update s4 master [dns] - https://gerrit.wikimedia.org/r/998282 (https://phabricator.wikimedia.org/T355658) [07:35:01] (CR) Marostegui: [C: +1] dns: update s4 master [dns] - https://gerrit.wikimedia.org/r/998282 (https://phabricator.wikimedia.org/T355658) (owner: Arnaudb) [07:35:29] (CR) Arnaudb: [C: +2] dns: update s4 master [dns] - https://gerrit.wikimedia.org/r/998282 (https://phabricator.wikimedia.org/T355658) (owner: Arnaudb) [07:39:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [07:39:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [07:40:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P56507 and previous config saved to /var/cache/conftool/dbconfig/20240208-074019-marostegui.json [07:40:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P56508 and previous config saved to /var/cache/conftool/dbconfig/20240208-074029-root.json [07:41:37] (CR) Vgutierrez: [V: +1 C: +2] interface::ipip: Create ipip[6]0 devices as auto [puppet] - https://gerrit.wikimedia.org/r/998438 (owner: Vgutierrez) [07:44:59] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [07:45:30] !log repool ncredir2001 [07:45:32] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:46:16] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:46:56] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:49:09] !log reboot ncredir2002 to validate https://gerrit.wikimedia.org/r/c/operations/puppet/+/998438 [07:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [07:55:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T355609)', diff saved to https://phabricator.wikimedia.org/P56509 and previous config saved to /var/cache/conftool/dbconfig/20240208-075526-marostegui.json [07:55:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [07:55:34] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [07:55:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P56510 and previous config saved to /var/cache/conftool/dbconfig/20240208-075534-root.json [07:55:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [07:55:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2121 (T355609)', diff saved to https://phabricator.wikimedia.org/P56511 and previous config saved to 
/var/cache/conftool/dbconfig/20240208-075549-marostegui.json [07:58:39] jouncebot: nowandnext [07:58:39] For the next 0 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T0700) [07:58:39] In 0 hour(s) and 1 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T0800) [07:59:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51453 bytes in 3.555 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:59:38] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:00:05] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:00:16] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:03:55] (CR) Urbanecm: [C: +2] Use real anonymous user in ComputedUserImpactLookup [extensions/GrowthExperiments] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/998676 (https://phabricator.wikimedia.org/T356895) (owner: Kosta Harlan) [08:04:03] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/998676 (https://phabricator.wikimedia.org/T356895) (owner: Kosta Harlan) [08:04:07] SRE, SRE-swift-storage, ops-codfw, Infrastructure-Foundations, netops: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (MoritzMuehlenhoff) I've kicked off a rebalance of ganeti/A now that the maintenance is over. [08:08:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T355609)', diff saved to https://phabricator.wikimedia.org/P56512 and previous config saved to /var/cache/conftool/dbconfig/20240208-080814-marostegui.json [08:08:18] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:10:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P56513 and previous config saved to /var/cache/conftool/dbconfig/20240208-081039-root.json [08:13:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:14:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:14:40] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:15:48] RECOVERY - mailman list info ssl expiry
on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:16:10] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51452 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:16:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.298 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:17:33] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [08:19:15] !log vgutierrez@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir and not P{ncredir2.*} and A:ncredir [08:23:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P56514 and previous config saved to /var/cache/conftool/dbconfig/20240208-082320-marostegui.json [08:25:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P56515 and previous config saved to /var/cache/conftool/dbconfig/20240208-082544-root.json [08:26:38] SRE, Wikimedia-Etherpad, collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (akosiaris) > To make sure sure the apt package requires the correct nodejs version I think we have to bump this to nodejs (>= 16) and also a compatible npm version. I...
[08:27:27] (Merged) jenkins-bot: Use real anonymous user in ComputedUserImpactLookup [extensions/GrowthExperiments] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/998676 (https://phabricator.wikimedia.org/T356895) (owner: Kosta Harlan) [08:29:18] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:998676|Use real anonymous user in ComputedUserImpactLookup (T356895)]] [08:29:22] T356895: [wmf.17 - testwiki] Impact module: Temporary delay in getting your information - https://phabricator.wikimedia.org/T356895 [08:30:06] RECOVERY - Check systemd state on stat1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:25] (SystemdUnitFailed) firing: (2) prometheus-phpfpm-statustext-textfile.service Failed on mwdebug1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:33:58] PROBLEM - Check systemd state on stat1010 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:08] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:998676|Use real anonymous user in ComputedUserImpactLookup (T356895)]] (duration: 07m 49s) [08:37:12] T356895: [wmf.17 - testwiki] Impact module: Temporary delay in getting your information - https://phabricator.wikimedia.org/T356895 [08:38:25] (SystemdUnitFailed) firing: (34) prometheus-phpfpm-statustext-textfile.service Failed on mw1357:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:38:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P56516 and
previous config saved to /var/cache/conftool/dbconfig/20240208-083827-marostegui.json [08:42:20] (CR) Muehlenhoff: [C: +1] "Not sure what the CI failure is about, but the change LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) (owner: Slyngshede) [08:43:25] (SystemdUnitFailed) resolved: (34) prometheus-phpfpm-statustext-textfile.service Failed on mw1357:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:50:10] (CR) Muehlenhoff: [C: +2] debmonitor: Remove legacy cert handling [puppet] - https://gerrit.wikimedia.org/r/995183 (owner: Muehlenhoff) [08:53:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T355609)', diff saved to https://phabricator.wikimedia.org/P56517 and previous config saved to /var/cache/conftool/dbconfig/20240208-085334-marostegui.json [08:53:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [08:53:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [08:53:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T355609)', diff saved to https://phabricator.wikimedia.org/P56518 and previous config saved to /var/cache/conftool/dbconfig/20240208-085357-marostegui.json [08:53:59] (CR) Muehlenhoff: [C: +2] ferm::filter_log: Make ensurable [puppet] - https://gerrit.wikimedia.org/r/995211 (https://phabricator.wikimedia.org/T356174) (owner: Muehlenhoff) [08:55:51] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:56:18] (PS1) Slyngshede: Fix testcase to allow CI to run.
[software/bitu] - https://gerrit.wikimedia.org/r/998765 [08:56:35] (Abandoned) Muehlenhoff: Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff) [08:58:30] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945776 (owner: Muehlenhoff) [09:01:22] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudweb1003.wikimedia.org [09:01:35] (CR) Muehlenhoff: Add informative titles to all pages. (4 comments) [software/bitu] - https://gerrit.wikimedia.org/r/998365 (https://phabricator.wikimedia.org/T351136) (owner: Slyngshede) [09:06:24] (PS1) MVernon: swift: remove ms-be10[44-50] from swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/998770 (https://phabricator.wikimedia.org/T353149) [09:06:26] (PS8) Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 [09:06:42] (PS9) Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 [09:07:11] (CR) CI reject: [V: -1] LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede) [09:08:13] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1003.wikimedia.org [09:10:57] (PS1) Filippo Giunchedi: thanos: tighten memory limits for query/query-frontend [puppet] - https://gerrit.wikimedia.org/r/998773 (https://phabricator.wikimedia.org/T356788) [09:11:19] (CR) Muehlenhoff: [C: +1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/998765 (owner: Slyngshede) [09:11:34] (CR) Slyngshede: [C: +2] Fix testcase to allow CI to run. [software/bitu] - https://gerrit.wikimedia.org/r/998765 (owner: Slyngshede) [09:12:33] (PS3) Slyngshede: Provide context for account creation.
[software/bitu] - https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) [09:13:14] (CR) Btullis: [C: +1] "Looks good to me. Thanks for adding me." [puppet] - https://gerrit.wikimedia.org/r/998483 (owner: Andrew Bogott) [09:14:00] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:15:16] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:18:21] (CR) Slyngshede: [C: +2] Provide context for account creation. [software/bitu] - https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) (owner: Slyngshede) [09:18:35] (PS10) Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 [09:18:39] SRE, observability, Patch-For-Review, Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 (fgiunchedi) Current avenues I'm exploring: * Tighten the memory limits, `thanos-query` memory utilization jumps up...
[09:21:03] !log vgutierrez@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir and not P{ncredir2.*} and A:ncredir [09:21:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [09:25:02] (CR) Marostegui: [C: +1] swift: remove ms-be10[44-50] from swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/998770 (https://phabricator.wikimedia.org/T353149) (owner: MVernon) [09:25:48] (PS2) Slyngshede: Add informative titles to all pages. [software/bitu] - https://gerrit.wikimedia.org/r/998365 (https://phabricator.wikimedia.org/T351136) [09:26:44] (CR) Arturo Borrero Gonzalez: [C: +2] hieradata: cloud_private_subnet: remove old comment [puppet] - https://gerrit.wikimedia.org/r/998469 (owner: Arturo Borrero Gonzalez) [09:27:19] (PS3) Slyngshede: Add informative titles to all pages. [software/bitu] - https://gerrit.wikimedia.org/r/998365 (https://phabricator.wikimedia.org/T351136) [09:29:15] (CR) Slyngshede: Add informative titles to all pages. (4 comments) [software/bitu] - https://gerrit.wikimedia.org/r/998365 (https://phabricator.wikimedia.org/T351136) (owner: Slyngshede) [09:29:58] (PS2) Slyngshede: Use the ManifestStaticFilesStorage in production [software/bitu] - https://gerrit.wikimedia.org/r/998426 [09:31:24] (Abandoned) Slyngshede: Code cleanup before enabling CI pipeline.
[software/bitu] - https://gerrit.wikimedia.org/r/992074 (owner: Slyngshede) [09:31:50] SRE, SRE-Access-Requests, Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (Jelto) [09:34:38] SRE, SRE-Access-Requests, Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (Jelto) [09:34:45] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [09:34:53] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/998659 (https://phabricator.wikimedia.org/T351178) (owner: Raymond Ndibe) [09:35:39] (CR) MVernon: [C: +2] swift: remove ms-be10[44-50] from swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/998770 (https://phabricator.wikimedia.org/T353149) (owner: MVernon) [09:35:51] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:36:41] SRE, SRE-Access-Requests, Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (Jelto) @cchen I unblocked your wikitech account. I checked all services above which should work. Can you try again accessing superset? (or resetting... [09:36:43] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [09:36:54] (CR) Volans: [C: -1] "I did a quick pass, looks sane in general, apart a small typo, see inline. I've also left a couple of optional comments."
[cookbooks] - https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: Ryan Kemper) [09:37:38] (CR) Majavah: [C: -1] "I do not think this will work as expected. Nginx buffers these in a tmpfs folder, which is only 1G:" [puppet] - https://gerrit.wikimedia.org/r/998659 (https://phabricator.wikimedia.org/T351178) (owner: Raymond Ndibe) [09:39:07] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for JTanner - https://phabricator.wikimedia.org/T356917 (Jelto) p:Triage→Medium [09:43:25] SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney) [09:44:07] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for JTanner - https://phabricator.wikimedia.org/T356917 (Jelto) [09:44:09] SRE, SRE-swift-storage, ops-codfw, Infrastructure-Foundations, netops: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (cmooney) Open→Resolved a:cmooney >>! In T355861#9523826, @MoritzMuehlenhoff wrote: > I've kicked... [09:44:15] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for JTanner - https://phabricator.wikimedia.org/T356917 (Jelto) @JTannerWMF you have to sign the L3 Wikimedia Server Access Responsibilities Document. Also we need a SSH public key (must be a separate key from Wikimedia cloud SSH...
[09:44:19] (CR) Volans: "post-merge comment" [software/bitu] - https://gerrit.wikimedia.org/r/998765 (owner: Slyngshede) [09:44:43] (PS1) Arturo Borrero Gonzalez: cloudgw: drop temporal NAT for legacy DNS resolvers [puppet] - https://gerrit.wikimedia.org/r/998780 (https://phabricator.wikimedia.org/T346426) [09:45:02] RECOVERY - Check systemd state on stat1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:56] (CR) Majavah: cloudgw: drop temporal NAT for legacy DNS resolvers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/998780 (https://phabricator.wikimedia.org/T346426) (owner: Arturo Borrero Gonzalez) [09:48:36] (CR) Slyngshede: [C: +1] "LGTM" [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991044 (owner: Majavah) [09:48:56] PROBLEM - Check systemd state on stat1010 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:57] (CR) Majavah: [C: +2] Add py.typed marker [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991044 (owner: Majavah) [09:49:08] (CR) Arturo Borrero Gonzalez: [C: -1] "don't merge this just yet."
[puppet] - https://gerrit.wikimedia.org/r/998780 (https://phabricator.wikimedia.org/T346426) (owner: Arturo Borrero Gonzalez) [09:49:10] (PS10) Majavah: P:wmcs::cloudgw: do not NAT traffic to cloud-internal networks [puppet] - https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) [09:50:23] (Merged) jenkins-bot: Add py.typed marker [software/bitu-ldap] - https://gerrit.wikimedia.org/r/991044 (owner: Majavah) [09:51:27] (CR) Arturo Borrero Gonzalez: P:wmcs::cloudgw: do not NAT traffic to cloud-internal networks (2 comments) [puppet] - https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: Majavah) [09:51:44] 989 [09:51:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [09:51:53] gonna play that at the lotto [09:52:48] (CR) Majavah: P:wmcs::cloudgw: do not NAT traffic to cloud-internal networks (7 comments) [puppet] - https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: Majavah) [09:53:32] (PS1) Slyngshede: Bump Python version in CI from 3.7 to 3.11 minimum. [software/bitu] - https://gerrit.wikimedia.org/r/998784 [09:53:49] (CR) CI reject: [V: -1] Bump Python version in CI from 3.7 to 3.11 minimum. [software/bitu] - https://gerrit.wikimedia.org/r/998784 (owner: Slyngshede) [09:54:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T355609)', diff saved to https://phabricator.wikimedia.org/P56520 and previous config saved to /var/cache/conftool/dbconfig/20240208-095429-marostegui.json [09:54:34] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:55:06] (PS2) Slyngshede: Bump Python version in CI from 3.7 to 3.9 minimum.
[software/bitu] - https://gerrit.wikimedia.org/r/998784 [09:58:36] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: Majavah) [10:00:28] !log jiji@cumin1002 conftool action : set/pooled=no; selector: service=kubesvc,name=mw2282.codfw.wmnet [10:01:05] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [10:01:44] !log jiji@cumin1002 conftool action : set/pooled=inactive; selector: service=kubesvc,name=mw2282.codfw.wmnet [10:02:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [10:03:03] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [10:03:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [10:04:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [10:04:33] (PS1) Filippo Giunchedi: thanos: enable request debug for query / query-frontend [puppet] - https://gerrit.wikimedia.org/r/998786 (https://phabricator.wikimedia.org/T356788) [10:04:56] (ProbeDown) firing: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:05:02] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [10:05:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2169.codfw.wmnet onto
db2194.codfw.wmnet [10:05:38] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:06:21] (PS3) Slyngshede: Bump Python version in CI from 3.7 to 3.9 minimum. [software/bitu] - https://gerrit.wikimedia.org/r/998784 [10:06:46] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [10:07:09] ops-codfw, DC-Ops, serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (jijiki) [10:07:16] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [10:07:33] (CR) Slyngshede: [C: +2] Fix testcase to allow CI to run. (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/998765 (owner: Slyngshede) [10:08:45] (PS4) Slyngshede: Bump Python version in CI from 3.7 to 3.9 minimum. [software/bitu] - https://gerrit.wikimedia.org/r/998784 [10:09:30] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:09:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P56521 and previous config saved to /var/cache/conftool/dbconfig/20240208-100936-marostegui.json [10:09:56] (ProbeDown) resolved: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:17:43] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:24:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P56522 and previous config saved to /var/cache/conftool/dbconfig/20240208-102442-marostegui.json [10:25:35] (CR) Clément Goubert: [V: +1] "I would feel more comfortable if y'all did it, that way you can control the timing of having 3 DC's worth of lvs on the same conf node.
Th" [puppet] - https://gerrit.wikimedia.org/r/998431 (https://phabricator.wikimedia.org/T355870) (owner: Clément Goubert) [10:37:36] (PS1) Jelto: sre.gitlab.upgrade: increase downtime to 120 minutes [cookbooks] - https://gerrit.wikimedia.org/r/998793 (https://phabricator.wikimedia.org/T356968) [10:38:26] !log hnowlan@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T334488) [10:38:32] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/998793 (https://phabricator.wikimedia.org/T356968) (owner: Jelto) [10:38:33] T334488: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 [10:39:31] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T334488) [10:39:35] (PS2) Jelto: sre.gitlab.upgrade: increase downtime to 180 minutes [cookbooks] - https://gerrit.wikimedia.org/r/998793 (https://phabricator.wikimedia.org/T356968) [10:39:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T355609)', diff saved to https://phabricator.wikimedia.org/P56523 and previous config saved to /var/cache/conftool/dbconfig/20240208-103949-marostegui.json [10:39:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [10:39:54] (CR) Clément Goubert: [C: +1] icinga: use systemd::timer::job for 'update-etcd-mw-config-lastindex' [puppet] - https://gerrit.wikimedia.org/r/998417 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi) [10:39:54] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:40:03] !log hnowlan@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T334488) [10:40:05] !log marostegui@cumin1002 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [10:40:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T355609)', diff saved to https://phabricator.wikimedia.org/P56524 and previous config saved to /var/cache/conftool/dbconfig/20240208-104011-marostegui.json [10:40:17] (CR) Jelto: sre.gitlab.upgrade: increase downtime to 180 minutes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/998793 (https://phabricator.wikimedia.org/T356968) (owner: Jelto) [10:41:00] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/998793 (https://phabricator.wikimedia.org/T356968) (owner: Jelto) [10:41:11] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T334488) [10:43:00] (PS1) Slyngshede: Inform that cookies are required. [software/bitu] - https://gerrit.wikimedia.org/r/998795 (https://phabricator.wikimedia.org/T348435) [10:43:21] (PS1) Btullis: Add a new Icinga contact group for team-data-platform [puppet] - https://gerrit.wikimedia.org/r/998796 (https://phabricator.wikimedia.org/T342578) [10:43:27] (PS3) Alexandros Kosiaris: base.meta: Remove dependency on the mesh module ("copy" change) [deployment-charts] - https://gerrit.wikimedia.org/r/991368 [10:43:29] (PS5) Alexandros Kosiaris: base.meta: Remove dependency on the mesh module [deployment-charts] - https://gerrit.wikimedia.org/r/991369 [10:43:31] (PS1) Alexandros Kosiaris: modules.mesh: Sort based on name+version [deployment-charts] - https://gerrit.wikimedia.org/r/998797 [10:43:33] (PS1) Alexandros Kosiaris: modules: Bump users of base.meta to 2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/998798 [10:44:30] (CR) Muehlenhoff: [C: +1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/998365 (https://phabricator.wikimedia.org/T351136) (owner: Slyngshede)
[10:46:49] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1335/console" [puppet] - 10https://gerrit.wikimedia.org/r/998796 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [10:47:01] (03CR) 10Alexandros Kosiaris: "> Use the new structure, including the change in annotations, in the scaffolding templates" [deployment-charts] - 10https://gerrit.wikimedia.org/r/991369 (owner: 10Alexandros Kosiaris) [10:47:33] (03CR) 10Slyngshede: [C: 03+2] Add informative titles to all pages. [software/bitu] - 10https://gerrit.wikimedia.org/r/998365 (https://phabricator.wikimedia.org/T351136) (owner: 10Slyngshede) [10:49:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/998773 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [10:49:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [10:50:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [10:51:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T355609)', diff saved to https://phabricator.wikimedia.org/P56525 and previous config saved to /var/cache/conftool/dbconfig/20240208-105110-marostegui.json [10:51:15] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:57:50] (03CR) 10EoghanGaffney: [C: 03+1] sre.gitlab.upgrade: increase downtime to 180 minutes [cookbooks] - 10https://gerrit.wikimedia.org/r/998793 (https://phabricator.wikimedia.org/T356968) (owner: 10Jelto) [10:59:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ABran-WMF) [10:59:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - 
https://phabricator.wikimedia.org/T342166 (10ABran-WMF) [10:59:41] (03PS1) 10Hnowlan: thumbor: scale down a little [deployment-charts] - 10https://gerrit.wikimedia.org/r/998805 [11:00:04] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T1100). [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T1100) [11:00:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/998795 (https://phabricator.wikimedia.org/T348435) (owner: 10Slyngshede) [11:05:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, out of curiosity which icinga alerts will you be targeting?" [puppet] - 10https://gerrit.wikimedia.org/r/998796 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [11:05:18] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: tighten memory limits for query/query-frontend [puppet] - 10https://gerrit.wikimedia.org/r/998773 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [11:06:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P56526 and previous config saved to /var/cache/conftool/dbconfig/20240208-110616-marostegui.json [11:08:36] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10AndrewTavis_WMDE) @Manuel and I would suggest that this task remain open. Decisions on the data processes that require this account's private da... 
[11:08:47] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10cmooney) I think there may be an issue here with the cable (usually the NIC firmware issue hits us when the debian-installer does it's DHCP request, rather than at the initia... [11:09:31] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] icinga: use systemd::timer::job for 'update-etcd-mw-config-lastindex' [puppet] - 10https://gerrit.wikimedia.org/r/998417 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:11:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/998786 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [11:13:54] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: enable request debug for query / query-frontend [puppet] - 10https://gerrit.wikimedia.org/r/998786 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [11:14:59] (03CR) 10Btullis: [V: 03+1] "Thanks Fillippo. Legacy ones :-)" [puppet] - 10https://gerrit.wikimedia.org/r/998796 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [11:16:10] (03CR) 10Filippo Giunchedi: "As of today:" [puppet] - 10https://gerrit.wikimedia.org/r/998424 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:16:23] (03CR) 10Muehlenhoff: "It's still used in Puppet in various places, though? E.g. confd::instance or Chartmuseum?" [puppet] - 10https://gerrit.wikimedia.org/r/998424 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:17:34] (03CR) 10Muehlenhoff: [C: 03+1] "Ignore my previous comment, I had a old WIP patch branch open." 
[puppet] - 10https://gerrit.wikimedia.org/r/998424 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:18:36] (03CR) 10Kamila Součková: [C: 03+1] thumbor: scale down a little [deployment-charts] - 10https://gerrit.wikimedia.org/r/998805 (owner: 10Hnowlan) [11:21:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P56527 and previous config saved to /var/cache/conftool/dbconfig/20240208-112123-marostegui.json [11:22:33] (03CR) 10Filippo Giunchedi: [C: 03+2] nrpe: remove monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/998424 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:29:43] (03PS1) 10Filippo Giunchedi: nrpe: absent systemd_scripts [puppet] - 10https://gerrit.wikimedia.org/r/998822 (https://phabricator.wikimedia.org/T337831) [11:29:45] (03PS1) 10Filippo Giunchedi: nrpe: cleanup check_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/998823 (https://phabricator.wikimedia.org/T337831) [11:31:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "Makes sense to me -- thank you for the context!" [puppet] - 10https://gerrit.wikimedia.org/r/998796 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [11:31:32] (03CR) 10Slyngshede: [C: 03+2] Inform that cookies are required. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/998795 (https://phabricator.wikimedia.org/T348435) (owner: 10Slyngshede) [11:36:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T355609)', diff saved to https://phabricator.wikimedia.org/P56528 and previous config saved to /var/cache/conftool/dbconfig/20240208-113630-marostegui.json [11:36:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance [11:36:35] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:36:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance [11:36:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:37:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:37:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T355609)', diff saved to https://phabricator.wikimedia.org/P56529 and previous config saved to /var/cache/conftool/dbconfig/20240208-113707-marostegui.json [11:37:50] (03CR) 10Cathal Mooney: P:wmcs::cloudgw: do not NAT traffic to cloud-internal networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [11:39:30] (03Abandoned) 10Cathal Mooney: Do not NAT traffic from cloud VPS to cloud-private, and filter ports [puppet] - 10https://gerrit.wikimedia.org/r/970341 (https://phabricator.wikimedia.org/T350132) (owner: 10Cathal Mooney) [11:41:19] (03CR) 10Hnowlan: [C: 03+2] thumbor: scale down a little [deployment-charts] - 10https://gerrit.wikimedia.org/r/998805 (owner: 10Hnowlan) [11:41:41] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:41:44] !log hnowlan@deploy2002 
helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:42:12] (03Merged) 10jenkins-bot: thumbor: scale down a little [deployment-charts] - 10https://gerrit.wikimedia.org/r/998805 (owner: 10Hnowlan) [11:46:51] (03PS1) 10Btullis: Update default role contactgroups from analytics to team-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/998826 (https://phabricator.wikimedia.org/T342578) [11:48:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T355609)', diff saved to https://phabricator.wikimedia.org/P56530 and previous config saved to /var/cache/conftool/dbconfig/20240208-114759-marostegui.json [11:48:04] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:58:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I am missing the actual template files here, but I am starting to doubt the approach a bit." [deployment-charts] - 10https://gerrit.wikimedia.org/r/998798 (owner: 10Alexandros Kosiaris) [11:58:32] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:58:39] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [12:01:05] (03PS1) 10Btullis: Add icinga commands to notify the Data Platform SRE team [puppet] - 10https://gerrit.wikimedia.org/r/998838 (https://phabricator.wikimedia.org/T342578) [12:01:23] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [12:01:27] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [12:03:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P56531 and previous config saved to /var/cache/conftool/dbconfig/20240208-120306-marostegui.json [12:04:01] (03PS2) 10Btullis: Add icinga configuration to notify the Data Platform SRE team [puppet] - 10https://gerrit.wikimedia.org/r/998838 (https://phabricator.wikimedia.org/T342578) [12:05:18] (03PS3) 10Btullis: Add icinga 
configuration to notify the Data Platform SRE team [puppet] - 10https://gerrit.wikimedia.org/r/998838 (https://phabricator.wikimedia.org/T342578) [12:05:49] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@6a64b3d]: restbase: Disable parsoid storage for jawiki [12:05:51] (03CR) 10Btullis: "Rebased on master, so that I can merge this before adding the contacts in the private repo." [puppet] - 10https://gerrit.wikimedia.org/r/998838 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [12:06:24] (03PS2) 10Muehlenhoff: mediawiki: Remove Ferm-specific syntax from firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/945776 [12:10:18] (03PS11) 10Majavah: P:wmcs::cloudgw: do not NAT traffic to cloud-internal networks [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) [12:10:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. https://docs.djangoproject.com/en/5.0/ref/settings/#std-setting-STATICFILES_STORAGE mentions this is deprecated in current ver" [software/bitu] - 10https://gerrit.wikimedia.org/r/998426 (owner: 10Slyngshede) [12:11:59] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1336/co" [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [12:12:15] (03CR) 10Muehlenhoff: [C: 03+2] mediawiki: Remove Ferm-specific syntax from firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/945776 (owner: 10Muehlenhoff) [12:13:19] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1337/co" [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [12:15:49] (03PS2) 10Alexandros Kosiaris: modules.mesh: Sort based on name+version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/998797 [12:15:51] (03PS4) 10Alexandros Kosiaris: base.meta: Remove dependency on the mesh module ("copy" change) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991368 [12:15:53] (03PS6) 10Alexandros Kosiaris: base.meta: Remove dependency on the mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/991369 [12:15:55] (03PS2) 10Alexandros Kosiaris: modules: Bump usages of base.meta to 2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/998798 [12:15:57] (03PS9) 10Alexandros Kosiaris: mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [12:17:22] (03PS2) 10Btullis: Update default role contactgroups from analytics to team-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/998826 (https://phabricator.wikimedia.org/T342578) [12:17:24] (03PS1) 10Btullis: Update contactgroups for analytics => team-data-platform icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/998842 (https://phabricator.wikimedia.org/T342578) [12:17:26] (03CR) 10Majavah: [V: 03+1] P:wmcs::cloudgw: do not NAT traffic to cloud-internal networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [12:18:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P56532 and previous config saved to /var/cache/conftool/dbconfig/20240208-121813-marostegui.json [12:21:39] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@6a64b3d]: restbase: Disable parsoid storage for jawiki (duration: 15m 49s) [12:22:07] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/998838 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [12:22:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] "I 've cut the gordian knot in PS9 and in the dependent changes. 
Given this patch was the current incentive to undo that knot, I feel a bit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [12:22:39] (03PS1) 10Muehlenhoff: ci: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/998848 [12:23:28] (03PS1) 10Giuseppe Lavagetto: modules: promote to base.meta:2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/998851 [12:23:30] (03PS1) 10Giuseppe Lavagetto: benthos: upgrade to base.meta:2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/998852 [12:23:32] (03PS1) 10Giuseppe Lavagetto: echoserver: update to base.meta:2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/998853 [12:23:53] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: increase downtime to 180 minutes [cookbooks] - 10https://gerrit.wikimedia.org/r/998793 (https://phabricator.wikimedia.org/T356968) (owner: 10Jelto) [12:27:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/998848 (owner: 10Muehlenhoff) [12:29:10] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: increase downtime to 180 minutes [cookbooks] - 10https://gerrit.wikimedia.org/r/998793 (https://phabricator.wikimedia.org/T356968) (owner: 10Jelto) [12:31:29] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/998822 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [12:31:52] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/998823 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [12:33:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T355609)', diff saved to https://phabricator.wikimedia.org/P56533 and previous config saved to /var/cache/conftool/dbconfig/20240208-123320-marostegui.json [12:33:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [12:33:26] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:33:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [12:33:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2168:3317 (T355609)', diff saved to https://phabricator.wikimedia.org/P56534 and previous config saved to /var/cache/conftool/dbconfig/20240208-123343-marostegui.json [12:34:29] (03CR) 10Giuseppe Lavagetto: [C: 03+1] modules: Bump usages of base.meta to 2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/998798 (owner: 10Alexandros Kosiaris) [12:35:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver1003.eqiad.wmnet [12:37:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1003.eqiad.wmnet [12:38:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] base.meta: Remove dependency on the mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/991369 (owner: 10Alexandros Kosiaris) [12:38:31] (03Abandoned) 10Giuseppe Lavagetto: modules: promote to base.meta:2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/998851 (owner: 10Giuseppe Lavagetto) [12:41:11] (03PS2) 10Giuseppe Lavagetto: benthos: upgrade to base.meta:2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/998852 [12:41:13] (03PS2) 10Giuseppe Lavagetto: echoserver: update to base.meta:2.0 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/998853 [12:41:14] PROBLEM - Check systemd state on puppetserver1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-ca.service,sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:26] (SystemdUnitFailed) firing: (3) generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:43:50] RECOVERY - Check systemd state on puppetserver1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:27] (SystemdUnitFailed) firing: (3) generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:48:40] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: remove Icinga-based systemd unit failed check [puppet] - 10https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [12:50:25] 10SRE, 10observability, 10Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 (10fgiunchedi) >>! In T356788#9523906, @fgiunchedi wrote: > Current avenues I'm exploring: > * Tighten the memory limits, `thanos-query` me... 
[12:54:25] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (10MoritzMuehlenhoff) [12:55:05] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [12:57:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T355609)', diff saved to https://phabricator.wikimedia.org/P56535 and previous config saved to /var/cache/conftool/dbconfig/20240208-125723-marostegui.json [12:57:28] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:59:12] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM, see inline though" [puppet] - 10https://gerrit.wikimedia.org/r/998838 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [12:59:58] (03CR) 10Filippo Giunchedi: [C: 03+1] Update default role contactgroups from analytics to team-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/998826 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T1300) [13:02:51] (03CR) 10Filippo Giunchedi: [C: 03+2] nrpe: absent systemd_scripts [puppet] - 10https://gerrit.wikimedia.org/r/998822 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [13:03:49] (03PS1) 10Dreamy Jazz: MediaModerationImageContentsLookup: use proxied HTTP request to generate file [extensions/MediaModeration] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998703 (https://phabricator.wikimedia.org/T356047) [13:07:04] (03PS1) 10Dreamy Jazz: MediaModerationImageContentsLookup: use proxied HTTP request to generate file [extensions/MediaModeration] (wmf/1.42.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/998704 (https://phabricator.wikimedia.org/T356047) [13:10:37] (03PS1) 10Marostegui: filtered_tables.txt: Add new column [puppet] - 10https://gerrit.wikimedia.org/r/998870 (https://phabricator.wikimedia.org/T356988) [13:12:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P56536 and previous config saved to /var/cache/conftool/dbconfig/20240208-131229-marostegui.json [13:13:08] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudweb2002-dev.wikimedia.org with OS bullseye [13:15:21] (03PS4) 10Btullis: Add icinga configuration to notify the Data Platform SRE team [puppet] - 10https://gerrit.wikimedia.org/r/998838 (https://phabricator.wikimedia.org/T342578) [13:15:28] (03CR) 10Arnaudb: [C: 03+1] filtered_tables.txt: Add new column [puppet] - 10https://gerrit.wikimedia.org/r/998870 (https://phabricator.wikimedia.org/T356988) (owner: 10Marostegui) [13:15:46] (03CR) 10Btullis: Add icinga configuration to notify the Data Platform SRE team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/998838 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [13:15:58] (03CR) 10Ladsgroup: [C: 03+1] filtered_tables.txt: Add new column [puppet] - 10https://gerrit.wikimedia.org/r/998870 (https://phabricator.wikimedia.org/T356988) (owner: 10Marostegui) [13:16:52] (03CR) 10Marostegui: [C: 03+2] filtered_tables.txt: Add new column [puppet] - 10https://gerrit.wikimedia.org/r/998870 (https://phabricator.wikimedia.org/T356988) (owner: 10Marostegui) [13:21:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:21:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:24:06] (03CR) 10Brouberol: [C: 03+1] "Looks good, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/998842 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [13:24:15] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [13:24:44] (03CR) 10Brouberol: [C: 03+2] Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: 10Btullis) [13:27:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P56537 and previous config saved to /var/cache/conftool/dbconfig/20240208-132736-marostegui.json [13:31:38] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [13:31:40] (03PS1) 10Muehlenhoff: Remove puppetmaster2003 from Puppet 5 setup [puppet] - 10https://gerrit.wikimedia.org/r/998894 (https://phabricator.wikimedia.org/T356991) [13:31:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [13:33:23] (03CR) 10Muehlenhoff: "jhat" [puppet] - 10https://gerrit.wikimedia.org/r/998894 (https://phabricator.wikimedia.org/T356991) (owner: 10Muehlenhoff) [13:34:45] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:wmcs::cloudgw: do not NAT traffic to cloud-internal networks [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [13:35:10] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudweb2002-dev.wikimedia.org with reason: host reimage [13:35:51] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:37:03] (03PS3) 10Muehlenhoff: ulogd: Make class ensurable [puppet] - 10https://gerrit.wikimedia.org/r/995213 (https://phabricator.wikimedia.org/T356174) [13:37:24] (03PS4) 10Muehlenhoff: ulogd: Make class ensurable [puppet] - 10https://gerrit.wikimedia.org/r/995213 (https://phabricator.wikimedia.org/T356174) [13:37:50] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudweb2002-dev.wikimedia.org with reason: host reimage [13:38:06] (03CR) 10Jelto: [C: 03+2] Update buildkitd image references [puppet] - 10https://gerrit.wikimedia.org/r/998493 (https://phabricator.wikimedia.org/T356418) (owner: 10Ahmon Dancy) [13:38:23] (03CR) 10Jelto: [C: 03+2] Revert "Temporarily enable Dockerfile frontend on trusted runners" [puppet] - 10https://gerrit.wikimedia.org/r/998495 (https://phabricator.wikimedia.org/T356418) (owner: 10Ahmon Dancy) [13:40:27] (03CR) 10Filippo Giunchedi: [C: 03+1] Add icinga configuration to notify the Data Platform SRE team [puppet] - 10https://gerrit.wikimedia.org/r/998838 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [13:41:42] (03CR) 10Filippo Giunchedi: [C: 03+2] nrpe: cleanup check_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/998823 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [13:41:52] (03PS2) 10Filippo Giunchedi: nrpe: cleanup check_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/998823 (https://phabricator.wikimedia.org/T337831) [13:42:22] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] nrpe: cleanup check_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/998823 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [13:42:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T355609)', diff saved 
to https://phabricator.wikimedia.org/P56538 and previous config saved to /var/cache/conftool/dbconfig/20240208-134243-marostegui.json [13:42:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [13:42:48] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [13:42:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [13:45:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [13:46:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [13:46:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] "This is a noop, merging to get it out of my review queue" [deployment-charts] - 10https://gerrit.wikimedia.org/r/998797 (owner: 10Alexandros Kosiaris) [13:47:29] (03Merged) 10jenkins-bot: modules.mesh: Sort based on name+version [deployment-charts] - 10https://gerrit.wikimedia.org/r/998797 (owner: 10Alexandros Kosiaris) [13:47:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks for the +1. Merging alongside the dependent change." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/991369 (owner: 10Alexandros Kosiaris) [13:47:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] base.meta: Remove dependency on the mesh module ("copy" change) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991368 (owner: 10Alexandros Kosiaris) [13:48:37] (03Merged) 10jenkins-bot: base.meta: Remove dependency on the mesh module ("copy" change) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991368 (owner: 10Alexandros Kosiaris) [13:48:42] (03Merged) 10jenkins-bot: base.meta: Remove dependency on the mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/991369 (owner: 10Alexandros Kosiaris) [13:49:04] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/998798 (owner: 10Alexandros Kosiaris) [13:49:13] (03CR) 10CI reject: [V: 04-1] modules: Bump usages of base.meta to 2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/998798 (owner: 10Alexandros Kosiaris) [13:49:18] (03PS1) 10Majavah: wikitech: remove python-mysqldb [puppet] - 10https://gerrit.wikimedia.org/r/998912 (https://phabricator.wikimedia.org/T356966) [13:50:28] (03PS3) 10Alexandros Kosiaris: modules: Bump usages of base.meta to 2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/998798 [13:50:46] (03CR) 10Muehlenhoff: ulogd: Make class ensurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/995213 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff) [13:51:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [13:51:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [13:51:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T355609)', diff saved to https://phabricator.wikimedia.org/P56539 and previous config saved to 
/var/cache/conftool/dbconfig/20240208-135142-marostegui.json [13:51:47] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [13:54:33] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:57:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on 7 hosts with reason: due for decomm [13:57:26] (03PS1) 10Majavah: wikitech: use php::extension for LDAP [puppet] - 10https://gerrit.wikimedia.org/r/998921 (https://phabricator.wikimedia.org/T356966) [13:57:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on 7 hosts with reason: due for decomm [13:57:45] 10SRE-swift-storage: Q3 ms backend refresh work - https://phabricator.wikimedia.org/T353149 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=300e0d79-b255-4dee-bbb8-2f1bda51d32f) set by mvernon@cumin2002 for 8 days, 0:00:00 on 7 host(s) and their services with reason: due for decomm ` ms-be[10... 
[13:58:16] (03CR) 10Btullis: [C: 03+2] Add icinga configuration to notify the Data Platform SRE team [puppet] - 10https://gerrit.wikimedia.org/r/998838 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [13:58:24] !log disable puppet and stop swift on ms-be10[44-50] T353149 [13:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:28] T353149: Q3 ms backend refresh work - https://phabricator.wikimedia.org/T353149 [13:59:30] (03PS1) 10Filippo Giunchedi: profile: absent check_systemd_state [puppet] - 10https://gerrit.wikimedia.org/r/998924 (https://phabricator.wikimedia.org/T332764) [13:59:33] (JobUnavailable) resolved: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T1400). [14:00:04] MatmaRex, Dreamy_Jazz, and toni_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:16] hi [14:00:26] (03CR) 10Muehlenhoff: [C: 03+1] "Seems fine to go away" [puppet] - 10https://gerrit.wikimedia.org/r/998912 (https://phabricator.wikimedia.org/T356966) (owner: 10Majavah) [14:01:24] (03PS3) 10Btullis: Onboard the data-platform-sre team to Alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/989900 (https://phabricator.wikimedia.org/T342578) (owner: 10Bking) [14:02:00] \o [14:02:02] (03CR) 10Majavah: [C: 03+2] wikitech: remove python-mysqldb [puppet] - 10https://gerrit.wikimedia.org/r/998912 (https://phabricator.wikimedia.org/T356966) (owner: 10Majavah) [14:02:04] 👋 [14:02:36] I can self serve my patches. [14:03:11] hey. I'm reimaging cloudweb2002-dev.wikimedia.org, in case that fails in the sync phase please ignore. thanks and sorry. [14:03:31] Thanks for the heads up. [14:04:27] MatmaRex: Just to confirm, you don't have deployment access? [14:04:42] nope [14:04:45] If not, I can deploy for you. [14:05:04] that'd be nice, thanks [14:05:47] Can I be added to the relevant task so I can see why this patch is needed? [14:05:55] That is https://phabricator.wikimedia.org/T356884 [14:07:08] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall though" [puppet] - 10https://gerrit.wikimedia.org/r/989900 (https://phabricator.wikimedia.org/T342578) (owner: 10Bking) [14:07:31] MatmaRex: For my above question. [14:07:43] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:07:48] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:07:53] (03CR) 10Muehlenhoff: [C: 03+1] wikitech: use php::extension for LDAP [puppet] - 10https://gerrit.wikimedia.org/r/998921 (https://phabricator.wikimedia.org/T356966) (owner: 10Majavah) [14:08:01] Dreamy_Jazz: done [14:08:40] Thanks. 
[14:09:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/DiscussionTools] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998454 (https://phabricator.wikimedia.org/T356884) (owner: 10Bartosz Dziewoński) [14:09:54] (03PS2) 10Btullis: Add a new Icinga contact group for team-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/998796 (https://phabricator.wikimedia.org/T342578) [14:09:56] (03PS3) 10Btullis: Update default role contactgroups from analytics to team-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/998826 (https://phabricator.wikimedia.org/T342578) [14:09:58] (03PS2) 10Btullis: Update contactgroups for analytics => team-data-platform icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/998842 (https://phabricator.wikimedia.org/T342578) [14:10:43] (03PS2) 10Majavah: wikitech: remove redundant php-ldap package [puppet] - 10https://gerrit.wikimedia.org/r/998921 (https://phabricator.wikimedia.org/T356966) [14:11:06] (03CR) 10Btullis: "I changed data-platform-irc to irc-data-platform for consistency with the other contacts." [puppet] - 10https://gerrit.wikimedia.org/r/998796 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [14:11:39] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Thanks for the renaming, couple of inline comments" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:11:50] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:12:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] benthos: upgrade to base.meta:2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/998852 (owner: 10Giuseppe Lavagetto) [14:12:40] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1340/co" [puppet] - 10https://gerrit.wikimedia.org/r/998921 (https://phabricator.wikimedia.org/T356966) (owner: 10Majavah) [14:12:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:13:15] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:13:34] !log eoghan@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM vrts1002.eqiad.wmnet [14:13:42] (03CR) 10Alexandros Kosiaris: [C: 03+1] echoserver: update to base.meta:2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/998853 (owner: 10Giuseppe Lavagetto) [14:15:08] (03PS1) 10Majavah: P:openstack: horizon: do not install mod_wsgi [puppet] - 10https://gerrit.wikimedia.org/r/998926 (https://phabricator.wikimedia.org/T356966) [14:16:21] (03Merged) 10jenkins-bot: Parser: Fix the main loop getting stuck on some signatures [extensions/DiscussionTools] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998454 (https://phabricator.wikimedia.org/T356884) (owner: 10Bartosz Dziewoński) [14:16:23] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/998921 (https://phabricator.wikimedia.org/T356966) (owner: 10Majavah) [14:16:34] (03CR) 10Majavah: [V: 03+1 C: 03+2] wikitech: remove redundant php-ldap package [puppet] - 10https://gerrit.wikimedia.org/r/998921 (https://phabricator.wikimedia.org/T356966) (owner: 10Majavah) [14:16:36] (03CR) 10Majavah: [V: 03+1] "PCC 
SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1341/co" [puppet] - 10https://gerrit.wikimedia.org/r/998926 (https://phabricator.wikimedia.org/T356966) (owner: 10Majavah) [14:16:38] (03CR) 10Dreamy Jazz: [C: 03+2] Parser: Fix the main loop getting stuck on some signatures [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998453 (https://phabricator.wikimedia.org/T356884) (owner: 10Bartosz Dziewoński) [14:16:45] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:998454|Parser: Fix the main loop getting stuck on some signatures (T356884)]] [14:17:34] (03PS1) 10Muehlenhoff: Pass the ensure parameter to the Ferm logging class [puppet] - 10https://gerrit.wikimedia.org/r/998927 (https://phabricator.wikimedia.org/T356174) [14:17:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995213 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff) [14:18:17] !log dreamyjazz@deploy2002 dreamyjazz and matmarex: Backport for [[gerrit:998454|Parser: Fix the main loop getting stuck on some signatures (T356884)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:18:51] MatmaRex: Please test [14:19:02] On a group0 wiki [14:19:05] looking [14:19:39] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudweb2002-dev.wikimedia.org with OS bullseye [14:19:51] Dreamy_Jazz: looks good [14:19:57] !log dreamyjazz@deploy2002 dreamyjazz and matmarex: Continuing with sync [14:20:00] Thanks. 
[14:20:25] (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:21:45] (03CR) 10CI reject: [V: 04-1] Pass the ensure parameter to the Ferm logging class [puppet] - 10https://gerrit.wikimedia.org/r/998927 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff) [14:23:05] (03Merged) 10jenkins-bot: Parser: Fix the main loop getting stuck on some signatures [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998453 (https://phabricator.wikimedia.org/T356884) (owner: 10Bartosz Dziewoński) [14:23:40] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) >>! In T349619#9521720, @Volans wrote: > We could either catch the exception and retry or acquire a lock for all puppetserver ca operatio... 
[14:25:25] (SystemdUnitFailed) firing: (5) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:26:21] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:998454|Parser: Fix the main loop getting stuck on some signatures (T356884)]] (duration: 09m 36s) [14:26:54] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:998453|Parser: Fix the main loop getting stuck on some signatures (T356884)]] [14:28:25] !log dreamyjazz@deploy2002 dreamyjazz and matmarex: Backport for [[gerrit:998453|Parser: Fix the main loop getting stuck on some signatures (T356884)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:28:26] MatmaRex: Please test on group1 or group2. Thanks. [14:28:50] !log eoghan@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM vrts1002.eqiad.wmnet [14:29:20] Dreamy_Jazz: also looks good [14:29:26] Thanks. [14:29:28] !log dreamyjazz@deploy2002 dreamyjazz and matmarex: Continuing with sync [14:29:44] thank you for deploying :) [14:29:52] Np. 
[14:30:13] (03PS1) 10Volans: cloud sso project: remove debmonitor client [puppet] - 10https://gerrit.wikimedia.org/r/998929 [14:30:26] (SystemdUnitFailed) firing: (41) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:30:55] (03CR) 10Muehlenhoff: [C: 03+1] cloud sso project: remove debmonitor client [puppet] - 10https://gerrit.wikimedia.org/r/998929 (owner: 10Volans) [14:31:27] (03PS1) 10Muehlenhoff: Make sretest1001 a Cumin node for a test [puppet] - 10https://gerrit.wikimedia.org/r/998930 (https://phabricator.wikimedia.org/T356174) [14:31:42] (03PS2) 10Muehlenhoff: Make sretest1001 a Cumin node for a test [puppet] - 10https://gerrit.wikimedia.org/r/998930 (https://phabricator.wikimedia.org/T356174) [14:31:44] toni_: Will you be able to test your change? [14:32:17] yes I can! [14:32:37] (03PS4) 10Krinkle: mediawiki: Extend /portals max-age from 24h to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/817409 (https://phabricator.wikimedia.org/T313881) [14:32:54] (03CR) 10Ssingh: [C: 03+1] "Thanks, that works for us!" [puppet] - 10https://gerrit.wikimedia.org/r/998431 (https://phabricator.wikimedia.org/T355870) (owner: 10Clément Goubert) [14:33:05] (03CR) 10Volans: [C: 03+2] cloud sso project: remove debmonitor client [puppet] - 10https://gerrit.wikimedia.org/r/998929 (owner: 10Volans) [14:33:09] Great. I will do your config change before my backports. 
[14:33:52] (03CR) 10Dreamy Jazz: [C: 03+2] Add edit_interaction stream config for iOS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998416 (https://phabricator.wikimedia.org/T355265) (owner: 10Tsevener) [14:34:38] (03Merged) 10jenkins-bot: Add edit_interaction stream config for iOS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998416 (https://phabricator.wikimedia.org/T355265) (owner: 10Tsevener) [14:34:50] thanks [14:35:23] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:998453|Parser: Fix the main loop getting stuck on some signatures (T356884)]] (duration: 08m 29s) [14:35:25] (SystemdUnitFailed) firing: (53) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:47] (03PS2) 10Muehlenhoff: Pass the ensure parameter to the Ferm logging class [puppet] - 10https://gerrit.wikimedia.org/r/998927 (https://phabricator.wikimedia.org/T356174) [14:35:52] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:998416|Add edit_interaction stream config for iOS (T355265)]] [14:35:57] T355265: [L] Create schema docs and instrumentation plan for native editing - https://phabricator.wikimedia.org/T355265 [14:37:05] (03CR) 10CI reject: [V: 04-1] Pass the ensure parameter to the Ferm logging class [puppet] - 10https://gerrit.wikimedia.org/r/998927 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff) [14:37:29] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Jhancock.wm) @cmooney the SFP failed. I've replaced it and it looks to be up now. 
[14:37:31] !log dreamyjazz@deploy2002 tsev and dreamyjazz: Backport for [[gerrit:998416|Add edit_interaction stream config for iOS (T355265)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:37:33] toni_: Please test. [14:37:52] 10SRE, 10ops-eqiad, 10Cassandra, 10decommission-hardware: Decommission sessionstore100[1-3] - https://phabricator.wikimedia.org/T356719 (10Jclark-ctr) a:03VRiley-WMF [14:39:11] looks good [14:39:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:36] (03PS2) 10Effie Mouzeli: [BETA HACK] confd: Fix confd hostname [puppet] - 10https://gerrit.wikimedia.org/r/941478 (owner: 10Krinkle) [14:39:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T355609)', diff saved to https://phabricator.wikimedia.org/P56540 and previous config saved to /var/cache/conftool/dbconfig/20240208-143951-marostegui.json [14:39:55] Thanks. 
[14:39:58] !log dreamyjazz@deploy2002 tsev and dreamyjazz: Continuing with sync [14:40:07] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [14:40:26] (SystemdUnitFailed) firing: (68) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:41:35] (03CR) 10Dreamy Jazz: [C: 03+2] MediaModerationImageContentsLookup: use proxied HTTP request to generate file [extensions/MediaModeration] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998704 (https://phabricator.wikimedia.org/T356047) (owner: 10Dreamy Jazz) [14:42:07] (03CR) 10Alexandros Kosiaris: [C: 04-1] mcrouter: add chart (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:42:12] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:44:05] (03Merged) 10jenkins-bot: MediaModerationImageContentsLookup: use proxied HTTP request to generate file [extensions/MediaModeration] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998704 (https://phabricator.wikimedia.org/T356047) (owner: 10Dreamy Jazz) [14:44:15] (03PS2) 10Majavah: P:openstack: horizon: do not install mod_wsgi [puppet] - 10https://gerrit.wikimedia.org/r/998926 (https://phabricator.wikimedia.org/T356966) [14:44:17] (03PS1) 10Majavah: openstack: horizon: fix default policy path [puppet] - 10https://gerrit.wikimedia.org/r/998938 (https://phabricator.wikimedia.org/T341640) [14:44:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:45:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:45:26] (SystemdUnitFailed) firing: (69) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:46:05] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:998416|Add edit_interaction stream config for iOS (T355265)]] (duration: 10m 12s) [14:46:10] T355265: [L] Create schema docs and instrumentation plan for native editing - https://phabricator.wikimedia.org/T355265 [14:46:54] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:998704|MediaModerationImageContentsLookup: use proxied HTTP request to generate file (T356047)]] [14:46:58] T356047: Attempt to generate thumbnail by requesting URL - https://phabricator.wikimedia.org/T356047 [14:48:24] thanks Dreamy_Jazz [14:48:27] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:998704|MediaModerationImageContentsLookup: use proxied HTTP request to generate file (T356047)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:48:31] No problem [14:48:38] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [14:49:42] (03CR) 10Btullis: [C: 03+2] Add a new Icinga contact group for team-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/998796 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [14:50:26] (SystemdUnitFailed) firing: (56) prometheus-phpfpm-statustext-textfile.service Failed on mw1357:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:43] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:998704|MediaModerationImageContentsLookup: use proxied HTTP request to generate file 
(T356047)]] (duration: 07m 49s) [14:54:47] T356047: Attempt to generate thumbnail by requesting URL - https://phabricator.wikimedia.org/T356047 [14:54:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P56541 and previous config saved to /var/cache/conftool/dbconfig/20240208-145457-marostegui.json [14:55:16] !log Running `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=testwiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-testwiki-sleep-30-no-render-now.txt` [14:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:26] (SystemdUnitFailed) firing: (68) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:55:41] (SystemdUnitFailed) firing: (68) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:01] Will go slightly over the window as I want to test that the backport works using that mwscript [14:59:33] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:10] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.243 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:00:26] (SystemdUnitFailed) resolved: (59) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:00:54] (03PS1) 10Clément Goubert: eventstreams: Raise memory limit to 1100Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/998945 (https://phabricator.wikimedia.org/T357005) [15:03:25] !log dbmaint Schema change on s1@codfw T356988 [15:03:27] !log dbmaint Schema change on s2@codfw T356988 [15:03:29] !log dbmaint Schema change on s6@codfw T356988 [15:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:30] T356988: Add gbw_target_central_id to global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T356988 [15:03:30] !log dbmaint Schema change on s8@codfw T356988 [15:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:01] !log testwiki scan finished [15:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:05] (03CR) 10Btullis: [C: 03+2] Update default role contactgroups from analytics to team-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/998826 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [15:04:54] (03PS3) 10Muehlenhoff: Pass the ensure parameter to the Ferm logging class [puppet] - 10https://gerrit.wikimedia.org/r/998927 (https://phabricator.wikimedia.org/T356174) [15:05:04] !log Stopped mediamoderation scanning script for commonswiki [15:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:10] !log dbmaint (retroactive logging) Schema change on s7@codfw T356987 [15:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:14] T356987: Add gb_target_central_id to the 
globalblocks table on the centralauth DB - https://phabricator.wikimedia.org/T356987 [15:05:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998703 (https://phabricator.wikimedia.org/T356047) (owner: 10Dreamy Jazz) [15:06:10] (03CR) 10CI reject: [V: 04-1] Pass the ensure parameter to the Ferm logging class [puppet] - 10https://gerrit.wikimedia.org/r/998927 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff) [15:07:54] (03Merged) 10jenkins-bot: MediaModerationImageContentsLookup: use proxied HTTP request to generate file [extensions/MediaModeration] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998703 (https://phabricator.wikimedia.org/T356047) (owner: 10Dreamy Jazz) [15:08:18] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:998703|MediaModerationImageContentsLookup: use proxied HTTP request to generate file (T356047)]] [15:08:22] T356047: Attempt to generate thumbnail by requesting URL - https://phabricator.wikimedia.org/T356047 [15:08:28] !log dbmaint Schema change on s7@codfw T356988 [15:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:31] T356988: Add gbw_target_central_id to global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T356988 [15:09:48] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:998703|MediaModerationImageContentsLookup: use proxied HTTP request to generate file (T356047)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:10:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P56542 and previous config saved to /var/cache/conftool/dbconfig/20240208-151005-marostegui.json [15:10:14] (03PS4) 10Muehlenhoff: Pass the ensure parameter to the Ferm logging class [puppet] - 
10https://gerrit.wikimedia.org/r/998927 (https://phabricator.wikimedia.org/T356174) [15:10:24] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51451 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:10:37] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [15:11:38] (03PS1) 10Muehlenhoff: Failover idp to 1002 [dns] - 10https://gerrit.wikimedia.org/r/998949 [15:11:53] (03CR) 10CI reject: [V: 04-1] Pass the ensure parameter to the Ferm logging class [puppet] - 10https://gerrit.wikimedia.org/r/998927 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff) [15:13:51] !log dbmaint Schema change on s5@codfw T356988 [15:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:55] T356988: Add gbw_target_central_id to global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T356988 [15:15:24] (03PS1) 10Volans: cloud sso project: decommission sso-debmon [puppet] - 10https://gerrit.wikimedia.org/r/998950 [15:15:55] (03CR) 10Muehlenhoff: [C: 03+2] Failover idp to 1002 [dns] - 10https://gerrit.wikimedia.org/r/998949 (owner: 10Muehlenhoff) [15:16:15] (03PS5) 10Muehlenhoff: Pass the ensure parameter to the Ferm logging class [puppet] - 10https://gerrit.wikimedia.org/r/998927 (https://phabricator.wikimedia.org/T356174) [15:16:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:16:55] (SystemdUnitFailed) firing: (17) prometheus-phpfpm-statustext-textfile.service Failed on mw1351:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:17:01] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:998703|MediaModerationImageContentsLookup: use proxied HTTP request to generate file (T356047)]] (duration: 08m 42s) [15:17:05] T356047: Attempt to generate thumbnail by requesting URL - https://phabricator.wikimedia.org/T356047 [15:17:28] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [15:17:32] (03CR) 10CI reject: [V: 04-1] Pass the ensure parameter to the Ferm logging class [puppet] - 10https://gerrit.wikimedia.org/r/998927 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff) [15:17:37] !log Running `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30-no-render-now.txt` on a tmux session - See https://wikitech.wikimedia.org/wiki/MediaModeration [15:17:38] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye [15:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:31] 10SRE, 10ops-codfw: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357015 (10ops-monitoring-bot) [15:20:03] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10ssingh) moss-be* hosts should be @MatthewVernon unless I am 
mistaken, in which case, please accept my apologies in advance :) [15:20:40] (SystemdUnitFailed) resolved: (23) prometheus-phpfpm-statustext-textfile.service Failed on mw1351:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:50] !log Afternoon backport window done [15:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:34] (03PS5) 10Muehlenhoff: ulogd: Make class ensurable [puppet] - 10https://gerrit.wikimedia.org/r/995213 (https://phabricator.wikimedia.org/T356174) [15:21:35] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) >>! In T355544#9525282, @ssingh wrote: > moss-be* hosts should be @MatthewVernon unless I am mistaken, in which case, please accept my... [15:22:56] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995213 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff) [15:23:44] CUSTOM - Host an-tool1005 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [15:24:18] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:25:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T355609)', diff saved to https://phabricator.wikimedia.org/P56543 and previous config saved to /var/cache/conftool/dbconfig/20240208-152511-marostegui.json [15:25:16] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [15:26:20] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:26:39] (03PS6) 10Muehlenhoff: Pass the ensure parameter to the Ferm logging class [puppet] - 10https://gerrit.wikimedia.org/r/998927 (https://phabricator.wikimedia.org/T356174) [15:27:44] CUSTOM - Memcached on an-tool1005 is OK: TCP OK - 0.001 second response time on 10.64.36.117 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [15:31:47] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357015 (10ABran-WMF) p:05Triage→03Medium [15:31:51] (03PS2) 10Majavah: Provide a standalone bookworm-web container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/991595 (https://phabricator.wikimedia.org/T355231) [15:31:57] (03CR) 10Majavah: [C: 03+2] Provide a standalone bookworm-web container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/991595 (https://phabricator.wikimedia.org/T355231) (owner: 10Majavah) [15:32:26] (03Merged) 10jenkins-bot: Provide a standalone bookworm-web container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/991595 (https://phabricator.wikimedia.org/T355231) (owner: 10Majavah) [15:35:24] 10SRE-tools, 10Infrastructure-Foundations, 10Toolforge, 10cloud-services-team: spicerack: introduce GridEngine controller - https://phabricator.wikimedia.org/T300032 (10taavi) 05Stalled→03Declined [15:36:43] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357015 (10ABran-WMF) indeed a disk is reported missing: ` arnaudb@db2194:~ $ sudo /usr/local/lib/nagios/plugins/get-raid-status-perccli communication: 0 OK | controller: 1 Needs Attention | physical_disk: 0 OK | virtual_... 
[15:38:21] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [15:39:19] !log dbmaint Schema change on s4@codfw T356988 [15:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:23] T356988: Add gbw_target_central_id to global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T356988 [15:39:30] (03PS24) 10Brouberol: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [15:40:22] Need to perform a bug fix for the backports that I deployed for MediaModeration. [15:40:47] Will do that now as there isn't a clashing window. [15:40:52] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10MatthewVernon) >>! In T355544#9525282, @ssingh wrote: > moss-be* hosts should be @MatthewVernon unless I am mistaken, in which case, please acce... 
[15:41:16] (03PS1) 10Dreamy Jazz: Follow-up: MediaModerationImageContentsLookup: use proxied HTTP request to generate file [extensions/MediaModeration] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998966 (https://phabricator.wikimedia.org/T356047) [15:41:33] (03PS1) 10Dreamy Jazz: Follow-up: MediaModerationImageContentsLookup: use proxied HTTP request to generate file [extensions/MediaModeration] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998967 (https://phabricator.wikimedia.org/T356047) [15:41:39] !log dbmaint Schema change on s3@codfw T356988 [15:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:46] (03CR) 10Dreamy Jazz: [C: 03+2] Follow-up: MediaModerationImageContentsLookup: use proxied HTTP request to generate file [extensions/MediaModeration] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998966 (https://phabricator.wikimedia.org/T356047) (owner: 10Dreamy Jazz) [15:42:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998967 (https://phabricator.wikimedia.org/T356047) (owner: 10Dreamy Jazz) [15:43:23] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [15:44:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998967 (https://phabricator.wikimedia.org/T356047) (owner: 10Dreamy Jazz) [15:44:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998966 (https://phabricator.wikimedia.org/T356047) (owner: 10Dreamy Jazz) [15:44:30] (03PS1) 10Scott French: [Exercise - DNM] InitialiseSettings: set max upload 
size to 1 GiB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998957 [15:44:32] (03PS1) 10Scott French: [Exercise - DNM] CommonSettings: boost memory_limit 2x in k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998958 [15:44:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [15:44:34] (03PS1) 10Scott French: [Exercise - DNM] ProductionServices: add foobar service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998959 [15:44:36] (03PS1) 10Scott French: [Exercise - DNM] depool pc1011 as pc1 and replace with spare pc1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998960 [15:44:38] (03PS1) 10Scott French: [Exercise - DNM] Add account creation throttle exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998961 [15:44:40] (03PS1) 10Scott French: [Exercise - DNM] Add itwiki to flaggedrevs dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998962 [15:44:42] (03PS1) 10Scott French: [Exercise - DNM] Add a new pool to wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998963 [15:44:44] (03PS1) 10Scott French: [Exercise - DNM] Set all wikis to read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998964 [15:44:46] (03PS1) 10Scott French: [Exercise - DNM] Wire shellbox instance for 'video' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998965 [15:44:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [15:44:48] (03Merged) 10jenkins-bot: Follow-up: MediaModerationImageContentsLookup: use proxied HTTP request to generate file [extensions/MediaModeration] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998966 (https://phabricator.wikimedia.org/T356047) (owner: 10Dreamy Jazz) [15:44:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T355609)', diff saved to https://phabricator.wikimedia.org/P56544 
and previous config saved to /var/cache/conftool/dbconfig/20240208-154452-marostegui.json [15:44:57] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [15:45:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/998926 (https://phabricator.wikimedia.org/T356966) (owner: 10Majavah) [15:45:27] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] CommonSettings: boost memory_limit 2x in k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998958 (owner: 10Scott French) [15:45:39] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] ProductionServices: add foobar service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998959 (owner: 10Scott French) [15:45:41] (03Merged) 10jenkins-bot: Follow-up: MediaModerationImageContentsLookup: use proxied HTTP request to generate file [extensions/MediaModeration] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998967 (https://phabricator.wikimedia.org/T356047) (owner: 10Dreamy Jazz) [15:45:43] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] depool pc1011 as pc1 and replace with spare pc1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998960 (owner: 10Scott French) [15:45:51] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Add account creation throttle exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998961 (owner: 10Scott French) [15:45:53] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Add itwiki to flaggedrevs dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998962 (owner: 10Scott French) [15:46:08] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Add a new pool to wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998963 (owner: 10Scott French) [15:46:09] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:998967|Follow-up: MediaModerationImageContentsLookup: use proxied HTTP request to generate file (T356047)]], [[gerrit:998966|Follow-up: MediaModerationImageContentsLookup: use proxied HTTP 
request to generate file (T356047)]] [15:46:13] T356047: Attempt to generate thumbnail by requesting URL - https://phabricator.wikimedia.org/T356047 [15:46:14] (03CR) 10Majavah: [C: 03+2] P:openstack: horizon: do not install mod_wsgi [puppet] - 10https://gerrit.wikimedia.org/r/998926 (https://phabricator.wikimedia.org/T356966) (owner: 10Majavah) [15:46:20] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Set all wikis to read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998964 (owner: 10Scott French) [15:46:29] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Wire shellbox instance for 'video' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998965 (owner: 10Scott French) [15:47:01] !log Stopped mediamoderation scanning script [15:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:07] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [15:47:11] (03PS1) 10Giuseppe Lavagetto: flink-app: update modules to recent versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/998986 [15:47:13] (03PS1) 10Giuseppe Lavagetto: ipoid: upgrade to new modules versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/998987 [15:47:15] (03PS1) 10Giuseppe Lavagetto: python-webapp: update module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/998988 [15:47:16] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup... 
[15:47:20] (03PS1) 10Giuseppe Lavagetto: spark-history: fix package.json, update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/998989 [15:47:26] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [15:47:34] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye [15:47:39] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:998967|Follow-up: MediaModerationImageContentsLookup: use proxied HTTP request to generate file (T356047)]], [[gerrit:998966|Follow-up: MediaModerationImageContentsLookup: use proxied HTTP request to generate file (T356047)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:47:42] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [15:47:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:48:04] (03PS2) 10Majavah: openstack: horizon: fix default policy path [puppet] - 10https://gerrit.wikimedia.org/r/998938 (https://phabricator.wikimedia.org/T341640) [15:48:24] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867 (10cmooney) >>! In T355867#9498001, @MatthewVernon wrote: > Once complete I'll want to check the backends, but t... 
[15:48:29] !log Draining mw2377.codfw.wmnet mw2378.codfw.wmnet mw2381.codfw.wmnet mw2395.codfw.wmnet mw2291.codfw.wmnet mw2292.codfw.wmnet mw2293.codfw.wmnet mw2294.codfw.wmnet mw2295.codfw.wmnet mw2296.codfw.wmnet mw2297.codfw.wmnet - T355870 [15:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:33] T355870: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 [15:48:44] topranks: still scheduled for 1600 UTC B3 yeah? [15:48:57] Ah no not b3 [15:48:58] eh, A3 ? [15:49:02] a3 [15:49:04] mb [15:49:06] ha ok :) [15:49:15] wrong task, right hosts [15:49:18] yeah how is it looking? that massive list of mw hosts an issue? [15:49:25] nah we'll be ok [15:49:45] !log Draining mw2377.codfw.wmnet mw2378.codfw.wmnet mw2381.codfw.wmnet mw2395.codfw.wmnet mw2291.codfw.wmnet mw2292.codfw.wmnet mw2293.codfw.wmnet mw2294.codfw.wmnet mw2295.codfw.wmnet mw2296.codfw.wmnet mw2297.codfw.wmnet - T355862 [15:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:49] T355862: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 [15:50:05] !log taavi@cumin1002 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::codfw1dev::cloudweb [15:50:20] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10cmooney) >>! In T355862#9523604, @Marostegui wrote: > The databases are ready to be moved any time. Great, thanks! 
[15:50:25] (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mw2376:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:50:36] claime: great, I'll let you know when we're done [15:50:55] Should have put down something in the deployment calendar to block deployments though [15:51:07] (03PS1) 10Majavah: hieradata: migrate codfw1dev cloudweb to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/998991 [15:51:16] Dreamy_Jazz: do you still have patches to deploy? [15:51:21] I should be done shortly with my backport. [15:51:24] ack [15:51:43] Just waiting for php-fpm restarts and then I will be done with scap backport. [15:51:47] I'll wait to depool then [15:52:04] Great, that means I can just pooled=no instead of invalid and having to scap pull afterwards [15:52:28] (03CR) 10Majavah: [C: 03+2] hieradata: migrate codfw1dev cloudweb to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/998991 (owner: 10Majavah) [15:52:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:53:05] (03PS1) 10Jgiannelos: changeprop: Disable restbase/parsoid related rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/998992 (https://phabricator.wikimedia.org/T344945) [15:53:56] (03CR) 10Jgiannelos: [C: 04-1] "Blocking this patch until restbase defaults to having storage disabled for parsoid endpoints." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/998992 (https://phabricator.wikimedia.org/T344945) (owner: 10Jgiannelos) [15:54:13] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:998967|Follow-up: MediaModerationImageContentsLookup: use proxied HTTP request to generate file (T356047)]], [[gerrit:998966|Follow-up: MediaModerationImageContentsLookup: use proxied HTTP request to generate file (T356047)]] (duration: 08m 03s) [15:54:17] T356047: Attempt to generate thumbnail by requesting URL - https://phabricator.wikimedia.org/T356047 [15:54:26] Done. Over to you :) [15:54:32] ty! [15:54:40] I'll just be messing around with maintenance scripts now. [15:54:48] Nothing that should clash though. [15:55:21] hmm actually I may hit the depool threshold if I just depool, I'll inactive them [15:55:25] (SystemdUnitFailed) firing: (38) prometheus-phpfpm-statustext-textfile.service Failed on mw1365:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:55:40] !log cgoubert@cumin2002 conftool action : set/pooled=inactive; selector: name=(mw2379|mw2380|mw2382|mw2383|mw2384|mw2385|mw2386|mw2387|mw2388|mw2389|mw2390|mw2391|mw2392|mw2393|mw2394|mw2396|mw2397|mw2398|mw2399|mw2400|mw2298|mw2299|mw2300).* [15:55:49] !log Running `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=testwiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-testwiki-sleep-30-no-render-now.txt` [15:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:08] !log Depooled mw2379|mw2380|mw2382|mw2383|mw2384|mw2385|mw2386|mw2387|mw2388|mw2389|mw2390|mw2391|mw2392|mw2393|mw2394|mw2396|mw2397|mw2398|mw2399|mw2400|mw2298|mw2299|mw2300 - T355862 [15:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:12] T355862: Migrate 
servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 [15:56:17] topranks: all good on my end [15:56:31] claime: super, thank you! [15:56:33] minus 1 or 2 minutes to drain remaining connections [15:56:40] sure [15:56:48] we won't start till the top of the hour anyway [15:57:13] ack [15:57:25] (03PS25) 10Brouberol: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [15:57:31] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::codfw1dev::cloudweb [15:57:33] !log Running `foreachwikindblist group0.dblist extensions/MediaModeration/maintenance/scanFilesInScanTable.php --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-group0-sleep-30-thumbor.txt` [15:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:37] !log moving Netbox server uplinks from asw-a3-codfw to lsw1-a3-codfw to prep config for server moves T355862 [15:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:10] (03CR) 10Bking: [C: 03+1] service: register superset and superset-next under ingress [puppet] - 10https://gerrit.wikimedia.org/r/997857 (https://phabricator.wikimedia.org/T356483) (owner: 10Brouberol) [15:58:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 1%: After schema change', diff saved to https://phabricator.wikimedia.org/P56545 and previous config saved to /var/cache/conftool/dbconfig/20240208-155833-root.json [15:58:42] (03PS26) 10Brouberol: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [16:00:26] (SystemdUnitFailed) resolved: (44) prometheus-phpfpm-statustext-textfile.service Failed on mw1365:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [16:02:22] (03CR) 10Btullis: [C: 03+2] Update contactgroups for analytics => team-data-platform icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/998842 (https://phabricator.wikimedia.org/T342578) (owner: 10Btullis) [16:02:25] (03CR) 10Volans: [C: 03+2] cloud sso project: decommission sso-debmon [puppet] - 10https://gerrit.wikimedia.org/r/998950 (owner: 10Volans) [16:03:20] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [16:03:32] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup... 
[16:03:35] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [16:03:36] !log installing pillow security updates [16:03:39] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [16:03:44] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye [16:03:49] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - **The reimage fa... [16:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:06] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [16:04:27] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye [16:04:31] (03CR) 10Dzahn: [C: 03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/998848 (owner: 10Muehlenhoff) [16:05:17] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [16:06:19] (03PS1) 10Cwhite: profile: use undef default for active and standby host defaults [puppet] - 10https://gerrit.wikimedia.org/r/998285 (https://phabricator.wikimedia.org/T352665) [16:06:37] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (10jhathaway) This set of steps looks correct to me, based on what @jbond did on T345067 [16:07:16] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Correct, although arguably there's also the limit in initalisesettings-labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998957 (owner: 10Scott French) [16:07:36] (03CR) 10Dzahn: [C: 03+1] Remove puppetmaster2003 from Puppet 5 setup [puppet] - 10https://gerrit.wikimedia.org/r/998894 (https://phabricator.wikimedia.org/T356991) (owner: 10Muehlenhoff) [16:07:44] !log Commencing server uplink moves from old switch to new in codfw rack A3 T355862 [16:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:53] T355862: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 [16:07:56] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Correct, minus the formatting issues 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998958 (owner: 10Scott French) [16:08:00] (03CR) 10Alexandros Kosiaris: "I don't even understand why it needs 1G tbh. +1ing cause it won't hurt, but see my comments in the task." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/998945 (https://phabricator.wikimedia.org/T357005) (owner: 10Clément Goubert) [16:08:04] (03CR) 10JHathaway: [C: 03+1] Remove puppetmaster2003 from Puppet 5 setup [puppet] - 10https://gerrit.wikimedia.org/r/998894 (https://phabricator.wikimedia.org/T356991) (owner: 10Muehlenhoff) [16:08:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] [Exercise - DNM] ProductionServices: add foobar service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998959 (owner: 10Scott French) [16:08:46] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:08:57] (03CR) 10Giuseppe Lavagetto: [C: 03+1] [Exercise - DNM] depool pc1011 as pc1 and replace with spare pc1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998960 (owner: 10Scott French) [16:09:16] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a3-codfw.mgmt with reason: server uplink migration codfw rack a3 [16:09:21] (03CR) 10Giuseppe Lavagetto: [C: 03+1] [Exercise - DNM] Add account creation throttle exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998961 (owner: 10Scott French) [16:09:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a3-codfw.mgmt with reason: server uplink migration codfw rack a3 [16:09:39] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 39 hosts with reason: Migrating servers in codfw rack A3 to lsw1-a3-codfw [16:09:39] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a24ae7f4-1952-434f-9ee8-3ff0973f1444) set by cmooney@cumin... 
[16:10:07] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [16:10:11] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10Dzahn) 05In progress→03Stalled @AndrewTavis_WMDE Ok, thanks for the update. Confirmed. Keeping open and just setting to stalled for the mome... [16:10:14] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 39 hosts with reason: Migrating servers in codfw rack A3 to lsw1-a3-codfw [16:10:15] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup... [16:10:21] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=06c4fbb3-382e-4660-b308-79bf9f5106d5) set by cmooney@cumin... 
[16:10:37] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "The exercise" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998962 (owner: 10Scott French) [16:11:04] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) a:03cchen [16:11:19] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) 05Open→03In progress [16:11:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for JTanner - https://phabricator.wikimedia.org/T356917 (10Dzahn) 05Open→03In progress [16:12:04] 10SRE, 10Traffic: PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - https://phabricator.wikimedia.org/T356951 (10LSobanski) [16:12:14] (03CR) 10Muehlenhoff: [C: 03+2] Remove puppetmaster2003 from Puppet 5 setup [puppet] - 10https://gerrit.wikimedia.org/r/998894 (https://phabricator.wikimedia.org/T356991) (owner: 10Muehlenhoff) [16:12:16] (03CR) 10Giuseppe Lavagetto: [C: 03+1] [Exercise - DNM] Add a new pool to wgObjectCaches (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998963 (owner: 10Scott French) [16:12:52] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "As I said, the correct way is to use etcd. Well done anyways." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998964 (owner: 10Scott French) [16:13:24] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "well done!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/998965 (owner: 10Scott French) [16:13:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P56546 and previous config saved to /var/cache/conftool/dbconfig/20240208-161338-root.json [16:14:21] (03CR) 10Andrew Bogott: [C: 03+1] "Ugh, this reminds me of my fruitless attempt to get the trove dashboard to include one of these :(" [puppet] - 10https://gerrit.wikimedia.org/r/998938 (https://phabricator.wikimedia.org/T341640) (owner: 10Majavah) [16:15:52] !log Running `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30-no-render-now.txt` on a tmux session - See https://wikitech.wikimedia.org/wiki/MediaModeration [16:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:21] (03CR) 10Alexandros Kosiaris: [C: 03+1] eventstreams: Raise memory limit to 1100Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/998945 (https://phabricator.wikimedia.org/T357005) (owner: 10Clément Goubert) [16:23:11] !log Server move completed codfw rack A3 T355862 [16:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:23] T355862: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 [16:23:56] (03CR) 10Majavah: [C: 03+2] openstack: horizon: fix default policy path [puppet] - 10https://gerrit.wikimedia.org/r/998938 (https://phabricator.wikimedia.org/T341640) (owner: 10Majavah) [16:26:02] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [16:26:10] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye [16:28:00] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10ssingh) [16:28:06] 10SRE-swift-storage, 10Commons, 10Structured-Data-Backlog, 10UploadWizard, 10Wikimedia-production-error: Uploadwizard sometimes fails "Internal error: Server failed to publish temporary file" - https://phabricator.wikimedia.org/T353871 (10Krinkle) [16:28:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P56547 and previous config saved to /var/cache/conftool/dbconfig/20240208-162843-root.json [16:30:12] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10ssingh) As discussed in [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/998431 | 998431 ]], Traffic will be taking care of `conf2004`, s... [16:31:08] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:31:27] (03CR) 10Ssingh: [C: 03+1] "When this patch is merged, please restart pybal.service in each of the LVS host in codfw." [puppet] - 10https://gerrit.wikimedia.org/r/998431 (https://phabricator.wikimedia.org/T355870) (owner: 10Clément Goubert) [16:31:34] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [16:31:39] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10hnowlan) Good catch! 
Unfortunately I'm still seeing the same PXE behaviour failing on boot [16:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [16:31:58] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup... [16:32:08] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:33:11] (03PS1) 10Hnowlan: kubernetes: move 5 mw hosts to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/998996 (https://phabricator.wikimedia.org/T351074) [16:36:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 5%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P56548 and previous config saved to /var/cache/conftool/dbconfig/20240208-163624-root.json [16:36:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 5%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P56549 and previous config saved to /var/cache/conftool/dbconfig/20240208-163642-root.json [16:36:44] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10cmooney) Work completed! No errors to report all working well. 
[16:37:01] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for asw-a-codfw,cr[1-2]-codfw,lsw1-a3-codfw.mgmt [16:37:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for asw-a-codfw,cr[1-2]-codfw,lsw1-a3-codfw.mgmt [16:39:16] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10Marostegui) Thanks - I am starting to repool the databases. [16:39:25] 10SRE, 10ops-eqiad, 10Cassandra, 10decommission-hardware: Decommission sessionstore100[1-3] - https://phabricator.wikimedia.org/T356719 (10VRiley-WMF) [16:40:16] 10SRE, 10Cloud-VPS, 10cloud-services-team: Restrict traffic from instances to private IPs on cloudgw level - https://phabricator.wikimedia.org/T350132 (10cmooney) 05Open→03Declined Closing this one, let's discuss on duplicate T356986 (sorry bout that!) [16:40:42] !log Uncordoning mw2377.codfw.wmnet mw2378.codfw.wmnet mw2381.codfw.wmnet mw2395.codfw.wmnet mw2291.codfw.wmnet mw2292.codfw.wmnet mw2293.codfw.wmnet mw2294.codfw.wmnet mw2295.codfw.wmnet mw2296.codfw.wmnet mw2297.codfw.wmnet - T355862 [16:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:47] T355862: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 [16:43:44] (03PS6) 10Majavah: openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) [16:43:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P56550 and previous config saved to /var/cache/conftool/dbconfig/20240208-164348-root.json [16:44:26] 10SRE, 10ops-eqiad, 10Cassandra, 10decommission-hardware: Decommission sessionstore100[1-3] - https://phabricator.wikimedia.org/T356719 (10VRiley-WMF) 
[16:47:44] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:48:04] !log Repooling mw2379|mw2380|mw2382|mw2383|mw2384|mw2385|mw2386|mw2387|mw2388|mw2389|mw2390|mw2391|mw2392|mw2393|mw2394|mw2396|mw2397|mw2398|mw2399|mw2400|mw2298|mw2299|mw2300 - T355862 [16:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:09] T355862: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 [16:48:24] !log cgoubert@cumin2002 conftool action : set/pooled=yes; selector: name=(mw2379|mw2380|mw2382|mw2383|mw2384|mw2385|mw2386|mw2387|mw2388|mw2389|mw2390|mw2391|mw2392|mw2393|mw2394|mw2396|mw2397|mw2398|mw2399|mw2400|mw2298|mw2299|mw2300).* [16:49:26] (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mw2385:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:51:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 10%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P56551 and previous config saved to /var/cache/conftool/dbconfig/20240208-165129-root.json [16:51:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [16:51:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 10%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P56552 and previous config saved to /var/cache/conftool/dbconfig/20240208-165147-root.json [16:53:30] (03PS1) 10Majavah: openstack: fix policy defaults on the live version too 
[puppet] - 10https://gerrit.wikimedia.org/r/999000 [16:54:14] (03PS2) 10Majavah: openstack: fix policy defaults on the live version too [puppet] - 10https://gerrit.wikimedia.org/r/999000 (https://phabricator.wikimedia.org/T356966) [16:54:25] (SystemdUnitFailed) resolved: prometheus-phpfpm-statustext-textfile.service Failed on mw2385:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:54:34] (03PS2) 10Scott French: [Exercise - DNM] CommonSettings: boost memory_limit 2x in k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998958 [16:54:36] (03PS2) 10Scott French: [Exercise - DNM] ProductionServices: add foobar service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998959 [16:54:38] (03PS2) 10Scott French: [Exercise - DNM] depool pc1011 as pc1 and replace with spare pc1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998960 [16:54:41] (03PS2) 10Scott French: [Exercise - DNM] Add account creation throttle exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998961 [16:54:43] (03PS2) 10Scott French: [Exercise - DNM] Add itwiki to flaggedrevs dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998962 [16:54:45] (03PS2) 10Scott French: [Exercise - DNM] Add a new pool to wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998963 [16:54:47] (03PS2) 10Scott French: [Exercise - DNM] Set all wikis to read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998964 [16:54:49] (03PS2) 10Scott French: [Exercise - DNM] Wire shellbox instance for 'video' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998965 [16:55:51] (03CR) 10Majavah: [C: 03+2] openstack: fix policy defaults on the live version too [puppet] - 10https://gerrit.wikimedia.org/r/999000 (https://phabricator.wikimedia.org/T356966) (owner: 10Majavah) [16:57:35] jhathaway, rzl: i don't see anything 
currently listed for the puppet req window and would like to move train to group1 if i'm not stepping on any toes here... [16:58:02] brennen: sounds good to me [16:58:32] thx, going ahead. [16:58:50] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] ProductionServices: add foobar service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998959 (owner: 10Scott French) [16:58:52] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] depool pc1011 as pc1 and replace with spare pc1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998960 (owner: 10Scott French) [16:58:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P56553 and previous config saved to /var/cache/conftool/dbconfig/20240208-165853-root.json [16:58:56] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Add account creation throttle exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998961 (owner: 10Scott French) [16:59:01] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Add itwiki to flaggedrevs dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998962 (owner: 10Scott French) [16:59:04] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Set all wikis to read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998964 (owner: 10Scott French) [16:59:06] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Add a new pool to wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998963 (owner: 10Scott French) [16:59:11] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Wire shellbox instance for 'video' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998965 (owner: 10Scott French) [17:00:05] jhathaway and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:51] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999001 (https://phabricator.wikimedia.org/T354435) [17:00:53] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999001 (https://phabricator.wikimedia.org/T354435) (owner: 10TrainBranchBot) [17:00:59] (03CR) 10Btullis: [C: 03+1] Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [17:01:12] !log train 1.42.0-wmf.17 (T354435): blockers resolved, rolling to group1 [17:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:26] T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435 [17:01:38] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999001 (https://phabricator.wikimedia.org/T354435) (owner: 10TrainBranchBot) [17:05:25] (03CR) 10Brouberol: [C: 03+2] Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [17:05:57] (03CR) 10Cwhite: [C: 03+2] profile: use undef default for active and standby host defaults [puppet] - 10https://gerrit.wikimedia.org/r/998285 (https://phabricator.wikimedia.org/T352665) (owner: 10Cwhite) [17:06:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 25%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P56554 and previous config saved to /var/cache/conftool/dbconfig/20240208-170634-root.json [17:06:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 25%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P56555 and previous config saved to /var/cache/conftool/dbconfig/20240208-170651-root.json [17:08:25] 
(SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mw1444:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:09:01] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.17 refs T354435 [17:09:17] T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435 [17:09:40] (SystemdUnitFailed) firing: (11) prometheus-phpfpm-statustext-textfile.service Failed on mw1353:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:13:25] (SystemdUnitFailed) resolved: (23) prometheus-phpfpm-statustext-textfile.service Failed on mw1353:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:13:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P56556 and previous config saved to /var/cache/conftool/dbconfig/20240208-171358-root.json [17:14:06] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357015 (10wiki_willy) a:03Jhancock.wm ++ @Jhancock.wm [17:15:54] !log brennen@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.17 refs T354435 (duration: 06m 52s) [17:15:58] T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435 [17:18:55] (SystemdUnitFailed) firing: (55) prometheus-phpfpm-statustext-textfile.service Failed on mw1353:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:41] (SystemdUnitFailed) resolved: (55) prometheus-phpfpm-statustext-textfile.service Failed on mw1353:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:41] (03PS4) 10Btullis: Onboard the data-platform-sre team to Alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/989900 (https://phabricator.wikimedia.org/T342578) (owner: 10Bking) [17:20:41] (03CR) 10Btullis: Onboard the data-platform-sre team to Alertmanager (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/989900 (https://phabricator.wikimedia.org/T342578) (owner: 10Bking) [17:21:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 50%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P56557 and previous config saved to /var/cache/conftool/dbconfig/20240208-172139-root.json [17:21:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 50%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P56558 and previous config saved to /var/cache/conftool/dbconfig/20240208-172156-root.json [17:28:42] 10SRE, 10ops-eqiad, 10Cassandra, 10decommission-hardware: Decommission sessionstore100[1-3] - https://phabricator.wikimedia.org/T356719 (10VRiley-WMF) 05Open→03Resolved [17:29:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P56559 and previous config saved to /var/cache/conftool/dbconfig/20240208-172902-root.json [17:31:05] (03PS22) 10BCornwall: Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) [17:31:53] (03CR) 10Andrew Bogott: "looks great! 
One minor doc request" [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah) [17:32:32] (03PS3) 10Scott French: [Exercise - DNM] ProductionServices: add foobar service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998959 [17:32:34] (03PS3) 10Scott French: [Exercise - DNM] depool pc1011 as pc1 and replace with spare pc1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998960 [17:32:36] (03PS3) 10Scott French: [Exercise - DNM] Add account creation throttle exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998961 [17:32:38] (03PS3) 10Scott French: [Exercise - DNM] Add itwiki to flaggedrevs dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998962 [17:32:40] (03PS3) 10Scott French: [Exercise - DNM] Add a new pool to wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998963 [17:32:42] (03PS3) 10Scott French: [Exercise - DNM] Set all wikis to read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998964 [17:32:44] (03PS3) 10Scott French: [Exercise - DNM] Wire shellbox instance for 'video' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998965 [17:33:41] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Add itwiki to flaggedrevs dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998962 (owner: 10Scott French) [17:33:59] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Add a new pool to wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998963 (owner: 10Scott French) [17:34:16] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Set all wikis to read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998964 (owner: 10Scott French) [17:34:18] (03CR) 10CI reject: [V: 04-1] [Exercise - DNM] Wire shellbox instance for 'video' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998965 (owner: 10Scott French) [17:35:51] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) 
- https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:36:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 75%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P56560 and previous config saved to /var/cache/conftool/dbconfig/20240208-173644-root.json [17:36:50] (03PS7) 10Majavah: openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) [17:37:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 75%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P56561 and previous config saved to /var/cache/conftool/dbconfig/20240208-173701-root.json [17:37:36] (03CR) 10Majavah: openstack: overhaul the floating IP updater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah) [17:38:04] jouncebot: next [17:38:04] In 0 hour(s) and 21 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T1800) [17:38:04] In 0 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T1800) [17:39:25] brennen: are you going to deploy in an hour 20 or so? 
[17:40:19] yeah, i'll go to all wikis at the usual time [17:40:20] (03CR) 10CI reject: [V: 04-1] openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah) [17:40:32] ...assuming nothing breaks before then [17:40:56] (03CR) 10BCornwall: Add module for ncmonitor (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [17:40:58] brennen: so the hour before that is SRE / mediawiki infrastructure and I have a change that is borderline mw infra and mediawiki or something [17:40:58] (03PS1) 10Cathal Mooney: Remove cloud_private_v4_set from cloudgw nftables definition [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) [17:41:00] (03PS8) 10Majavah: openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) [17:41:05] brennen: well, actually it's deployment system [17:41:23] change in modules/scap/files/foreachwikiindblist [17:41:40] it's adding "usage" https://gerrit.wikimedia.org/r/c/operations/puppet/+/992263/4/modules/scap/files/foreachwikiindblist [17:42:17] mutante: this feels pretty unlikely to break anything, but assuming it does that looks like a pretty quick rollback, right? [17:42:19] I could merge and just double-check that all is normal [17:42:27] yea [17:42:32] yeah, cool by me.
[17:42:37] ok, doing it now then [17:42:47] (03CR) 10Dzahn: [C: 03+2] foreachwikiindblist: Return early when no arg is passed [puppet] - 10https://gerrit.wikimedia.org/r/992263 (owner: 10Zabe) [17:45:29] !log deploy1002/deploy2002 - change in scap foreachwikiindblist deployed (gerrit:992263) [17:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:14] (03CR) 10Dzahn: [C: 03+2] "deployed on deployment hosts before upcoming deploy in roughly an hour" [puppet] - 10https://gerrit.wikimedia.org/r/992263 (owner: 10Zabe) [17:49:29] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10cchen) @Jelto I just reset wikitech account. Superset, hue and Jupyterhub access all work now. thank you! [17:50:44] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:51:45] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) [17:51:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 100%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P56562 and previous config saved to /var/cache/conftool/dbconfig/20240208-175149-root.json [17:51:54] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:51:54] (03PS1) 10BryanDavis: toolhub: Bump container version to 2024-02-08-143714-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/999026 [17:52:00] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) a:05cchen→03None [17:52:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 100%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P56563 and previous config saved to /var/cache/conftool/dbconfig/20240208-175206-root.json [17:53:51] 10SRE, 10Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920 (10Dzahn) @DBu-WMF I think that other ticket I linked would be valuable info for you but I realized you currently don't have access to that. We can look into that. [17:54:38] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2024-02-08-143714-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/999026 (owner: 10BryanDavis) [17:55:53] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2024-02-08-143714-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/999026 (owner: 10BryanDavis) [17:58:36] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) p:05High→03Medium [17:59:50] (03PS2) 10Cathal Mooney: Remove cloud_private_v4_set from cloudgw nftables definition [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) [18:00:05] bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T1800). 
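The dbctl entries above bring db2122, db2103 and es2020 back into rotation in stages (5% or 10% first, then 25%, 50%, 75%, 100%). As a rough aid for reading such a log, here is a minimal Python sketch that pulls the host and percentage out of each "(re)pooling" entry. The names `REPOOL_RE`, `repool_steps` and `fully_repooled` are made up for this illustration and are not part of dbctl or conftool, and the sample entries are trimmed copies of lines from this log.

```python
import re

# Matches the quoted summary inside a dbctl commit entry, e.g.
#   'db2122 (re)pooling @ 10%: After schema change'
# This pattern is inferred from the log lines only; it is an assumption,
# not a format documented by dbctl.
REPOOL_RE = re.compile(
    r"'(?P<host>\w+) \(re\)pooling @ (?P<pct>\d+)%: (?P<reason>[^']*)'"
)


def repool_steps(log_text):
    """Return {host: [percentages in order of appearance]}."""
    steps = {}
    for match in REPOOL_RE.finditer(log_text):
        steps.setdefault(match.group("host"), []).append(int(match.group("pct")))
    return steps


def fully_repooled(steps):
    """Hosts whose most recent logged step is 100%."""
    return sorted(host for host, pcts in steps.items() if pcts and pcts[-1] == 100)


# Trimmed entries from the log (the "diff saved to ..." tails are omitted).
SAMPLE = """
[16:28:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 10%: After schema change'
[16:36:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 5%: After network maintenance'
[16:43:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 25%: After schema change'
[16:51:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 10%: After network maintenance'
[16:58:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 50%: After schema change'
[17:13:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 75%: After schema change'
[17:29:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 100%: After schema change'
"""

if __name__ == "__main__":
    steps = repool_steps(SAMPLE)
    print(steps)
    print(fully_repooled(steps))
```

Feeding the whole channel log to `repool_steps` instead of `SAMPLE` should give the per-host ramp for every database repooled that day, provided the entries keep the same quoting.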
[18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T1800) [18:00:40] I have a Toolhub build ready to push out [18:00:54] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [18:01:29] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [18:01:42] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply [18:02:29] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [18:02:45] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [18:03:34] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [18:17:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:22:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:23:18] (NELHigh) firing: (2) Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [18:23:32] looking [18:24:13] dip on edits too [18:24:50] hnowlan: the dip in edits is very concerning but has been going on for a while [18:25:08] based on logstash, this looks to be connectivity trouble specifically with Spectrum Business (AS20115) [18:28:18] (NELHigh) resolved: (2) Elevated Network Error Logging events (tcp.timed_out) #page - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [18:35:53] cdanis, hnowlan: feeling like a good time for a rollback? [18:36:41] brennen: sgtm [18:36:45] Rollback of the train? It was 90 mins ago. [18:37:26] James_F: that drop on edit graph does correspond fairly closely to group1 rollout... [18:37:29] The peak in the exceptions started at 18:12. [18:37:49] SAL says group1 was 17:01? [18:37:54] James_F: https://sal.toolforge.org/production?p=0&q=1.42.0-wmf.17&d= https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&from=1707400738993&to=1707417237042 [18:37:57] the edit rate graph [18:38:05] finished 17:15:56, approx. [18:38:07] that's lower than it's been in like 3 months [18:38:21] I suspect wikidata issues tbh but I don't have anything other than a hunch [18:38:22] Ah, yes, indeed. [18:38:52] Bot edits still going through on WD. [18:38:55] And human ones. [18:39:05] But possibly much-delayed. [18:39:16] i think let's go back to group0 and see where we're at? [18:39:21] +1 [18:39:24] kk, doing. [18:39:35] There's a dblag page somewhere [18:40:09] https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&formatversion=2&siprop=dbrepllag&sishowalldb=true [18:40:11] ? [18:40:26] Yup, thanks. Those figures look fine? [18:40:37] they do [18:40:53] Do we know where the lower edit rate is happening? is it cluster-wide? Wiki-specific? Geo-specific? [18:41:20] I'm digging [18:42:20] Hi folks, I'd like to run this script in production: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CampaignEvents/+/refs/heads/master/maintenance/GenerateInvitationList.php It doesn't make any writes, but it makes large queries on the revision table. I'm seeing mentions of errors and DB lag, should I hold off? [18:42:30] Daimona: let's hold off for now please [18:43:08] That's what I thought, thank you. 
I'll ask again in ~2h after dinner, hopefully all will be fine by then :) [18:43:15] Hopefully. [18:44:25] (SystemdUnitFailed) firing: (2) prometheus-phpfpm-statustext-textfile.service Failed on mw2374:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:46:23] <_joe_> James_F: I was saying, recentchanges for wikidata seems healthy [18:46:28] Yeah. [18:46:50] <_joe_> so it seems a metrics issue rather than an actual issue [18:46:53] But given the immense edit rate there, even at 75% down it would still look active. [18:46:57] Or that. [18:47:12] it's already going back up [18:47:18] I'm guessing brennen is just starting to restart apaches now [18:47:23] we're at about 75% [18:47:37] <_joe_> James_F: I've counted ~ 400 edits in a minute earlier [18:47:46] Hmm, that sounds about nominal. [18:47:48] <_joe_> but we could just go to the database and count [18:48:10] <_joe_> but it's too late for me to play with the wikidata revisions table :) [18:48:12] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.17 refs T354435 [18:48:19] T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435 [18:48:33] <_joe_> yeah, so edit counts are back to normal [18:49:06] I'm getting a 503 on https://meta.toolforge.org/accounteligibility/70, don't know if it's related [18:49:25] (SystemdUnitFailed) firing: (26) prometheus-phpfpm-statustext-textfile.service Failed on mw1369:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:49:31] <_joe_> sar: not related [18:49:36] <_joe_> sorry :) [18:50:07] the edit rate metric is back to the usual level [18:50:43] it does seem like a reporting issue but I'm not sure how -- I peeked the 
statslib changes in wmf.17 and nothing seemed like it could be a possibility [18:51:37] (which, tbh, I'm not even sure if edit save counts have been converted to statslib yet or not) [18:51:46] i await guidance on whether this should block wmf.17. [18:52:27] brennen: IMO it should, it's one of the five high-level signals we display on https://www.wikimediastatus.net/ for instance [18:52:37] yeah, that seems reasonable. [18:52:40] do we have a task yet? [18:52:50] I can file one [18:53:06] much appreciated [18:54:24] ah [18:54:26] (SystemdUnitFailed) resolved: (31) prometheus-phpfpm-statustext-textfile.service Failed on mw1366:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:54:27] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/990639 [18:54:52] okay I'll file a task AND I'll cc the right people :) [18:55:02] Aha. [18:55:13] cdanis: Great sleuthing. [18:55:26] just had to find the right thing to ctrl-f the changelog for ;) [18:58:59] nice find. [19:00:04] brennen and dancy: That opportune time for a MediaWiki train - Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T1900). 
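For reference, the replication-lag check linked at 18:40:09 is a plain Action API `siteinfo` query. The Python sketch below rebuilds that URL from its parameters and picks the worst lag out of a response. It makes no network call; the response shape (`query.dbrepllag` as a list of `host`/`lag` entries) is assumed from the `formatversion=2` output, and the hosts in `SAMPLE_RESPONSE` are invented for illustration.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def dbrepllag_url(endpoint=API_ENDPOINT):
    """Build the siteinfo query that reports replication lag for all replicas."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "formatversion": "2",
        "siprop": "dbrepllag",
        "sishowalldb": "true",
    }
    return endpoint + "?" + urlencode(params)


def worst_lag(payload):
    """Return (host, lag) for the most lagged replica in a parsed response."""
    entries = payload["query"]["dbrepllag"]
    worst = max(entries, key=lambda entry: entry["lag"])
    return worst["host"], worst["lag"]


# Hypothetical response body: host names and lag values are made up.
SAMPLE_RESPONSE = json.loads("""
{
  "batchcomplete": true,
  "query": {
    "dbrepllag": [
      {"host": "db-example-1", "lag": 0},
      {"host": "db-example-2", "lag": 1}
    ]
  }
}
""")

if __name__ == "__main__":
    print(dbrepllag_url())
    print(worst_lag(SAMPLE_RESPONSE))
```

In practice the URL would be fetched and the JSON body passed to `worst_lag`. A low maximum is what "Those figures look fine?" was confirming in the discussion above.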
[19:00:30] lol thanks jouncebot [19:01:26] !log train 1.42.0-wmf.17 (T354435): currently rolled back to group0; blocked pending a fix for edit metrics (further details to come) [19:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:30] T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435 [19:04:37] ok brennen I filed T357050 [19:04:37] T357050: editResponseTime's port to statslib is not actually backwards-compatible - https://phabricator.wikimedia.org/T357050 [19:05:00] thanks cdanis [19:40:14] i'm going for a ~20m stroll around the block, will check up on blocker task when i get back. [20:07:28] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:11:31] brennen: hi [20:16:24] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:16:26] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:17:59] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:18:36] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:24:04] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for sretest2003 - cmooney@cumin1002" [20:24:55] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for sretest2003 - cmooney@cumin1002" [20:24:55] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:37:32] hey cwhite, sorry i missed the earlier ping. [20:39:33] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:40:13] brennen: it looks to me like WikimediaEvents extension needs a revert [20:40:28] cwhite: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/998971 look ok? [20:40:42] that'll do it [20:40:53] cool [20:40:57] need +1/+2 from me [20:41:16] ? 
[20:41:43] the rubberstamp never hurts. :) [20:41:51] jouncebot nowandnext [20:41:51] For the next 0 hour(s) and 18 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T1900) [20:41:51] In 0 hour(s) and 18 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T2100) [20:42:03] i'll go ahead and sling out the backport. [20:42:29] right on, thank you! [20:43:36] will take train to group0 and then all wikis immediately following. i'll probably be stepping on the afternoon backport window, though i don't see any patches there yet anyhow. [20:45:58] !log brennen@deploy2002 Started scap: Backport for [[gerrit:998972|Revert "Migrate `editResponseTime` metric to Prometheus store" (T357050)]] [20:46:02] T357050: editResponseTime's port to statslib is not actually backwards-compatible - https://phabricator.wikimedia.org/T357050 [20:47:30] !log brennen@deploy2002 brennen: Backport for [[gerrit:998972|Revert "Migrate `editResponseTime` metric to Prometheus store" (T357050)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:47:44] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:48:49] !log brennen@deploy2002 brennen: Continuing with sync [20:54:25] (SystemdUnitFailed) firing: (4) prometheus-phpfpm-statustext-textfile.service Failed on mw1403:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:54:59] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for 
sretest2003 - cmooney@cumin1002" [20:55:15] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:998972|Revert "Migrate `editResponseTime` metric to Prometheus store" (T357050)]] (duration: 09m 17s) [20:55:19] T357050: editResponseTime's port to statslib is not actually backwards-compatible - https://phabricator.wikimedia.org/T357050 [20:55:48] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for sretest2003 - cmooney@cumin1002" [20:55:48] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:58:44] will remove that task as a blocker once i confirm group1 doesn't impact the edit metrics this time. [20:59:25] (SystemdUnitFailed) resolved: (36) prometheus-phpfpm-statustext-textfile.service Failed on mw1354:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240208T2100). [21:00:05] No Gerrit patches in the queue for this window AFAICS. 
[21:01:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P56564 and previous config saved to /var/cache/conftool/dbconfig/20240208-210110-root.json [21:01:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P56565 and previous config saved to /var/cache/conftool/dbconfig/20240208-210125-root.json [21:02:38] RoanKattouw, urbanecm, cjming, TheresNoTime, kindrobot: no patches for this window anyway, but just noting that i'm still mid-train. [21:05:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [21:06:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [21:06:30] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.17 refs T354435 [21:06:43] T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435 [21:06:48] Hello there! I'm reading the scrollback and it seems to me that the issue has been resolved, but I'm not sure what's happening now. Could someone please give me a green light for running my script? [21:08:18] Daimona: not yet. i still need to roll train to group2. [21:09:31] Yup, makes sense, ty. Could you please poke me when it's all good? [21:09:40] Daimona: will do! [21:09:42] (SystemdUnitFailed) firing: (75) prometheus-phpfpm-statustext-textfile.service Failed on mw1354:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:09:46] Thanks! 
[21:09:56] (SystemdUnitFailed) firing: (76) prometheus-phpfpm-statustext-textfile.service Failed on mw1354:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [21:10:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [21:13:23] !log brennen@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.17 refs T354435 (duration: 06m 52s) [21:13:27] T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435 [21:14:43] (SystemdUnitFailed) firing: (99) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:56] (SystemdUnitFailed) firing: (97) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:16:11] edit count looks unaffected, going ahead to all wikis. 
[21:16:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P56566 and previous config saved to /var/cache/conftool/dbconfig/20240208-211615-root.json [21:16:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P56567 and previous config saved to /var/cache/conftool/dbconfig/20240208-211630-root.json [21:19:43] (SystemdUnitFailed) resolved: (76) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:20:55] (SystemdUnitFailed) firing: (24) prometheus-phpfpm-statustext-textfile.service Failed on mw2284:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:24:45] (SystemdUnitFailed) firing: (76) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:25:34] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.17 refs T354435 [21:25:38] T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435 [21:25:58] (SystemdUnitFailed) firing: (81) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:29:40] (SystemdUnitFailed) resolved: (65) 
prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:31:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P56568 and previous config saved to /var/cache/conftool/dbconfig/20240208-213120-root.json [21:31:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P56569 and previous config saved to /var/cache/conftool/dbconfig/20240208-213135-root.json [21:32:04] brennen: i see train's in progress ATM – is there chance of deploying a config patch soon? [21:33:39] urbanecm: we're now on all wikis and things feel pretty stable, i'd say go ahead. [21:33:48] also Daimona was looking to run a script [21:35:23] ack, thanks [21:35:51] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:36:11] Oh, green light then? 
[21:37:03] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:999085|Echo: Use conditional defaults for 4 user properties (T353225)]] [21:37:08] T353225: Echo: Make use of conditional user defaults - https://phabricator.wikimedia.org/T353225 [21:38:30] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:999085|Echo: Use conditional defaults for 4 user properties (T353225)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:40:04] !log urbanecm@deploy2002 urbanecm: Continuing with sync [21:45:55] (SystemdUnitFailed) firing: (23) prometheus-phpfpm-statustext-textfile.service Failed on mw1352:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:46:11] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:999085|Echo: Use conditional defaults for 4 user properties (T353225)]] (duration: 09m 07s) [21:46:15] T353225: Echo: Make use of conditional user defaults - https://phabricator.wikimedia.org/T353225 [21:46:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P56570 and previous config saved to /var/cache/conftool/dbconfig/20240208-214625-root.json [21:46:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P56571 and previous config saved to /var/cache/conftool/dbconfig/20240208-214640-root.json [21:49:40] (SystemdUnitFailed) resolved: (36) prometheus-phpfpm-statustext-textfile.service Failed on mw1351:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:57:02] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:999089|Echo: Conditional defaults: 
Fix start timestamp (T353225)]] [21:57:07] T353225: Echo: Make use of conditional user defaults - https://phabricator.wikimedia.org/T353225 [21:58:28] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:999089|Echo: Conditional defaults: Fix start timestamp (T353225)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:59:13] !log T357007 Running mwscript /home/daimona/GenerateInvitationList.php --wiki=metawiki --listfile=/home/daimona/list2.txt (same as current master) [21:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:17] T357007: Generate Invitation Lists for Event Organizers - https://phabricator.wikimedia.org/T357007 [22:00:10] !log urbanecm@deploy2002 urbanecm: Continuing with sync [22:05:56] Does anybody know if there's a recommended way to generate a speedscope file for a maintenance script in production? [22:06:31] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:999089|Echo: Conditional defaults: Fix start timestamp (T353225)]] (duration: 09m 29s) [22:06:37] T353225: Echo: Make use of conditional user defaults - https://phabricator.wikimedia.org/T353225 [22:06:40] (SystemdUnitFailed) firing: (12) prometheus-phpfpm-statustext-textfile.service Failed on mw1352:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:10:55] (SystemdUnitFailed) resolved: (35) prometheus-phpfpm-statustext-textfile.service Failed on mw1352:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:11:24] !log adding missing external-links group to AMS-IX peering port ae1.380 cr1-esams [22:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:15] (MediaWikiHighErrorRate) firing: 
Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:18:35] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [22:20:48] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: racked and provision network restbase servers - jclark@cumin1002" [22:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:21:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: racked and provision network restbase servers - jclark@cumin1002" [22:21:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:24:21] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host restbase1034.mgmt.eqiad.wmnet with reboot policy FORCED [22:26:01] hrm. are these cirrusSearchElasticaWrite errors still expected? 
[22:26:04] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host restbase1035.mgmt.eqiad.wmnet with reboot policy FORCED [22:27:00] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:31:45] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:38:38] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1005*,cloudelastic1006*,cloudelastic1007*,cloudelastic1008* for IP migration - bking@cumin2002 - T355617 [22:38:42] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1005*,cloudelastic1006*,cloudelastic1007*,cloudelastic1008* for IP migration - bking@cumin2002 - T355617 [22:38:42] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [22:41:18] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [22:43:19] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: racked and provision network restbase servers - jclark@cumin1002" [22:44:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: racked and provision network restbase servers - jclark@cumin1002" [22:44:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:45:13] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (Jclark-ctr) [22:45:15] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] -
https://phabricator.wikimedia.org/T354893 (VRiley-WMF) [22:46:09] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (Jclark-ctr) [22:47:22] (PS1) Majavah: openldap: cross-validate-accounts: Note shell users disabled in LDAP [puppet] - https://gerrit.wikimedia.org/r/999103 [22:51:57] !log made a stupid mistake and accidentally installed knot & unbound on dns1004, based on logs I don't think any harm was caused, they have since been removed [22:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:59] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic [22:58:03] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic [23:10:04] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:11:08] (PS1) JHathaway: jhathaway: update dotfiles [puppet] - https://gerrit.wikimedia.org/r/999122 [23:11:29] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/999122 (owner: JHathaway) [23:15:23] (PS2) JHathaway: jhathaway: update dotfiles [puppet] - https://gerrit.wikimedia.org/r/999122 [23:15:29] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/999122 (owner: JHathaway) [23:17:06] !log removing two files for legal compliance [23:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:36] (CR) JHathaway: [C: +2] jhathaway: update dotfiles [puppet] - https://gerrit.wikimedia.org/r/999122 (owner: JHathaway) [23:28:56] !log removing one file for legal compliance [23:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:38] (PS1) Volans: sre.hosts.reimage: fix module
docstring [cookbooks] - https://gerrit.wikimedia.org/r/999131 [23:35:40] (PS1) Volans: sre.hosts.provision: fix but when running from VM [cookbooks] - https://gerrit.wikimedia.org/r/999132 [23:36:38] (PS2) Volans: sre.hosts.reimage: fix module docstring [cookbooks] - https://gerrit.wikimedia.org/r/999131 [23:36:40] (PS2) Volans: sre.hosts.provision: fix but when running from VM [cookbooks] - https://gerrit.wikimedia.org/r/999132 [23:41:53] (CR) Volans: [C: +2] sre.hosts.reimage: fix module docstring [cookbooks] - https://gerrit.wikimedia.org/r/999131 (owner: Volans) [23:42:18] (CR) Volans: [C: +2] "Self-merging to unblock DCops for now. Happy to adapt it later." [cookbooks] - https://gerrit.wikimedia.org/r/999132 (owner: Volans) [23:49:48] SRE, MW-on-K8s, Trust and Safety Product Team, serviceops-radar, Patch-For-Review: MediaModeration maintenance script scanFilesInScanTable.php indirectly calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) [23:50:21] !log removing 14 files for legal compliance [23:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:09] (CR) Volans: [V: +2 C: +2] "gate and submit job succeeded but did not report back here in more than 5m, bypassing it" [cookbooks] - https://gerrit.wikimedia.org/r/999131 (owner: Volans) [23:53:32] (CR) Volans: [V: +2 C: +2] "gate and submit job succeeded but did not report back here in more than 5m, bypassing it" [cookbooks] - https://gerrit.wikimedia.org/r/999132 (owner: Volans) [23:56:55] !log volans@cumin1002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [23:57:39] !log volans@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL