[00:14:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T342617)', diff saved to https://phabricator.wikimedia.org/P50390 and previous config saved to /var/cache/conftool/dbconfig/20230810-001414-ladsgroup.json [00:14:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [00:14:20] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [00:14:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [00:14:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T342617)', diff saved to https://phabricator.wikimedia.org/P50391 and previous config saved to /var/cache/conftool/dbconfig/20230810-001437-ladsgroup.json [00:23:23] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:25:29] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:26:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T342617)', diff saved to https://phabricator.wikimedia.org/P50392 and previous config saved to /var/cache/conftool/dbconfig/20230810-002648-ladsgroup.json [00:26:53] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [00:38:48] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947388 [00:38:54] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947388 (owner: 10TrainBranchBot) [00:41:55] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P50393 and previous config saved to /var/cache/conftool/dbconfig/20230810-004154-ladsgroup.json [00:43:11] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:45] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:44:01] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:44:01] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:54:43] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947388 (owner: 10TrainBranchBot) [00:57:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P50394 and previous config saved to /var/cache/conftool/dbconfig/20230810-005701-ladsgroup.json [01:02:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T342617)', diff saved to https://phabricator.wikimedia.org/P50395 and previous config saved to /var/cache/conftool/dbconfig/20230810-010212-ladsgroup.json [01:02:19] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [01:12:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T342617)', diff saved to https://phabricator.wikimedia.org/P50396 and previous config saved to /var/cache/conftool/dbconfig/20230810-011207-ladsgroup.json [01:12:09] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [01:12:12] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [01:12:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [01:12:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1214 (T342617)', diff saved to https://phabricator.wikimedia.org/P50397 and previous config saved to /var/cache/conftool/dbconfig/20230810-011228-ladsgroup.json [01:17:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P50398 and previous config saved to /var/cache/conftool/dbconfig/20230810-011718-ladsgroup.json [01:32:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P50399 and previous config saved to /var/cache/conftool/dbconfig/20230810-013225-ladsgroup.json [01:47:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T342617)', diff saved to https://phabricator.wikimedia.org/P50400 and previous config saved to /var/cache/conftool/dbconfig/20230810-014731-ladsgroup.json [01:47:35] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [02:00:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T342617)', diff saved to https://phabricator.wikimedia.org/P50401 and previous config saved to /var/cache/conftool/dbconfig/20230810-020012-ladsgroup.json [02:00:22] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [02:06:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P50402 and previous config saved to /var/cache/conftool/dbconfig/20230810-021518-ladsgroup.json [02:18:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:23:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:24:25] (03PS1) 10Mdaniels5757: add (I think even properly!) autopatrolled group with autopatrol right for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 [02:26:10] (03PS2) 10Mdaniels5757: add (I think even properly!) 
autopatrolled group with autopatrol right for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) [02:30:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P50403 and previous config saved to /var/cache/conftool/dbconfig/20230810-023025-ladsgroup.json [02:31:33] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:45:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T342617)', diff saved to https://phabricator.wikimedia.org/P50404 and previous config saved to /var/cache/conftool/dbconfig/20230810-024531-ladsgroup.json [02:45:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [02:45:36] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [02:45:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [03:27:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [03:27:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [04:01:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T342617)', diff saved to https://phabricator.wikimedia.org/P50405 and previous config saved to /var/cache/conftool/dbconfig/20230810-040104-ladsgroup.json [04:01:18] T342617: Make old columns of externallinks nullable - 
https://phabricator.wikimedia.org/T342617 [04:16:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P50406 and previous config saved to /var/cache/conftool/dbconfig/20230810-041610-ladsgroup.json [04:31:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P50407 and previous config saved to /var/cache/conftool/dbconfig/20230810-043116-ladsgroup.json [04:46:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T342617)', diff saved to https://phabricator.wikimedia.org/P50408 and previous config saved to /var/cache/conftool/dbconfig/20230810-044622-ladsgroup.json [04:46:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [04:46:27] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [04:46:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [04:46:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T342617)', diff saved to https://phabricator.wikimedia.org/P50409 and previous config saved to /var/cache/conftool/dbconfig/20230810-044643-ladsgroup.json [05:04:29] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:05] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:13:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:13:50] (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline" [puppet] - https://gerrit.wikimedia.org/r/947425 (owner: Ssingh)
[05:13:55] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:13:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:21:18] (CR) Muehlenhoff: profile::mirrors::serve: Remove Ferm-specific syntax (1 comment) [puppet] - https://gerrit.wikimedia.org/r/944768 (owner: Muehlenhoff)
[05:21:20] (CR) Muehlenhoff: [C: +2] profile::mirrors::serve: Remove Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/944768 (owner: Muehlenhoff)
[05:22:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50421 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:22:51] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:24:21] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:25:53] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1015.eqiad.wmnet [05:27:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast5004.wikimedia.org [05:27:41] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [05:29:42] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast5004.wikimedia.org - jmm@cumin2002" [05:30:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast5004.wikimedia.org - jmm@cumin2002" [05:30:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [05:30:27] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast5004.wikimedia.org on all recursors [05:30:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast5004.wikimedia.org on all recursors [05:30:57] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast5004.wikimedia.org - jmm@cumin2002" [05:31:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast5004.wikimedia.org - jmm@cumin2002" [05:32:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast5004.wikimedia.org with OS bookworm [05:32:25] 10SRE, 10Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast5004.wikimedia.org with OS bookworm [05:35:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945782 
(owner: 10Muehlenhoff) [05:50:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T342617)', diff saved to https://phabricator.wikimedia.org/P50410 and previous config saved to /var/cache/conftool/dbconfig/20230810-055005-ladsgroup.json [05:50:09] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [05:51:45] (03CR) 10Muehlenhoff: [C: 03+2] zookeeper: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945779 (owner: 10Muehlenhoff) [05:59:03] !log installing tiff security updates [05:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T0600) [06:00:04] kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T0600). Please do the needful. 
[06:01:01] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:05:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P50411 and previous config saved to /var/cache/conftool/dbconfig/20230810-060511-ladsgroup.json [06:05:14] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-codfw [06:08:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [06:09:13] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:15:51] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:17:11] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:17:23] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:18:41] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:20:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P50412 and previous config saved to 
/var/cache/conftool/dbconfig/20230810-062017-ladsgroup.json [06:20:18] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-eqiad [06:23:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [06:24:47] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:24:59] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:26:19] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:26:31] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:32:47] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:34:09] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:35:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T342617)', diff saved to https://phabricator.wikimedia.org/P50413 and previous config saved to /var/cache/conftool/dbconfig/20230810-063523-ladsgroup.json [06:35:26] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[06:35:28] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[06:35:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[06:35:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[06:36:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[06:36:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T342617)', diff saved to https://phabricator.wikimedia.org/P50414 and previous config saved to /var/cache/conftool/dbconfig/20230810-063611-ladsgroup.json
[06:46:33] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:39] (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[06:47:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:48:26] (PS3) Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648)
[06:56:01] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945782 (owner: Muehlenhoff)
[06:58:28] (PS3) Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648)
[06:58:30] (PS4) Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648)
[06:58:32] (PS1) Stevemunene: airflow-wmde: create analytics-wmde user for airflow [puppet] - https://gerrit.wikimedia.org/r/947714
[07:00:04] Amir1, apergos, and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T0700).
[07:00:17] morning! no trainees, no patches, no news. It's August! have a nice day everybody and we'll see you all next time.
[07:02:12] (CR) Muehlenhoff: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42814/console" [puppet] - https://gerrit.wikimedia.org/r/945782 (owner: Muehlenhoff)
[07:04:08] (PS2) Stevemunene: airflow-wmde: create analytics-wmde user for airflow [puppet] - https://gerrit.wikimedia.org/r/947714
[07:05:32] (PS4) Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648)
[07:05:43] (PS5) Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648)
[07:06:55] (PS1) Ayounsi: Enable sftp-server [homer/public] - https://gerrit.wikimedia.org/r/947715 (https://phabricator.wikimedia.org/T316544)
[07:07:39] (Traffic bill over quota) resolved: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[07:11:55] (PS5) Ayounsi: [WIP] Initial SONiC config from Homer YAML [homer/public] - https://gerrit.wikimedia.org/r/940867 (https://phabricator.wikimedia.org/T320638)
[07:19:25] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast5004.wikimedia.org with OS bookworm
[07:19:25] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host bast5004.wikimedia.org
[07:19:30] SRE, Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast5004.wikimedia.org with OS bookworm executed with errors: - bast5004 (**FAIL**) - Removed from Puppet...
[07:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:25:36] (PS5) Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648)
[07:28:29] (CR) JMeybohm: [C: -1] Update blubberoid to use certmanager certs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033) (owner: Effie Mouzeli)
[07:36:18] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi) Quick status update regarding Homer. With those 3 patches: * Initial OpenConfig/SONiC support to wmf-netbox - https://gerrit.wikimedia.org/...
[07:44:15] SRE, Infrastructure-Foundations, Patch-For-Review: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (MoritzMuehlenhoff)
[07:48:24] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast5004.wikimedia.org
[07:48:35] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[07:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:52:31] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:56:12] (CR) Giuseppe Lavagetto: [C: +2] admin: update my bashrc [puppet] - https://gerrit.wikimedia.org/r/947379 (owner: Giuseppe Lavagetto)
[07:59:44] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:00:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:00:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:00:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast5004.wikimedia.org
[08:00:36] SRE, Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast5004.wikimedia.org` - bast5004.wikimedia.org (**WARN**) - //Host not found on Icinga, unable to downt...
[08:11:01] <_joe_> jouncebot: nowandnext
[08:11:01] No deployments scheduled for the next 1 hour(s) and 48 minute(s)
[08:11:01] In 1 hour(s) and 48 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1000)
[08:11:01] In 1 hour(s) and 48 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1000)
[08:11:42] (CR) Giuseppe Lavagetto: [C: +1] mediawiki: set requests based on php.workers [deployment-charts] - https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[08:13:59] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (MoritzMuehlenhoff) a:fgiunchedi→Eevans >>! In T342969#9080553, @adee_wmde wrote: >>>! In T342969#9080463, @MoritzMuehlenhoff wrote: >> @adee_wmde You are using the same key...
[08:16:17] (CR) Muehlenhoff: C:bigtop::hadoop move net-topology.py to files. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[08:18:06] (PS2) Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [deployment-charts] - https://gerrit.wikimedia.org/r/790357 (https://phabricator.wikimedia.org/T117845)
[08:19:47] (CR) Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/790357 (https://phabricator.wikimedia.org/T117845) (owner: Fomafix)
[08:21:15] !log put back business hours americas for sre business hours escalation
[08:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:37] !log put back business hours americas for sre business hours escalation - T343812
[08:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:40] T343812: On-call batphone escalation configuration holidays Aug 2023 - https://phabricator.wikimedia.org/T343812
[08:21:51] (CR) JMeybohm: [C: +2] mediawiki: set requests based on php.workers [deployment-charts] - https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[08:22:50] (Merged) jenkins-bot: mediawiki: set requests based on php.workers [deployment-charts] - https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[08:26:42] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[08:28:37] (PS4) Samtar: IS: Enable Phonos on medium projects [mediawiki-config] - https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763)
[08:29:16] jouncebot: nowandnext
[08:29:17] No deployments scheduled for the next 1 hour(s) and 30 minute(s)
[08:29:17] In 1 hour(s) and 30 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1000)
[08:29:17] In 1 hour(s) and 30 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1000)
[08:31:36] <_joe_> TheresNoTime: hold your horses
[08:31:39] <_joe_> :)
[08:31:52] * TheresNoTime isn't going to deploy anything ^^
[08:36:50] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[08:42:04] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast3007.wikimedia.org
[08:44:37] (PS1) JMeybohm: Revert "mediawiki: set requests based on php.workers" [deployment-charts] - https://gerrit.wikimedia.org/r/947792 (https://phabricator.wikimedia.org/T342748)
[08:45:06] (CR) Giuseppe Lavagetto: [C: +1] Revert "mediawiki: set requests based on php.workers" [deployment-charts] - https://gerrit.wikimedia.org/r/947792 (https://phabricator.wikimedia.org/T342748) (owner: JMeybohm)
[08:46:15] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:51:31] (PS2) JMeybohm: Revert "mediawiki: set requests based on php.workers" [deployment-charts] - https://gerrit.wikimedia.org/r/947792 (https://phabricator.wikimedia.org/T342748)
[08:52:31] (CR) Filippo Giunchedi: [C: +1] webperf: Remove Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945782 (owner: Muehlenhoff)
[08:53:48] (CR) Filippo Giunchedi: [C: +1] prometheus: gather stats from haproxy for openstack and cloudlb [puppet] - https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: David Caro)
[08:55:10] (CR) Giuseppe Lavagetto: [C: +1] Revert "mediawiki: set requests based on php.workers" [deployment-charts] - https://gerrit.wikimedia.org/r/947792 (https://phabricator.wikimedia.org/T342748) (owner: JMeybohm)
[08:57:48] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-db1001.eqiad.wmnet
[08:58:20] I am doing some airflow maintenance and rebooting a postgresql server. I have tried to put downtime in for everything, but there might be a bit of noise.
[08:58:53] (PS6) Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648)
[09:00:20] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:00:56] PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[09:01:58] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:03:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:03:50] !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'CHUniZH' 'Musik CH' # T343867
[09:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:54] T343867: Unblock stuck global rename of Musik CH - https://phabricator.wikimedia.org/T343867
[09:04:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1001.eqiad.wmnet
[09:04:30] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-client1002.eqiad.wmnet
[09:04:47] (CR) JMeybohm: [C: +2] Revert "mediawiki: set requests based on php.workers" [deployment-charts] - https://gerrit.wikimedia.org/r/947792 (https://phabricator.wikimedia.org/T342748) (owner: JMeybohm)
[09:05:03] urbanecm: looks like we have quite a few stuck renames atm. are you fixing those too or should I?
[09:05:14] taavi: yep, working on it.
[09:05:20] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:05:25] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-airflow1002.eqiad.wmnet [09:05:34] (03CR) 10Jbond: [C: 03+1] Revert "logspam.pl: Filter out some persistent noise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: 10Bartosz Dziewoński) [09:05:36] (03Merged) 10jenkins-bot: Revert "mediawiki: set requests based on php.workers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/947792 (https://phabricator.wikimedia.org/T342748) (owner: 10JMeybohm) [09:06:06] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-airflow1004.eqiad.wmnet [09:06:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1002.eqiad.wmnet [09:06:19] (03CR) 10Elukey: C:bigtop::hadoop move net-topology.py to files. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:06:26] taavi: since you're here: i think it's a good idea to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/947362/ soon. even though the list of IPs is not yet finalized, i think it's better to have the rule in place soon, and amend it as new info flows, rather than rushing the deployment seconds before Wikimania. what do you think? 
[09:06:52] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1004.eqiad.wmnet [09:07:32] !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Garciajaysonpinolkwani98' 'Ne_Shokot_Pinolkwane' [09:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:37] yep, planning to do that today, after the current MW infra window or so [09:07:40] !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=amwiki --logwiki=metawiki 'Jean-Mahmood' 'User92259453' [09:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:46] taavi: okay, awesome, thanks :) [09:08:09] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host karapace1001.eqiad.wmnet [09:08:32] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (10cmooney) Amazing work! Looks great. >>! In T320638#9082582, @ayounsi wrote: > * The ordering can be problematic (`# TODO needs to happen after the... [09:08:53] !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki 'Mittzy' 'Mittzy (usurped)' [09:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:00] !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=arwiki --logwiki=metawiki 'Qwertyoruiop' '3h6 1' [09:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:19] hmm, also I just noticed all of the stuck renames were done via Special:GlobalRenameUser and not via the queue. 
that makes me worried I broke something in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/934384, but I don't see anything [09:09:38] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-client1002.eqiad.wmnet [09:09:54] taavi: hmm...let me test that [09:10:24] RECOVERY - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [09:12:12] started https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Martin_Urbanec_(test_10-renamed), it started immediately [09:12:30] PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host karapace1001.eqiad.wmnet [09:12:44] so did a rename back [09:15:14] urbanecm: what happens if you input the username in a non-canonical format? so replace a space with an underscore, or a lowercase first letter, or similar [09:15:24] that was the rename back [09:15:29] but i can try other non-canonical formats [09:19:28] taavi: i managed to break it, but in a different way. [09:19:36] https://meta.wikimedia.org/wiki/Special:CentralAuth/Martin_Urbanec_(test_10), https://meta.wikimedia.org/wiki/Special:CentralAuth/Martin_Urbanec_(test_10_renamed-02) [09:20:04] and https://meta.wikimedia.org/wiki/Special:CentralAuth/Martin_Urbanec_(test_10-renamed) [09:20:30] oops [09:20:43] the problem is i don't know how i broke it... trying more.
[09:22:22] (03PS2) 10Ssingh: P:bird::anycast: use systemd::sysuser for creating the bird user [puppet] - 10https://gerrit.wikimedia.org/r/947425 [09:22:43] (03CR) 10Ssingh: P:bird::anycast: use systemd::sysuser for creating the bird user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947425 (owner: 10Ssingh) [09:23:19] in the meantime, I can fairly reliably reproduce the "jobs get lost" issue locally if the target username is in a non-canonical format. I'll update the task and see if I can come up with a fix [09:23:32] and reverting my patch does indeed fix the issue [09:23:34] (03PS4) 10Effie Mouzeli: Update blubberoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033) [09:23:46] RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:27] taavi: yeah, and renaming one account twice seems to cause the other bug. [09:25:29] filing task...
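The "jobs get lost" report above (target username entered in a non-canonical format) is consistent with a status lookup keyed on the raw input rather than the normalized name. The following is a minimal Python sketch of that idea, assuming a simplified form of MediaWiki-style normalization (underscores become spaces, whitespace collapses, first letter is uppercased). `canonical_username` and `RenameStatusStore` are hypothetical names for illustration and are not the CentralAuth code.

```python
def canonical_username(raw: str) -> str:
    """Approximate username normalization (an assumption, simplified):
    underscores become spaces, whitespace runs collapse to one space,
    and the first character is uppercased."""
    name = " ".join(raw.replace("_", " ").split())
    return name[:1].upper() + name[1:]


class RenameStatusStore:
    """Hypothetical in-memory stand-in for a rename status table."""

    def __init__(self):
        self._rows = {}

    def set_status(self, new_name: str, wiki: str, status: str) -> None:
        # Normalize on write so every row is keyed by the canonical name.
        self._rows[(canonical_username(new_name), wiki)] = status

    def get_status(self, new_name: str, wiki: str):
        # Normalize on read too, so "foo_bar" and "Foo bar" hit the same row.
        return self._rows.get((canonical_username(new_name), wiki))


store = RenameStatusStore()
store.set_status("ne_Shokot_Pinolkwane", "commonswiki", "queued")
print(store.get_status("Ne Shokot Pinolkwane", "commonswiki"))
```

If only one side of the write/read pair normalized the name, the job looking up its row by the canonical name would find nothing, which is one plausible way for rename jobs to appear lost.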
[09:27:42] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): cloudcumin: decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10fnegri) p:05Triage→03Medium [09:29:00] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10Jelto) [09:29:01] filed T343956 [09:29:02] T343956: Renaming global account to non-canonical form causes rename jobs to be lost - https://phabricator.wikimedia.org/T343956 [09:29:24] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10Jelto) [09:32:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/947425 (owner: 10Ssingh) [09:33:14] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:33:35] and T343958 [09:33:35] T343958: Renaming one account multiple times creates duplicate global accounts - https://phabricator.wikimedia.org/T343958 [09:33:46] taavi: does reverting the patch fix both issues? [09:34:17] (i'd test, but i don't have CA set up (yet?) on my work laptop, and i don't have my personal laptop nearby atm) [09:34:25] (03CR) 10JMeybohm: [C: 03+1] Update blubberoid to use certmanager certs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [09:34:27] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10RickiJay-WMDE) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDZMvLWML3HYfq2Tc1TvfUFInGtmN8DS01pcdYDetuiCklmTUFuRwYfeIhevlpwFKxauefEDs04YH/i0aupTfrGfORRtS/qLhn8lSQY3z73c/XlMOYwozfHeojc... 
[09:36:23] urbanecm: it does at least for the first one, but I think I have a one-line patch for the first one [09:36:29] will look at the second one after I'm done testing this [09:36:40] okay, ty. [09:37:36] * urbanecm leaves the accounts broken for now; i'll fix them once we fix the problem. [09:38:33] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10jbond) >>! In T341973#9049479, @bking wrote: > Swift > - CON: [[ https://platform.swiftstack.com/docs/introduction/openstack_swift.html#mass... [09:41:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/944870 (owner: 10Muehlenhoff) [09:42:00] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/947794/ [09:44:19] (03CR) 10Stevemunene: airflow-wmde: configure wmde airflow instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [09:46:05] +2'ed. [09:49:21] unable to reproduce the second bug locally, could you clarify which usernames you're trying to rename at each step? [09:50:04] (03CR) 10Ayounsi: [C: 03+1] Policy and definition updates for post-migration esams ranges [homer/public] - 10https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [09:52:31] taavi: clarified the steps, according to my notes of what i did. 
[09:52:52] (it might be something specific to WMF infra that's not present locally, theoretically) [09:53:02] (03CR) 10Jbond: Modify install and apt server config to support Juniper ZTP via HTTP (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [09:53:27] (03PS1) 10Urbanecm: GlobalRename: Ensure status database rows use the normalized name [extensions/CentralAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947454 (https://phabricator.wikimedia.org/T343956) [09:54:02] (03CR) 10Ssingh: [C: 03+2] P:bird::anycast: use systemd::sysuser for creating the bird user [puppet] - 10https://gerrit.wikimedia.org/r/947425 (owner: 10Ssingh) [09:54:05] (03CR) 10David Caro: [V: 03+1 C: 03+2] prometheus: gather stats from haproxy for openstack and cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [09:54:07] (03CR) 10Ssingh: [C: 03+2] bird: drop support for buster [puppet] - 10https://gerrit.wikimedia.org/r/947412 (owner: 10Ssingh) [09:54:26] dcaro: ok to merge yours? [09:54:32] sukhe: yes please :) [09:54:33] David Caro: prometheus: gather stats from haproxy for openstack and cloudlb (b6592cf212) [09:54:36] thanks [09:55:55] urbanecm: thanks, reproduced locally [09:56:03] 👍 [09:56:59] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10RickiJay-WMDE) a:05RickiJay-WMDE→03None [09:57:25] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10LSobanski) [09:59:01] (03CR) 10Btullis: "I think that the way I would tackle this is to try to avoid duplication." [puppet] - 10https://gerrit.wikimedia.org/r/947714 (owner: 10Stevemunene) [10:00:04] mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1000). [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1000) [10:07:16] (03PS8) 10D3r1ck01: wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930798 [10:07:24] (03CR) 10Stang: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757) [10:07:43] (03CR) 10D3r1ck01: wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930798 (owner: 10D3r1ck01) [10:09:37] (03CR) 10Effie Mouzeli: [C: 03+2] Update blubberoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [10:10:22] (03Merged) 10jenkins-bot: Update blubberoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [10:10:35] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6002.drmrs.wmnet with OS bookworm [10:10:35] BGP/BFD alerts expected in drmrs [10:12:05] (03PS1) 10EoghanGaffney: gitlab: Add missing options for objectstore and extract swift key [puppet] - 10https://gerrit.wikimedia.org/r/947798 [10:13:36] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet [10:13:37] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42815/console" [puppet] - 10https://gerrit.wikimedia.org/r/947798 (owner: 10EoghanGaffney) [10:14:50] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:15:00] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:15:03] expected [10:16:29] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1090.eqiad.wmnet with OS bullseye [10:17:29] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet [10:17:43] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1010.eqiad.wmnet [10:21:32] (JobUnavailable) firing: Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:23:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1010.eqiad.wmnet [10:26:33] (JobUnavailable) firing: (2) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:29:08] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1090.eqiad.wmnet with reason: host reimage [10:30:38] urbanecm: found the other issue too! 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/947799 [10:32:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1090.eqiad.wmnet with reason: host reimage [10:32:43] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [10:32:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [10:33:46] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [10:34:57] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply [10:36:07] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [10:36:16] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [10:42:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:43:13] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.571 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:44:19] taavi: thanks for fixing both issues! Commented on the patch; the explanation in the commit message should probably be on the task as well, to make it easier to link in code comments/etc (this seems likely to happen again when someone decides to refactor things). 
[10:44:40] Will test once I get to my personal laptop, unless someone beats me :) [10:45:46] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply [10:46:20] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply [10:46:32] (03CR) 10Jbond: [C: 04-1] "idea looks good but minor bug" [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:47:54] urbanecm: thanks, fixed and will do [10:48:08] (03PS1) 10Effie Mouzeli: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033) [10:55:53] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1090.eqiad.wmnet with OS bullseye [10:58:28] (03PS2) 10Muehlenhoff: Add base nftables sets which are equivalent to the main Ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) [10:58:52] (03CR) 10CI reject: [V: 04-1] Add base nftables sets which are equivalent to the main Ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:00:13] (03PS1) 10Effie Mouzeli: Update changeprop to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947802 (https://phabricator.wikimedia.org/T300033) [11:00:15] (03PS3) 10Muehlenhoff: Add base nftables sets which are equivalent to the main Ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) [11:04:31] (03CR) 10Jbond: [C: 03+1] "minor optional follow up comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney) [11:04:49] (03CR) 10Muehlenhoff: Add base nftables sets which are equivalent to the main Ferm macros (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/944911 
(https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:06:36] (03PS1) 10Effie Mouzeli: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947804 (https://phabricator.wikimedia.org/T300033) [11:09:00] (03CR) 10Effie Mouzeli: [C: 04-1] "I am not sure this is correct, needs a little more thought" [deployment-charts] - 10https://gerrit.wikimedia.org/r/947802 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [11:09:11] (03PS2) 10Muehlenhoff: firewall: Make more Ferm-specific setup conditional to the ferm provider [puppet] - 10https://gerrit.wikimedia.org/r/945557 (https://phabricator.wikimedia.org/T336497) [11:09:52] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:11:40] (03Abandoned) 10Effie Mouzeli: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947804 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [11:12:11] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/945557 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:12:24] (03PS1) 10Effie Mouzeli: Update cxserver to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947805 (https://phabricator.wikimedia.org/T300033) [11:13:36] (03PS2) 10Effie Mouzeli: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033) [11:13:57] (03PS1) 10Ssingh: wikimedia.org: update glue record for ns2 [dns] - 10https://gerrit.wikimedia.org/r/947807 (https://phabricator.wikimedia.org/T343942) [11:14:00] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1091.eqiad.wmnet with OS bullseye [11:14:11] (03CR) 10Muehlenhoff: firewall: Make more Ferm-specific setup conditional to the ferm provider (031 comment) [puppet] 
- 10https://gerrit.wikimedia.org/r/945557 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:14:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:14:54] (03CR) 10CI reject: [V: 04-1] wikimedia.org: update glue record for ns2 [dns] - 10https://gerrit.wikimedia.org/r/947807 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh) [11:17:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:17:59] (03PS2) 10Ssingh: wikimedia.org: update glue record for ns2 [dns] - 10https://gerrit.wikimedia.org/r/947807 (https://phabricator.wikimedia.org/T343942) [11:18:17] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:18:55] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:19:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/945586 (owner: 10Muehlenhoff) [11:20:47] (03CR) 10Muehlenhoff: firewall: Ship a base profile for the nftables provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:20:52] (03PS1) 10Ssingh: hiera: update v4 IP for ns2 [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) [11:21:04] jouncebot: nowandnext [11:21:04] No deployments scheduled for the next 0 hour(s) and 38 minute(s) [11:21:04] In 0 hour(s) and 38 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1200) [11:21:34] (03PS1) 10Btullis: Temporarily disable the gobblin jobs on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/947811 
(https://phabricator.wikimedia.org/T329363) [11:21:36] (03PS1) 10Btullis: Re-enable the gobblin timers on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/947812 (https://phabricator.wikimedia.org/T329363) [11:21:50] I'll deploy some config patches and a backport [11:21:59] (03CR) 10Ssingh: [C: 04-1] "Do not merge." [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh) [11:22:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947361 (owner: 10Majavah) [11:22:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595) (owner: 10Majavah) [11:22:54] (03CR) 10Majavah: [C: 03+2] GlobalRename: Ensure status database rows use the normalized name [extensions/CentralAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947454 (https://phabricator.wikimedia.org/T343956) (owner: 10Urbanecm) [11:22:57] (03Merged) 10jenkins-bot: throttle: remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947361 (owner: 10Majavah) [11:22:59] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6002.drmrs.wmnet with OS bookworm [11:22:59] (03Merged) 10jenkins-bot: throttle: add rules for Wikimania 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595) (owner: 10Majavah) [11:23:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/945749 (owner: 10Muehlenhoff) [11:23:20] !log taavi@deploy1002 Started scap: Backport for [[gerrit:947361|throttle: remove expired rules]], [[gerrit:947362|throttle: add rules for Wikimania 2023 (T343595)]] [11:23:23] T343595: Increase account creation at Wikimania 2023 August 14-20 [Note: incomplete IP list] - 
https://phabricator.wikimedia.org/T343595 [11:23:24] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/945750 (owner: 10Muehlenhoff) [11:24:27] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/945755 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:24:55] !log taavi@deploy1002 taavi: Backport for [[gerrit:947361|throttle: remove expired rules]], [[gerrit:947362|throttle: add rules for Wikimania 2023 (T343595)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [11:26:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle) [11:27:18] (03CR) 10Btullis: [C: 03+2] Temporarily disable the gobblin jobs on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/947811 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [11:27:23] !log taavi@deploy1002 taavi: Continuing with sync [11:27:35] (03Merged) 10jenkins-bot: GlobalRename: Ensure status database rows use the normalized name [extensions/CentralAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947454 (https://phabricator.wikimedia.org/T343956) (owner: 10Urbanecm) [11:28:05] (03Abandoned) 10Ori: Randomize thumbnail TTL to prevent stampedes [puppet] - 10https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661) (owner: 10Ori) [11:28:38] (03PS1) 10Jaime Nuche: releases jenkins: allow Scap to disable services on secondary hosts [puppet] - 10https://gerrit.wikimedia.org/r/947814 (https://phabricator.wikimedia.org/T343447) [11:30:15] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:32:47] (03Abandoned) 10Jbond: netbox: update netbox service 
to active/active [puppet] - 10https://gerrit.wikimedia.org/r/890385 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:32:49] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-coord1001.eqiad.wmnet with OS bullseye [11:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:34:51] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:947361|throttle: remove expired rules]], [[gerrit:947362|throttle: add rules for Wikimania 2023 (T343595)]] (duration: 11m 30s) [11:34:55] T343595: Increase account creation at Wikimania 2023 August 14-20 [Note: incomplete IP list] - https://phabricator.wikimedia.org/T343595 [11:35:16] !log taavi@deploy1002 Started scap: Backport for [[gerrit:947454|GlobalRename: Ensure status database rows use the normalized name (T343956)]] [11:35:19] T343956: Renaming global account to non-canonical form causes rename jobs to be lost - https://phabricator.wikimedia.org/T343956 [11:35:27] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:35:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:36:41] (03Abandoned) 10Jbond: netbox: update netbox so that its active/active [dns] - 10https://gerrit.wikimedia.org/r/890384 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:36:48] !log taavi@deploy1002 taavi and urbanecm: Backport for [[gerrit:947454|GlobalRename: Ensure status database rows use the normalized name (T343956)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug 
kubernetes deployment (accessible via k8s-experimental XWD option) [11:36:59] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:25] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50421 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:37:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.302 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:37:47] (03CR) 10Jbond: [C: 03+2] sre.puppet.sync-netbox-hiera: Add platform [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [11:38:34] (KubernetesAPILatency) resolved: (18) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:39:08] !log taavi@deploy1002 taavi and urbanecm: Continuing with sync [11:39:34] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add manufacture to network devices - jbond@cumin1001 - T329669" [11:39:37] T329669: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 [11:40:51] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add manufacture to network devices - jbond@cumin1001 - T329669" [11:41:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T342617)', diff saved to https://phabricator.wikimedia.org/P50415 and previous config saved to /var/cache/conftool/dbconfig/20230810-114108-ladsgroup.json [11:41:11] T342617: Make old columns of externallinks 
nullable - https://phabricator.wikimedia.org/T342617 [11:42:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [11:42:35] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1091.eqiad.wmnet with reason: host reimage [11:44:28] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3007.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:45:33] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:947454|GlobalRename: Ensure status database rows use the normalized name (T343956)]] (duration: 10m 17s) [11:45:36] T343956: Renaming global account to non-canonical form causes rename jobs to be lost - https://phabricator.wikimedia.org/T343956 [11:45:50] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: host reimage [11:45:59] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1091.eqiad.wmnet with reason: host reimage [11:48:52] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: host reimage [11:53:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [11:53:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [11:53:38] (03PS1) 10Jbond: tlsproxy::envoy: improve docs [puppet] - 10https://gerrit.wikimedia.org/r/947821 [11:55:11] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! 
Although technically not the 'glue' record that's in the org zone not this wikimedia.org one :P" [dns] - 10https://gerrit.wikimedia.org/r/947807 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh) [11:56:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P50416 and previous config saved to /var/cache/conftool/dbconfig/20230810-115614-ladsgroup.json [11:58:14] (03PS1) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 [11:58:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [11:58:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [11:58:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3007.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:58:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:58:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast3007.wikimedia.org [11:58:37] (03CR) 10CI reject: [V: 04-1] Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori) [11:58:42] 10SRE, 10Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast3007.wikimedia.org` - bast3007.wikimedia.org (**PASS**) - Downtimed host on Icinga/Alertmanager - F... 
[12:00:04] (03PS2) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1200) [12:00:27] (03CR) 10CI reject: [V: 04-1] Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori) [12:00:49] (03PS3) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 [12:01:14] (03CR) 10CI reject: [V: 04-1] Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori) [12:02:54] (03CR) 10Ori: "Not tested." [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori) [12:04:35] (03CR) 10Jbond: [C: 03+1] "lgtm but see warning inline" [puppet] - 10https://gerrit.wikimedia.org/r/946559 (owner: 10Herron) [12:04:45] (03PS4) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 [12:05:22] (03PS5) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (https://phabricator.wikimedia.org/T211661) [12:06:27] (03PS3) 10Cathal Mooney: Policy and definition updates for post-migration esams ranges [homer/public] - 10https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) [12:08:17] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [12:08:38] checking [12:08:45] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1091.eqiad.wmnet with OS bullseye [12:08:50] !incidents [12:08:51] 3938 (UNACKED) NELHigh sre 
(tcp.timed_out) [12:08:51] 3937 (RESOLVED) ATSBackendErrorsHigh cache_text sre (miscweb.discovery.wmnet eqsin) [12:09:00] !ack 3938 [12:09:00] 3938 (ACKED) NELHigh sre (tcp.timed_out) [12:09:14] I don't see a spike yet on the logs [12:09:22] checking graphs [12:09:51] sustained since 12:01 [12:10:15] it is acked [12:10:18] origin? [12:10:27] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 38 probes of 778 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:10:38] that points to eqsin [12:11:03] had a previous spike at 08:56 too [12:11:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P50417 and previous config saved to /var/cache/conftool/dbconfig/20230810-121120-ladsgroup.json [12:12:07] yeah, nel points to text-lb.eqsin.wikimedia.org. as well for the tcp.timed_out [12:12:09] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) >>! In T320390#9068521, @Jelto wrote: > @jbond @SLyngshede-WMF do you have a idea how to change the name GitLab uses with O... 
[12:12:57] checking superset [12:13:17] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [12:13:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [12:13:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [12:13:49] TheresNoTime: thanks, will have a look at it too [12:14:14] jynus: (when you're not busy) which superset dash do you look at, just out of curiosity [12:14:38] yeah, later when we are out of the incident (even if it resolved) [12:14:39] (03PS1) 10Btullis: Use a routable email address for sending kerberos details [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155) [12:14:55] (ack) [12:15:43] I think I have it, but switching to private channels [12:15:45] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 2 probes of 778 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:17:09] 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) Range is being accepted by Arelion according to their looking glass: ` Router: adm-b6 / Amsterdam (Iron Mountain, Haarlem) Command: show bg... 
[12:17:17] (03CR) 10CI reject: [V: 04-1] Use a routable email address for sending kerberos details [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155) (owner: 10Btullis) [12:22:02] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1092.eqiad.wmnet with OS bullseye [12:25:12] (03CR) 10Ssingh: [C: 04-1] "A bit unsure about this: the anycast IP already exists on lo so I am not sure if duplicating that is a good idea. Let's think a bit more." [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh) [12:26:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T342617)', diff saved to https://phabricator.wikimedia.org/P50418 and previous config saved to /var/cache/conftool/dbconfig/20230810-122626-ladsgroup.json [12:26:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:26:29] 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) HE also accepting and path I'm taking from home connection: ` core1.ams7.he.net> show ipv6 bgp routes detail 2a02:ec80:300::/48 Number... 
[12:26:30] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [12:26:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:29:07] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:46] 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) Reachable from VPS in the UK although not sure exactly how it's coming in to us: ` root@uk:~# mtr -z -b -w -c 10 2a02:ec80:300:ffff::187 St... [12:31:27] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:34:35] 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) Also accepted by Liberty Global. They also see a transit route via Tele2 (AS1257) so getting picked up there, as well as from Deutsche Tel... 
[12:34:54] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: New IP and Vlan allocations for esams knams move - https://phabricator.wikimedia.org/T343214 (10cmooney) [12:35:21] 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) 05Open→03Resolved [12:38:02] (03PS1) 10Ladsgroup: Enable url shortener in sidebar in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947823 (https://phabricator.wikimedia.org/T267921) [12:38:45] (03PS1) 10Btullis: Don't install python-is-python3 to presto servers [puppet] - 10https://gerrit.wikimedia.org/r/947824 (https://phabricator.wikimedia.org/T336281) [12:39:32] (03CR) 10Btullis: [C: 03+2] Don't install python-is-python3 to presto servers [puppet] - 10https://gerrit.wikimedia.org/r/947824 (https://phabricator.wikimedia.org/T336281) (owner: 10Btullis) [12:39:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/947319 (owner: 10Muehlenhoff) [12:40:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/947316 (owner: 10Muehlenhoff) [12:41:50] (03CR) 10Ayounsi: [C: 03+1] Policy and definition updates for post-migration esams ranges [homer/public] - 10https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [12:42:09] (03PS4) 10Cathal Mooney: Policy and definition updates for post-migration esams ranges [homer/public] - 10https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) [12:45:50] (03PS1) 10Btullis: Remove settings relating to oozie on an-test-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363) [12:46:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [12:46:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [12:47:19] (03CR) 10Ayounsi: BGPalerter: mute software-update notifications (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [12:47:23] (03CR) 10Ayounsi: [C: 03+2] BGPalerter: mute software-update notifications [puppet] - 10https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [12:49:31] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:54:08] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1092.eqiad.wmnet with reason: host reimage [12:57:17] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1092.eqiad.wmnet with reason: host reimage [12:57:35] (03PS2) 10Btullis: Remove settings relating to oozie on an-test-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363) [12:57:37] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10User-jbond: Investigate the potential benefits of BGPalerter - https://phabricator.wikimedia.org/T230600 (10ayounsi) 05Open→03Resolved a:03jbond All done! Assigned to jbond as he did most of the work! [12:58:55] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42818/console" [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1300). 
[13:00:04] Dreamy_Jazz and TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] (03CR) 10Filippo Giunchedi: [C: 03+1] "post-review typo I just noticed :|" [puppet] - 10https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [13:00:12] \o [13:00:14] * TheresNoTime can deploy [13:01:15] Dreamy_Jazz: to confirm, you just need those scripts run? [13:01:26] Yes [13:01:30] ack [13:05:05] (wait one) [13:06:18] (03PS1) 10Giuseppe Lavagetto: python3: update to python3-bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/947828 [13:06:20] (03PS1) 10Giuseppe Lavagetto: httpd-fcgi: fix double logging issue [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/947829 (https://phabricator.wikimedia.org/T340935) [13:06:39] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] python3: update to python3-bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/947828 (owner: 10Giuseppe Lavagetto) [13:06:41] `foreachwiki sql.php extensions/CheckUser/schema/mysql/cu_useragent_clienthints.sql` returns `Unable to open input file`, looking.. [13:06:46] If the scripts don't work, my intention was to add the tables to all wikis except testwiki. 
[13:07:00] (03PS1) 10Cathal Mooney: Reverse DNS includes for new /24 ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) [13:07:10] As testwiki already has the table [13:07:53] ack [13:07:54] (03CR) 10CI reject: [V: 04-1] Reverse DNS includes for new /24 ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [13:08:14] looks like I need the full path, okay [13:09:29] !log `[samtar@mwmaint1002 ~]$ foreachwiki sql.php /srv/mediawiki-staging/php-1.41.0-wmf.20/extensions/CheckUser/schema/mysql/cu_useragent_clienthints.sql` for T258105 [13:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:33] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/947798 (owner: 10EoghanGaffney) [13:09:33] T258105: Implement storage for User-Agent Client Hints header data - https://phabricator.wikimedia.org/T258105 [13:10:25] (03CR) 10Jbond: [C: 03+1] firewall: Ship a base profile for the nftables provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:11:09] (03CR) 10Jbond: [C: 03+2] tlsproxy::envoy: improve docs [puppet] - 10https://gerrit.wikimedia.org/r/947821 (owner: 10Jbond) [13:14:41] (03CR) 10Jbond: [C: 03+1] "lgtm excluding the ci issue" [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155) (owner: 10Btullis) [13:14:51] !log `[samtar@mwmaint1002 ~]$ foreachwiki sql.php /srv/mediawiki-staging/php-1.41.0-wmf.20/extensions/CheckUser/schema/mysql/cu_useragent_clienthints_map.sql` for T258105 [13:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:55] T258105: Implement storage for User-Agent Client Hints header data - https://phabricator.wikimedia.org/T258105 [13:15:22] Dreamy_Jazz: first script done, second 
running — I see the new table [13:15:26] Thanks. [13:15:32] (03PS1) 10Jelto: gitlab_runner: add sonar-scanner-cli image to allowed_images [puppet] - 10https://gerrit.wikimedia.org/r/947832 (https://phabricator.wikimedia.org/T343975) [13:15:41] <_joe_> James_F: sorry I just realized we never deployed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/945534 [13:15:45] `wikifunctionswiki` also already had the table, guessing that's expected [13:16:03] Not sure, but I only requested it on testwiki [13:16:18] <_joe_> jouncebot: now [13:16:18] For the next 0 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1300) [13:16:45] _joe_: want me to let you know when I'm done? [13:16:46] <_joe_> TheresNoTime: can you ping me when you're done? [13:16:48] <_joe_> yes :) [13:16:49] hah, yes [13:16:50] <_joe_> lol [13:17:08] <_joe_> I want to sync that patch for wikifunctions [13:20:08] (noting that I'm aware there's a lot of `WARNING`s being generated in logstash while these scripts run) [13:20:14] Dreamy_Jazz: both scripts run, I can see both of the new tables [13:20:21] Thanks. 
[13:20:51] (03PS5) 10Samtar: IS: Enable Phonos on medium projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763) [13:21:11] (03PS2) 10Giuseppe Lavagetto: Add wikifunctions object cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945534 (https://phabricator.wikimedia.org/T297815) [13:21:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar) [13:22:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1092.eqiad.wmnet with OS bullseye [13:22:29] (03Merged) 10jenkins-bot: IS: Enable Phonos on medium projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar) [13:22:45] !log samtar@deploy1002 Started scap: Backport for [[gerrit:936717|IS: Enable Phonos on medium projects (T336763)]] [13:22:48] T336763: Enable PhonosInlineAudioPlayerMode on all projects - https://phabricator.wikimedia.org/T336763 [13:24:04] (03PS1) 10Marostegui: install_server: Do not reimage db2188 [puppet] - 10https://gerrit.wikimedia.org/r/947836 [13:24:18] !log samtar@deploy1002 samtar: Backport for [[gerrit:936717|IS: Enable Phonos on medium projects (T336763)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:24:28] * TheresNoTime testing [13:24:42] _joe_: Thanks for the reminder! 
[13:24:57] if there’s enough time at the end of the window (after _joe_), I’d like to do T343980 if that’s okay [13:24:57] T343980: [IPM] Enable temporary accounts (IP Masking) on Beta Wikidata - https://phabricator.wikimedia.org/T343980 [13:25:09] unless urbanecm or anyone else objects to enabling temp accounts on another wiki [13:25:15] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2188 [puppet] - 10https://gerrit.wikimedia.org/r/947836 (owner: 10Marostegui) [13:26:05] <_joe_> James_F: I'm going to merge it now if there's space in this window [13:26:59] !log samtar@deploy1002 samtar: Continuing with sync [13:27:00] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10brennen) > But users may not want to have their full name (cn?) in GitLab displayed (for example Rando McRandomface Jr). So the ui... [13:27:06] (03CR) 10David Caro: [V: 03+1 C: 03+2] prometheus: gather stats from haproxy for openstack and cloudlb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [13:27:33] I'll be done after 936717 finishes syncing [13:29:38] (03PS3) 10Btullis: Remove settings relating to oozie on an-test-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363) [13:30:46] (03PS2) 10Ssingh: hiera: temporarily remove v4 IP for ns2 from authdns_addrs [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) [13:30:58] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42819/console" [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [13:31:50] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42820/console" [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh) [13:32:13] Lucas_WMDE: (bikeshedding, but..) is there a reason we're not considering enabling temp accounts to more beta cluster projects? [13:32:40] nothing in particular that I know of 🤷 [13:32:49] (i.e. enabling it on beta wikidata, why not enable it just on the majority of beta projects) [13:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:33:43] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:936717|IS: Enable Phonos on medium projects (T336763)]] (duration: 10m 58s) [13:33:49] T336763: Enable PhonosInlineAudioPlayerMode on all projects - https://phabricator.wikimedia.org/T336763 [13:33:56] _joe_: done, all yours [13:34:02] <_joe_> TheresNoTime: thanks [13:34:30] TheresNoTime: I just don’t think that’s my call to make ^^ [13:34:40] (03PS1) 10Andrew Bogott: Revert "Revert "cloudbackup200[12]: remove some spurious config from the last patch"" [puppet] - 10https://gerrit.wikimedia.org/r/947839 [13:35:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945534 (https://phabricator.wikimedia.org/T297815) (owner: 10Giuseppe Lavagetto) [13:35:23] Lucas_WMDE: [[WP:BOLD]] /sarcasm [13:35:28] lol [13:35:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T342617)', diff saved to https://phabricator.wikimedia.org/P50419 and previous config saved to /var/cache/conftool/dbconfig/20230810-133534-ladsgroup.json [13:35:37] T342617: Make old columns of externallinks nullable - 
https://phabricator.wikimedia.org/T342617 [13:35:46] (03Merged) 10jenkins-bot: Add wikifunctions object cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945534 (https://phabricator.wikimedia.org/T297815) (owner: 10Giuseppe Lavagetto) [13:36:00] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:945534|Add wikifunctions object cache (T297815)]] [13:36:04] T297815: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 [13:36:23] Wikipedia has Be bold. Commons has Don’t be bold. Wikitech has a secret third thing [13:36:51] (03CR) 10Ssingh: [V: 03+1] "cumin:O:dnsbox output: https://puppet-compiler.wmflabs.org/output/947810/42821/" [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh) [13:37:16] (03CR) 10Ssingh: [V: 03+1] "Needs a second pair of eyes, please review the PCC as well." [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh) [13:37:38] !log oblivian@deploy1002 oblivian: Backport for [[gerrit:945534|Add wikifunctions object cache (T297815)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:38:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:38:46] !log oblivian@deploy1002 oblivian: Continuing with sync [13:39:06] (03CR) 10Ssingh: [V: 03+1] "lo UNKNOWN 127.0.0.1/8 208.80.154.238/32 208.80.153.231/32 91.198.174.239/32 10.3.0.1/32 198.35.27.27/32 ::1/128" [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh) [13:39:14] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Revert "cloudbackup200[12]: remove some spurious 
config from the last patch"" [puppet] - 10https://gerrit.wikimedia.org/r/947839 (owner: 10Andrew Bogott) [13:43:05] (03PS1) 10Andrew Bogott: remove 'cluster:wmcs' from cloudbackup2xxx host config [puppet] - 10https://gerrit.wikimedia.org/r/947842 [13:44:04] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:44:53] (03PS4) 10Btullis: Remove settings relating to oozie on an-test-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363) [13:45:10] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:945534|Add wikifunctions object cache (T297815)]] (duration: 09m 09s) [13:45:40] (03PS1) 10Ssingh: bird: create /etc/bird without relying on postint [puppet] - 10https://gerrit.wikimedia.org/r/947843 [13:45:40] T297815: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 [13:46:15] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42822/console" [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [13:46:42] (03CR) 10Btullis: [V: 03+1 C: 03+2] Remove settings relating to oozie on an-test-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [13:47:18] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42823/console" [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh) [13:47:26] !log depool and stop puppet on ms-fe2009 to test updated rewrite.py T211661 [13:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:30] T211661: Automatically clean up unused thumbnails in Swift - 
https://phabricator.wikimedia.org/T211661 [13:47:36] (03CR) 10Andrew Bogott: [C: 03+2] remove 'cluster:wmcs' from cloudbackup2xxx host config [puppet] - 10https://gerrit.wikimedia.org/r/947842 (owner: 10Andrew Bogott) [13:49:04] (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:50:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P50420 and previous config saved to /var/cache/conftool/dbconfig/20230810-135040-ladsgroup.json [13:51:53] (03PS1) 10Andrew Bogott: Set profile::wmcs::backy2::backup_time in cloudbackup200x [puppet] - 10https://gerrit.wikimedia.org/r/947846 [13:51:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) (owner: 10Slyngshede) [13:52:52] !log restart puppet and repool ms-fe2009 after testing T211661 [13:52:53] (03PS1) 10Lucas Werkmeister (WMDE): Enable IP Masking on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947847 (https://phabricator.wikimedia.org/T343980) [13:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:56] T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 [13:53:45] _joe_: are you done deploying? [13:54:02] <_joe_> Lucas_WMDE: duh sorry yes it was a single patch [13:54:10] ok! [13:54:22] TheresNoTime: want to give https://gerrit.wikimedia.org/r/947847 a quick +1 before I deploy it? 
🥺 [13:54:59] (03CR) 10Samtar: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947847 (https://phabricator.wikimedia.org/T343980) (owner: 10Lucas Werkmeister (WMDE)) [13:55:03] (03CR) 10Andrew Bogott: [C: 03+2] Set profile::wmcs::backy2::backup_time in cloudbackup200x [puppet] - 10https://gerrit.wikimedia.org/r/947846 (owner: 10Andrew Bogott) [13:55:05] ^^ [13:55:11] thx [13:55:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Jclark-ctr) [13:55:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947847 (https://phabricator.wikimedia.org/T343980) (owner: 10Lucas Werkmeister (WMDE)) [13:55:58] (03CR) 10MVernon: "I've tested this on ms-fe2009 (depooled, copied into place, restarted swift-proxy, ran rewrite_integration_test.py having made that +x loc" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (https://phabricator.wikimedia.org/T211661) (owner: 10Ori) [13:56:25] jouncebot: next [13:56:25] In 2 hour(s) and 3 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1600) [13:56:28] ok phew [13:56:28] (03Merged) 10jenkins-bot: Enable IP Masking on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947847 (https://phabricator.wikimedia.org/T343980) (owner: 10Lucas Werkmeister (WMDE)) [13:56:31] it’ll probably overrun a bit [13:56:46] oh wait, no it won’t [13:56:52] no scap sync for beta-only changes ^^ [13:57:06] * Lucas_WMDE done [13:57:50] !log UTC afternoon backport+config window done [13:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr) [13:58:14] 10SRE, 10ops-codfw: ganeti2014: broken RAM - 
https://phabricator.wikimedia.org/T341546 (10Jhancock.wm) @MoritzMuehlenhoff you nailed it. Got that updated for you. Can you confirm that it's working as expected now? [13:58:42] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10jijiki) [14:01:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-coord1001.eqiad.wmnet with OS bullseye [14:02:19] (03CR) 10Ayounsi: [C: 03+1] "overall lgtm but 1 comment" [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh) [14:02:24] (03CR) 10MVernon: [C: 03+1] "[I've looked at this one and it seems right to me; you already have +1s on all of these similar changes, so I don't propose to re-review t" [puppet] - 10https://gerrit.wikimedia.org/r/944959 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [14:02:55] I'll take the quiet moment to push out a few more WF patches. [14:03:11] (03PS3) 10Jforrester: Add wikifunctions-staff to wmgPrivilegedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946584 (https://phabricator.wikimedia.org/T342868) [14:03:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946584 (https://phabricator.wikimedia.org/T342868) (owner: 10Jforrester) [14:03:47] (03CR) 10Ssingh: [V: 03+1] bird: create /etc/bird without relying on postint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh) [14:03:56] (03Merged) 10jenkins-bot: Add wikifunctions-staff to wmgPrivilegedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946584 (https://phabricator.wikimedia.org/T342868) (owner: 10Jforrester) [14:04:10] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:946584|Add wikifunctions-staff to wmgPrivilegedGroups (T342868)]] [14:04:18] T342868: Add oathauth-enable to wikifunctions-staff - 
https://phabricator.wikimedia.org/T342868 [14:05:42] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:946584|Add wikifunctions-staff to wmgPrivilegedGroups (T342868)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:05:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P50421 and previous config saved to /var/cache/conftool/dbconfig/20230810-140546-ladsgroup.json [14:06:07] !log jforrester@deploy1002 jforrester: Continuing with sync [14:06:33] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "on PTO, but this LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:11:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "on PTO, but this LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:12:45] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:946584|Add wikifunctions-staff to wmgPrivilegedGroups (T342868)]] (duration: 08m 35s) [14:12:54] T342868: Add oathauth-enable to wikifunctions-staff - https://phabricator.wikimedia.org/T342868 [14:12:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "This LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/945586 (owner: 10Muehlenhoff) [14:13:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945809 (https://phabricator.wikimedia.org/T342753) (owner: 10Jforrester) [14:13:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:13:36] (03PS2) 10Jforrester: Wikifunctions: Tell WikiLambda to stash results in our bespoke cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945809 (https://phabricator.wikimedia.org/T342753) [14:13:39] (03CR) 10TrainBranchBot: "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945809 (https://phabricator.wikimedia.org/T342753) (owner: 10Jforrester) [14:14:21] (03Merged) 10jenkins-bot: Wikifunctions: Tell WikiLambda to stash results in our bespoke cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945809 (https://phabricator.wikimedia.org/T342753) (owner: 10Jforrester) [14:14:38] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:945809|Wikifunctions: Tell WikiLambda to stash results in our bespoke cache (T342753)]] [14:14:41] T342753: Add MW caching for Wikifunctions functions calls into Wikimedia production - https://phabricator.wikimedia.org/T342753 [14:16:06] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:945809|Wikifunctions: Tell WikiLambda to stash results in our bespoke cache (T342753)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:16:17] !log jforrester@deploy1002 jforrester: Continuing with sync [14:16:33] 
(JobUnavailable) firing: (3) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:42] (03PS8) 10Herron: thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/946559 (https://phabricator.wikimedia.org/T343987) [14:17:41] (03PS2) 10Jforrester: wikifunctions: Allow transwiki import from Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946541 (https://phabricator.wikimedia.org/T343365) (owner: 10Stang) [14:18:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:18:58] (03CR) 10Herron: thanos-fe: switch to cfssl (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/946559 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [14:20:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T342617)', diff saved to https://phabricator.wikimedia.org/P50422 and previous config saved to /var/cache/conftool/dbconfig/20230810-142053-ladsgroup.json [14:20:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [14:21:00] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:21:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [14:21:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50423 and previous config saved to 
/var/cache/conftool/dbconfig/20230810-142117-ladsgroup.json [14:22:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:22:53] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:945809|Wikifunctions: Tell WikiLambda to stash results in our bespoke cache (T342753)]] (duration: 08m 15s) [14:22:58] T342753: Add MW caching for Wikifunctions functions calls into Wikimedia production - https://phabricator.wikimedia.org/T342753 [14:23:35] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.244 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:25:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946541 (https://phabricator.wikimedia.org/T343365) (owner: 10Stang) [14:25:44] (03Merged) 10jenkins-bot: wikifunctions: Allow transwiki import from Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946541 (https://phabricator.wikimedia.org/T343365) (owner: 10Stang) [14:25:58] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:946541|wikifunctions: Allow transwiki import from Wikidata (T343365)]] [14:26:01] T343365: Allow transwiki import from Wikidata to Wikifunctions - https://phabricator.wikimedia.org/T343365 [14:26:33] (03CR) 10Ayounsi: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh) [14:27:29] !log jforrester@deploy1002 stang and jforrester: Backport for [[gerrit:946541|wikifunctions: Allow transwiki import from Wikidata (T343365)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:28:17] !log jforrester@deploy1002 stang and jforrester: 
Continuing with sync [14:30:04] (03PS2) 10Effie Mouzeli: thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [14:30:08] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] gitlab: Add missing options for objectstore and extract swift key [puppet] - 10https://gerrit.wikimedia.org/r/947798 (owner: 10EoghanGaffney) [14:31:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10RobH) >>! In T342159#9067390, @ssingh wrote: >>>! In T342159#9025176, @RobH wrote: >> Please note parent task 341588 has the range of cp1[090-105] however, cp1090 is already live/in us... [14:34:27] (03PS3) 10Effie Mouzeli: thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [14:35:21] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:946541|wikifunctions: Allow transwiki import from Wikidata (T343365)]] (duration: 09m 22s) [14:35:24] T343365: Allow transwiki import from Wikidata to Wikifunctions - https://phabricator.wikimedia.org/T343365 [14:36:34] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:36:36] (All done.) 
[14:36:57] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:37:32] <_joe_> James_F: I don't see any use of the caches though [14:37:41] (03PS4) 10Hnowlan: thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) [14:38:14] (03CR) 10Hnowlan: thumbor: remove thumbor server configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [14:38:22] (03PS6) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 [14:38:37] (03PS5) 10Effie Mouzeli: thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [14:39:20] (03CR) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori) [14:40:31] (03PS1) 10Ayounsi: esams/knams: stop anycast advertisments [homer/public] - 10https://gerrit.wikimedia.org/r/947856 [14:40:43] (03CR) 10Effie Mouzeli: [C: 03+1] "This is a historic moment." 
[puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [14:41:35] (KubernetesAPILatency) resolved: (10) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:41:50] (03PS1) 10Btullis: Create component/libmysql-java for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/947857 (https://phabricator.wikimedia.org/T329363) [14:42:32] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155) (owner: 10Btullis) [14:44:56] (03CR) 10David Caro: [C: 03+1] Add dr0ptp4kt to wmcs-admin [puppet] - 10https://gerrit.wikimedia.org/r/947028 (https://phabricator.wikimedia.org/T343862) (owner: 10Dr0ptp4kt) [14:47:27] (03PS6) 10Effie Mouzeli: thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [14:47:54] (03CR) 10CI reject: [V: 04-1] thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [14:50:04] (03PS7) 10Effie Mouzeli: thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [14:50:29] (03CR) 10Effie Mouzeli: [C: 03+1] thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [14:53:13] (03PS2) 10Btullis: Use a routable email address for sending kerberos details [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155) [14:56:38] (03CR) 10Btullis: Use a routable email address for sending kerberos details (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155) (owner: 10Btullis) [14:59:11] (03CR) 10MVernon: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori) [14:59:50] (03CR) 10JHathaway: [C: 03+1] profile::mirrors::serve: Remove Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944768 (owner: 10Muehlenhoff) [15:01:33] (03CR) 10Btullis: [C: 03+2] Use a routable email address for sending kerberos details [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155) (owner: 10Btullis) [15:02:53] (03CR) 10MVernon: "Is the intention here to make similar changes to the ms swift proxies? I'd like to avoid more skew developing between thanos-swift-config " [puppet] - 10https://gerrit.wikimedia.org/r/946559 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [15:03:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10MW-on-K8s, 10serviceops-radar: Physical re-labeling of mw1497 and mw1498 to kubernetes1025 and kubernetes1026 - https://phabricator.wikimedia.org/T343708 (10jijiki) [15:05:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10MW-on-K8s, 10serviceops-radar: Physical re-labeling of mw1497 and mw1498 to kubernetes1025 and kubernetes1026 - https://phabricator.wikimedia.org/T343708 (10jijiki) [15:06:21] (03CR) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori) [15:07:40] _joe_: Hmm, it seems to be working from the application level, at least. 
[15:09:31] (03CR) 10Andrew Bogott: [C: 03+2] Add dr0ptp4kt to wmcs-admin [puppet] - 10https://gerrit.wikimedia.org/r/947028 (https://phabricator.wikimedia.org/T343862) (owner: 10Dr0ptp4kt) [15:10:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:13:06] (03PS1) 10Hnowlan: deployment_server: add new service geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947862 (https://phabricator.wikimedia.org/T336400) [15:13:18] <_joe_> James_F: do you have one key? [15:13:38] Not yet. [15:13:57] (03CR) 10Herron: thanos-fe: switch to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/946559 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [15:14:09] <_joe_> I suspect for some reason we're sending requests to the wrong cluster [15:14:49] (03CR) 10Jbond: "see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh) [15:15:02] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori) [15:15:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:15:56] (03CR) 10MVernon: [C: 03+2] Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori) [15:16:13] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
dump-conftool-pools.service,envoyproxy.service,fstrim.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:29] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:41] (03CR) 10Jbond: "As a matter of process this should get sign-off from Nicholas as the group approver" [puppet] - 10https://gerrit.wikimedia.org/r/947028 (https://phabricator.wikimedia.org/T343862) (owner: 10Dr0ptp4kt) [15:19:28] (03PS1) 10Hnowlan: service, conftool: add base configuration for geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947864 (https://phabricator.wikimedia.org/T336400) [15:19:41] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:45] PROBLEM - Disk space on config-master2001 is CRITICAL: DISK CRITICAL - free space: /run 145MiB (99% inode=0%): /run/credentials 145MiB (99% inode=0%): /run/systemd/incoming 145MiB (99% inode=0%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=config-master2001&var-datasource=codfw+prometheus/ops [15:20:58] !log mvernon@cumin1001 START - Cookbook
sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe [15:21:19] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10jbond) This change has already been merged however as a matter of process the following approvals should have been collected on this ticket > - access request (or expans... [15:21:34] (03PS1) 10JMeybohm: CI: Bail out if admin_ng build fails completely [deployment-charts] - 10https://gerrit.wikimedia.org/r/947865 (https://phabricator.wikimedia.org/T343978) [15:21:36] (03PS1) 10JMeybohm: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) [15:22:19] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,fstrim.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10jijiki) [15:23:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/946559 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [15:25:03] PROBLEM - Disk space on config-master1001 is CRITICAL: DISK CRITICAL - free space: /run 145MiB (99% inode=0%): /run/credentials 145MiB (99% inode=0%): /run/systemd/incoming 145MiB (99% inode=0%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=config-master1001&var-datasource=eqiad+prometheus/ops [15:25:32] (03CR) 10Jbond: [C: 03+1] thanos-fe: switch to cfssl (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/946559 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [15:27:05] (03PS2) 10JMeybohm: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) [15:27:07] _joe_: Not sure how to tell from the shell which BagOStuff I got back. [15:27:23] <_joe_> James_F: check the prefix [15:27:26] <_joe_> the routing prefix [15:27:30] (03PS1) 10JHathaway: dev env: cadvisor exporter, in container env listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/947868 (https://phabricator.wikimedia.org/T337972) [15:28:06] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/947868 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:28:25] <_joe_> James_F: how did you get the BagOStuff? [15:28:49] _joe_: `use MediaWiki\Extension\WikiLambda\WikiLambdaServices; WikiLambdaServices::getZObjectStash();` [15:28:52] (03CR) 10JHathaway: "kindly review!"
[puppet] - 10https://gerrit.wikimedia.org/r/947868 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:29:25] 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster thumbor1005, thumbor1006 to kubernetes1057 and kubernetes1058 - https://phabricator.wikimedia.org/T343993 (10jijiki) [15:29:47] 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10jijiki) [15:30:33] <_joe_> ["routingPrefix":protected]=> [15:30:33] 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10jijiki) [15:30:34] <_joe_> string(9) "/local/wf" [15:30:37] <_joe_> so that is correct [15:30:44] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10jijiki) [15:30:46] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe [15:30:59] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:42] _joe_: Then hopefully it's going to the right place? [15:32:26] It's definitely getting cached somewhere. [15:32:59] <_joe_> James_F: I'm trying to understand that :) [15:33:01] E.g. if I go to https://www.wikifunctions.org/wiki/Z801 and enter the string Wikimedia it echoes back correctly and says it did so 25 mins ago (cached result from my testing). [15:33:07] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10Andrew) We discussed and supported this during the wmcs weekly meeting. Nicholas is on vacation and I'm approving as his proxy.
[15:34:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/947868 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:35:05] 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10jijiki) [15:35:26] 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10jijiki) [15:36:15] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10Andrew) >>! In T343862#9084179, @jbond wrote: > This change has already been merged however as a matter of process the following approvals should have been collected on t... [15:36:35] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10jijiki) [15:36:38] 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10jijiki) [15:39:45] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10jbond) >>! In T343862#9084235, @Andrew wrote: > We discussed and supported this during the wmcs weekly meeting. Nicholas is on vacation and I'm approving as his proxy. a... [15:40:26] (03CR) 10Filippo Giunchedi: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/946551 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [15:42:15] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:42:59] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:36] (03CR) 10Filippo Giunchedi: [C: 03+1] dev env: cadvisor exporter, in container env listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/947868 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:45:47] (03CR) 10JHathaway: [C: 03+2] dev env: cadvisor exporter, in container env listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/947868 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:48:26] (03PS2) 10Hnowlan: deployment_server: add new service geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947862 (https://phabricator.wikimedia.org/T336400) [15:51:27] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:06] (03PS1) 10Andrew Bogott: Remove yet more unneeded config from cloudbackup200x [puppet] - 10https://gerrit.wikimedia.org/r/947876 [16:00:06] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1600). Please do the needful. [16:00:06] dancy: A patch you scheduled for Puppet request window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:12] o/ [16:00:19] dancy: hey, really sorry to miss you on Tuesday, I was out sick [16:00:27] No problem! [16:00:32] sorry to hear you were ill! [16:00:36] (03CR) 10RLazarus: [C: 03+2] Revert "logspam.pl: Filter out some persistent noise" [puppet] - 10https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: 10Bartosz Dziewoński) [16:02:38] (03CR) 10Andrew Bogott: [C: 03+2] Remove yet more unneeded config from cloudbackup200x [puppet] - 10https://gerrit.wikimedia.org/r/947876 (owner: 10Andrew Bogott) [16:02:39] dancy: manual puppet run on the mwlog hosts, right? [16:02:55] Yes please. [16:04:52] dancy: done, have a look [16:09:10] (03PS1) 10Btullis: Use the libmysql-java component on bullseye as well [puppet] - 10https://gerrit.wikimedia.org/r/947880 (https://phabricator.wikimedia.org/T329363) [16:09:12] (03PS1) 10Btullis: Use the libmariadb-java connector for sqoop [puppet] - 10https://gerrit.wikimedia.org/r/947881 (https://phabricator.wikimedia.org/T329363) [16:09:54] (03CR) 10Btullis: [C: 03+2] Create component/libmysql-java for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/947857 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [16:10:16] rzl: Everything still works. Thanks! 
[16:11:01] (03PS3) 10JMeybohm: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) [16:11:05] 👍 [16:13:27] (03CR) 10Eevans: [C: 03+2] admin: add darthmon to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/946632 (https://phabricator.wikimedia.org/T342968) (owner: 10Eevans) [16:14:47] (03CR) 10Ssingh: [C: 03+2] wikimedia.org: update glue record for ns2 [dns] - 10https://gerrit.wikimedia.org/r/947807 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh) [16:15:05] (03CR) 10Fabfur: [C: 03+1] "seems good" [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh) [16:15:19] !log running authdns-update to update ns2 and point it to nsa.wikimedia.org [16:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10Eevans) Hi @darthmon_wmde, this should now be complete. I'll close the issue, but don't hesitate to reopen if you have any issues! 
[16:23:33] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:38] (03PS1) 10Sohom Datta: Enable EditInSequence on all wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 [16:24:10] (03PS5) 10JHathaway: Enforce using a node regex without the wmnet tld [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 [16:26:20] (03PS2) 10Btullis: Use the libmariadb-java connector for sqoop [puppet] - 10https://gerrit.wikimedia.org/r/947881 (https://phabricator.wikimedia.org/T329363) [16:32:26] (03PS2) 10Btullis: Use the libmysql-java component on bullseye as well [puppet] - 10https://gerrit.wikimedia.org/r/947880 (https://phabricator.wikimedia.org/T329363) [16:32:28] (03PS3) 10Btullis: Use the libmariadb-java connector for sqoop [puppet] - 10https://gerrit.wikimedia.org/r/947881 (https://phabricator.wikimedia.org/T329363) [16:32:30] (03PS1) 10Eevans: admin: update ssh key for user adri [puppet] - 10https://gerrit.wikimedia.org/r/947884 (https://phabricator.wikimedia.org/T342969) [16:34:08] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42827/console" [puppet] - 10https://gerrit.wikimedia.org/r/947880 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [16:35:16] (03CR) 10Btullis: [V: 03+1 C: 03+2] Use the libmysql-java component on bullseye as well [puppet] - 10https://gerrit.wikimedia.org/r/947880 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [16:37:08] (03CR) 10Eevans: [C: 03+2] admin: update ssh key for user adri [puppet] - 10https://gerrit.wikimedia.org/r/947884 
(https://phabricator.wikimedia.org/T342969) (owner: 10Eevans) [16:38:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10Eevans) 05Open→03Resolved Done! [16:39:36] (03PS1) 10Andrew Bogott: eqiad cloudceph config: add cloudcontrol1005.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/947886 (https://phabricator.wikimedia.org/T341495) [16:40:03] (03CR) 10JHathaway: "@jbond I think this is ready for inclusion, if you could help me with pushing a release that would be much appreciated!" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 (owner: 10JHathaway) [16:40:08] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/947886 (https://phabricator.wikimedia.org/T341495) (owner: 10Andrew Bogott) [16:40:40] (03CR) 10Andrew Bogott: [C: 03+2] eqiad cloudceph config: add cloudcontrol1005.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/947886 (https://phabricator.wikimedia.org/T341495) (owner: 10Andrew Bogott) [16:41:31] (03PS1) 10Ayounsi: drmrs: advertise 198.35.27.0/24 [homer/public] - 10https://gerrit.wikimedia.org/r/947887 [16:42:24] (03CR) 10Cathal Mooney: [C: 03+1] drmrs: advertise 198.35.27.0/24 [homer/public] - 10https://gerrit.wikimedia.org/r/947887 (owner: 10Ayounsi) [16:42:28] (03CR) 10Ayounsi: [C: 03+2] drmrs: advertise 198.35.27.0/24 [homer/public] - 10https://gerrit.wikimedia.org/r/947887 (owner: 10Ayounsi) [16:42:33] (03CR) 10Ssingh: [C: 03+1] drmrs: advertise 198.35.27.0/24 [homer/public] - 10https://gerrit.wikimedia.org/r/947887 (owner: 10Ayounsi) [16:44:47] (03PS2) 10Eevans: admin: add roti to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/947022 (https://phabricator.wikimedia.org/T342972) [16:45:13] (03CR) 10Ssingh: [V: 03+1] bird: create /etc/bird without relying on postint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh) [16:46:04] 
(03CR) 10Eevans: [C: 03+2] admin: add roti to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/947022 (https://phabricator.wikimedia.org/T342972) (owner: 10Eevans) [16:48:26] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10Eevans) 05Open→03Resolved Hi @roti_WMDE, this should now be complete. I am closing the ticket, but don't hesitate to reopen if you have any problems. [16:48:48] (03PS2) 10Ssingh: bird: add dependency for bird.conf on bird2 package [puppet] - 10https://gerrit.wikimedia.org/r/947843 [16:49:10] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:05] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10Eevans) 05Open→03Resolved [16:50:14] (03CR) 10Btullis: [C: 03+2] Re-enable the gobblin timers on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/947812 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [16:50:30] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42828/console" [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh) [16:51:19] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10Eevans) a:05Eevans→03Tsevener [16:53:10] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: 
dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:33] (03PS8) 10JHathaway: site.pp: Drop top level domain names: .wmnet .org [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T342806) [16:57:50] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T342806) (owner: 10JHathaway) [16:59:45] (03PS9) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [17:00:04] bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1700). 
[17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1700) [17:00:28] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:18] (03PS10) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [17:03:22] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:54] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [17:06:13] (03CR) 10Jbond: [C: 03+1] Enforce using a node regex without the wmnet tld (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 (owner: 10JHathaway) [17:07:32] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,fstrim.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:34] RECOVERY - cinder-volume process on cloudcontrol1005 is OK: 
PROCS OK: 2 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:08:53] 10SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org pages should have a "who to contact" link - https://phabricator.wikimedia.org/T344000 (10Legoktm) > Nobody knew the answer. I find this hard to believe given we've worked with multiple functionaries on different list issues. In any case, you found the rig... [17:12:51] (03CR) 10Majavah: replica_cnf_api: add envvars backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [17:18:00] (03CR) 10Ayounsi: [C: 03+1] bird: add dependency for bird.conf on bird2 package [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh) [17:19:23] (03CR) 10Ahmon Dancy: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/947832 (https://phabricator.wikimedia.org/T343975) (owner: 10Jelto) [17:19:38] (03CR) 10Ssingh: [V: 03+1 C: 03+2] bird: add dependency for bird.conf on bird2 package [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh) [17:21:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6002.drmrs.wmnet with OS bookworm [17:21:45] (03PS11) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [17:25:58] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:26:02] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:26:08] expected ^ [17:43:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [17:46:22] !log sukhe@cumin2002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [17:48:10] (03CR) 10Muehlenhoff: [C: 03+1] "Awesome :-)" [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [17:56:11] PROBLEM - SSH on config-master2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:57:13] RECOVERY - SSH on config-master2001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:05:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:06:01] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:06:19] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:06:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [18:06:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [18:06:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T342617)', diff saved to https://phabricator.wikimedia.org/P50426 and previous config saved to /var/cache/conftool/dbconfig/20230810-180656-ladsgroup.json [18:06:59] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:07:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:08:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 
bytes in 4.438 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:08:37] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:09:45] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:10:25] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:10:40] (03PS1) 10Urbanecm: ltwiki: Disable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947903 [18:10:56] jouncebot: nowandnext [18:10:56] No deployments scheduled for the next 1 hour(s) and 49 minute(s) [18:10:56] In 1 hour(s) and 49 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T2000) [18:11:19] (03CR) 10CI reject: [V: 04-1] ltwiki: Disable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947903 (owner: 10Urbanecm) [18:12:01] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6002.drmrs.wmnet with OS bookworm [18:12:38] (03PS2) 10Urbanecm: ltwiki: Disable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947903 [18:14:01] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:11] (03PS3) 10Urbanecm: ltwiki: Disable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947903 [18:14:19] (03CR) 
10Urbanecm: [C: 03+2] ltwiki: Disable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947903 (owner: 10Urbanecm) [18:15:11] (03Merged) 10jenkins-bot: ltwiki: Disable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947903 (owner: 10Urbanecm) [18:15:37] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:947903|ltwiki: Disable Growth features]] [18:16:33] (JobUnavailable) firing: (2) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:17:08] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:947903|ltwiki: Disable Growth features]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [18:18:57] !log urbanecm@deploy1002 urbanecm: Continuing with sync [18:21:32] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2007.codfw.wmnet with OS bullseye [18:24:25] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:43] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:947903|ltwiki: Disable Growth features]] (duration: 10m 05s) [18:26:01] * urbanecm done [18:26:22] (03PS1) 10Legoktm: admin: Temporarily disable legoktm's access [puppet] - 10https://gerrit.wikimedia.org/r/947905 [18:35:13] (03CR) 10Ssingh: [C: 03+1] admin: Temporarily disable legoktm's access [puppet] - 
10https://gerrit.wikimedia.org/r/947905 (owner: 10Legoktm) [18:36:44] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T342806) (owner: 10JHathaway) [18:38:22] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:40:27] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2007.codfw.wmnet with reason: host reimage [18:43:26] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:43:43] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2007.codfw.wmnet with reason: host reimage [18:46:10] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:57] the uncommitted DNS changes are for the ganeti [18:47:00] +ganeti02 1H IN A 10.80.1.18 [18:47:03] +18 1H IN PTR ganeti02.svc.esams.wmnet. 
[18:47:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:55:35] !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@4312d99]: (no justification provided) [18:55:56] !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@4312d99]: (no justification provided) (duration: 00m 20s) [18:59:10] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:06] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:03:04] sukhe: ganeti02 is fine to merge, just prep work for the upcoming knams installation, Cathal created it earlier [19:08:28] (03PS5) 10AOkoth: vrts: add test VM to site [puppet] - 10https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027) [19:13:00] (03PS1) 10Bking: query_service: install git-fat [puppet] - 10https://gerrit.wikimedia.org/r/947928 (https://phabricator.wikimedia.org/T331300) [19:14:42] moritzm: ok merging [19:14:56] !log sukhe@cumin2002 START - Cookbook 
sre.dns.netbox [19:15:45] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/947928 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [19:16:49] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge ganeti changes - sukhe@cumin2002" [19:18:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge ganeti changes - sukhe@cumin2002" [19:18:34] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:19:04] (03PS2) 10Bking: query_service: install git-fat [puppet] - 10https://gerrit.wikimedia.org/r/947928 (https://phabricator.wikimedia.org/T331300) [19:22:11] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/947928 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [19:22:34] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:18] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:24:18] !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@b5a1d04]: (no justification provided) [19:24:28] !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@b5a1d04]: (no justification provided) (duration: 00m 09s) [19:26:32] (03CR) 10Ryan Kemper: [C: 03+1] query_service: install git-fat 
[puppet] - 10https://gerrit.wikimedia.org/r/947928 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [19:26:37] (03CR) 10Bking: [C: 03+2] query_service: install git-fat [puppet] - 10https://gerrit.wikimedia.org/r/947928 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [19:28:18] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:28:35] (03CR) 10JHathaway: "@jbond this is ready to merge, if you could take another pass, that would be appreciated!" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T342806) (owner: 10JHathaway) [19:29:45] (03PS1) 10Bking: wdqs.data-transfer: ensure data_loaded file is created [cookbooks] - 10https://gerrit.wikimedia.org/r/947930 (https://phabricator.wikimedia.org/T331300) [19:30:10] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:34] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:32:18] (03PS2) 10Bking: wdqs.data-transfer: ensure data_loaded file is created [cookbooks] - 10https://gerrit.wikimedia.org/r/947930 (https://phabricator.wikimedia.org/T331300) [19:32:25] (03CR) 10Urbanecm: [C: 04-1] [tests] Ensure each config has at most one value per wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947365 (owner: 10Urbanecm) [19:33:12] 
PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:18] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:37:28] PROBLEM - Disk space on cloudbackup2002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=93%): /tmp 0 MB (0% inode=93%): /var/tmp 0 MB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudbackup2002&var-datasource=codfw+prometheus/ops [19:38:25] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:43:18] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:49:25] (03PS1) 10Majavah: Disable access to grafana-labs/cloud.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/947933 (https://phabricator.wikimedia.org/T307465) [19:52:08] (03PS2) 10Majavah: Disable access to grafana-labs/cloud.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/947933 (https://phabricator.wikimedia.org/T307465) [19:53:31] (03CR) 10Majavah: [V: 03+1] "PCC 
SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42830/console" [puppet] - 10https://gerrit.wikimedia.org/r/947933 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [19:54:32] (03PS3) 10Majavah: Disable access to grafana-labs/cloud.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/947933 (https://phabricator.wikimedia.org/T307465) [19:56:20] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42831/console" [puppet] - 10https://gerrit.wikimedia.org/r/947933 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [19:58:18] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:00:04] brennen and TheresNoTime: Time to snap out of that daydream and deploy UTC late backport and config training. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T2000). [20:01:29] (nothing to deploy) [20:01:34] (yay) [20:01:43] Enjoy your evening then [20:02:42] (03CR) 10Majavah: [tests] Ensure each config has at most one value per wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947365 (owner: 10Urbanecm) [20:03:01] urbanecm: did you get a chance to review the centralauth patch yet? 
I was hoping to backport that one today too [20:03:24] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:03:26] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:18] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:09:30] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:13:18] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:16:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul 
https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [20:18:18] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:18:43] 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T343715 (10bking) In the interest of moving this forward, I'm going to go ahead and start provisioning these VMs. If there is a resource shortage in CODFW (or o... [20:19:02] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:23:25] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:28:18] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:33:18] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:18] 
(ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:34:03] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:34:07] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: f1a6177 [20:34:23] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: f1a6177 (duration: 00m 16s) [20:37:24] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: f1a6177 [20:38:07] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: f1a6177 (duration: 00m 42s) [20:38:24] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:39:09] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10Eevans) [20:40:56] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:41:44] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10Eevans) [20:42:56] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,fstrim.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:52:20] 
PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:59:42] RECOVERY - Disk space on cloudbackup2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudbackup2002&var-datasource=codfw+prometheus/ops [21:02:50] taavi: sorry, not yet. I'll look in 20 mins. [21:03:35] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:06:24] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10Eevans) SSH key verified against [[ https://meta.wikimedia.org/w/index.php?title=User:Ricki_Jay_(WMDE)&oldid=25435044 | https://meta.wikimedia.org/w/index.php?title=User:Ricki... [21:07:23] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10Eevans) [21:08:24] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:08:40] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10Eevans) @KFrancis can you confirm we have an NDA on file? 
[21:13:18] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:18:18] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:18:35] (03PS1) 10Cathal Mooney: Depool esams for duration of esams -> knams migration [dns] - 10https://gerrit.wikimedia.org/r/947945 (https://phabricator.wikimedia.org/T329219) [21:21:44] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2007.codfw.wmnet with OS bullseye [21:22:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50428 and previous config saved to /var/cache/conftool/dbconfig/20230810-212241-ladsgroup.json [21:22:44] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [21:33:10] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:18] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:37:47] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P50429 and previous config saved to /var/cache/conftool/dbconfig/20230810-213747-ladsgroup.json [21:38:18] (ConfdResourceFailed) resolved: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:39:21] (ConfdResourceFailed) firing: (64) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:40:51] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10Tsevener) @Eevans Here you go, thanks! https://www.mediawiki.org/wiki/User:TSevener_(WMF) [21:44:02] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:45:24] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:32] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10Eevans) a:05Tsevener→03Eevans [21:49:26] (ConfdResourceFailed) firing: (223) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - 
https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:52:28] (03PS1) 10Eevans: admin: add user tsev to group restricted [puppet] - 10https://gerrit.wikimedia.org/r/947957 (https://phabricator.wikimedia.org/T343596) [21:52:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P50430 and previous config saved to /var/cache/conftool/dbconfig/20230810-215253-ladsgroup.json [21:54:02] (ConfdResourceFailed) firing: (223) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:54:29] (03PS1) 10Urbanecm: GlobalRenameUser: Ensure old username is in canonical form [extensions/CentralAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947910 (https://phabricator.wikimedia.org/T343958) [21:54:39] jouncebot: nowandnext [21:54:40] No deployments scheduled for the next 8 hour(s) and 5 minute(s) [21:54:40] In 8 hour(s) and 5 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230811T0600) [21:54:46] (03CR) 10Urbanecm: [C: 03+2] GlobalRenameUser: Ensure old username is in canonical form [extensions/CentralAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947910 (https://phabricator.wikimedia.org/T343958) (owner: 10Urbanecm) [21:55:01] taavi: i'm backporting it [21:55:08] thanks [21:55:14] thanks for writing the fix! 
[21:56:24] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:02] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:59:47] (03Merged) 10jenkins-bot: GlobalRenameUser: Ensure old username is in canonical form [extensions/CentralAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947910 (https://phabricator.wikimedia.org/T343958) (owner: 10Urbanecm) [22:00:05] that was quick [22:00:35] (well, it's CA) [22:00:36] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:947910|GlobalRenameUser: Ensure old username is in canonical form (T343958)]] [22:00:46] T343958: Renaming one account multiple times creates duplicate global accounts - https://phabricator.wikimedia.org/T343958 [22:00:59] it does not run the gate, that's the normal CI speed for it :P [22:01:06] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:01:52] yeah, it's CA :)) [22:02:08] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:947910|GlobalRenameUser: Ensure old username is in canonical form (T343958)]] synced to the testservers 
mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [22:03:52] !log urbanecm@deploy1002 urbanecm: Continuing with sync [22:08:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50431 and previous config saved to /var/cache/conftool/dbconfig/20230810-220759-ladsgroup.json [22:08:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [22:08:06] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [22:08:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [22:08:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T342617)', diff saved to https://phabricator.wikimedia.org/P50432 and previous config saved to /var/cache/conftool/dbconfig/20230810-220820-ladsgroup.json [22:10:24] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:947910|GlobalRenameUser: Ensure old username is in canonical form (T343958)]] (duration: 09m 48s) [22:10:28] T343958: Renaming one account multiple times creates duplicate global accounts - https://phabricator.wikimedia.org/T343958 [22:12:13] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:14:16] * urbanecm done [22:15:26] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:34] (JobUnavailable) 
firing: (2) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:26:04] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:11] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:34:01] (03PS1) 10BCornwall: Release 9.2.1-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) [22:34:02] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:34:32] (03CR) 10CI reject: [V: 04-1] Release 9.2.1-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [22:35:10] (03PS2) 10BCornwall: Release 9.2.1-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) [22:38:26] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: 
dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:44:07] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:44:23] (ConfdResourceFailed) resolved: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:45:20] (ConfdResourceFailed) firing: (64) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:47:54] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [22:48:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [22:49:02] (ConfdResourceFailed) firing: (223) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:49:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [22:49:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [22:49:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [22:49:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [22:50:32] PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f5a7ff82280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [22:50:32] org/wiki/Search%23Administration [22:50:44] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:52:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [22:52:38] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: 
dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:52:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [22:52:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [22:53:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [22:53:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [22:53:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [22:53:46] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:02] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:55:10] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, 
discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 619, active_shards: 1421, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [22:55:10] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:55:35] !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@ff0a21b]: (no justification provided) [22:55:55] !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@ff0a21b]: (no justification provided) (duration: 00m 20s) [22:59:02] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:04:02] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:05:17] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:09:02] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:10:46] (03CR) 10BCornwall: [V: 03+1] "lintian is happy; piuparts is giving me trouble for something unrelated." 
[debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [23:13:01] (03CR) 10CI reject: [V: 04-1] Release 9.2.1-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [23:19:02] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:24:09] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:25:37] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:29:02] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:30:17] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:39:02] (ConfdResourceFailed) resolved: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - 
https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:39:32] (ConfdResourceFailed) firing: (64) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:40:17] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:44:44] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:46:20] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,fstrim.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:04] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [23:48:17] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/947393 [23:49:32] (ConfdResourceFailed) firing: (446) confd resource 
_srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:50:22] jinxer-wm: hush [23:52:36] (03PS1) 10Krinkle: clientError: Investigate when mw.util is compromised by third-party script [extensions/WikimediaEvents] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947912 (https://phabricator.wikimedia.org/T343944) [23:53:46] (03PS1) 10Krinkle: mediawiki.util: Investigate when mw.util is compromised by third-party script [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947913 [23:53:59] (03PS2) 10Krinkle: mediawiki.util: Investigate when mw.util is compromised by third-party script [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947913 (https://phabricator.wikimedia.org/T343944) [23:54:41] (03CR) 10CI reject: [V: 04-1] clientError: Investigate when mw.util is compromised by third-party script [extensions/WikimediaEvents] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947912 (https://phabricator.wikimedia.org/T343944) (owner: 10Krinkle) [23:54:47] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:55:17] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:56:09] (03PS2) 10Krinkle: clientError: Investigate when mw.util is compromised by third-party script [extensions/WikimediaEvents] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947912 (https://phabricator.wikimedia.org/T343944)