[00:03:16] <icinga-wm>	 PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2023-08-22 00:00:03 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:05:32] <jinxer-wm>	 (DatasourceError) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:08:50] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:10:32] <jinxer-wm>	 (DatasourceError) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:16:40] <icinga-wm>	 PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2023-08-22 00:00:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:20:36] <icinga-wm>	 PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2023-08-22 00:00:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:31:36] <icinga-wm>	 PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2023-08-22 00:00:03 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:39:20] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/953490
[00:39:26] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/953490 (owner: TrainBranchBot)
[00:43:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be2003']
[00:50:50] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['moss-be2003']
[00:52:18] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (Jhancock.wm) Resolved→Open
[00:54:50] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye
[00:54:57] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye
[00:55:00] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/953490 (owner: TrainBranchBot)
[01:25:46] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757 (nshahquinn-wmf) FYI, Urllib3 version 2, released in April 2023, [removed the fallback from subjectAltName to commonName](https://github.com/urllib3/urllib3/blob/main/CHA...
[01:36:23] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be2003']
[01:37:46] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[01:42:37] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['moss-be2003']
[01:43:14] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye
[01:43:21] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye
[01:43:23] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be2003.codfw.wmnet with OS bullseye
[01:43:29] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye executed with...
[01:43:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye
[01:44:06] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye
[01:44:07] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be2003.codfw.wmnet with OS bullseye
[01:44:14] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye executed with...
[01:50:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2039.codfw.wmnet with OS bullseye
[01:50:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2038.codfw.wmnet with OS bullseye
[01:50:39] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2037.codfw.wmnet with OS bullseye
[01:50:43] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye
[01:50:47] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye
[01:50:49] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2037.codfw.wmnet with OS bullseye
[02:08:57] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:51] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2037.codfw.wmnet with reason: host reimage
[02:31:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2037.codfw.wmnet with reason: host reimage
[02:33:57] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:15] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Jhancock.wm)
[02:45:52] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[03:01:42] <jinxer-wm>	 (SystemdUnitFailed) firing: elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:31:10] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:26] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:50:41] <wikibugs>	 (PS1) Anzx: tlywiki: Add logos [mediawiki-config] - https://gerrit.wikimedia.org/r/953750 (https://phabricator.wikimedia.org/T345316)
[03:52:55] <wikibugs>	 (Abandoned) Anzx: tlywiki: Add logos [mediawiki-config] - https://gerrit.wikimedia.org/r/953750 (https://phabricator.wikimedia.org/T345316) (owner: Anzx)
[04:00:16] <wikibugs>	 (PS1) Anzx: tlywiki: add logos [mediawiki-config] - https://gerrit.wikimedia.org/r/953751 (https://phabricator.wikimedia.org/T345316)
[04:28:42] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:29:26] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:47:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: Repooling after maintenance ', diff saved to https://phabricator.wikimedia.org/P52120 and previous config saved to /var/cache/conftool/dbconfig/20230831-044746-root.json
[04:50:30] <wikibugs>	 (PS1) Marostegui: Revert "db1201: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/953651
[04:52:46] <wikibugs>	 (CR) Winston Sung: "This change is ready for review." (8 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) (owner: Winston Sung)
[04:54:03] <wikibugs>	 (PS4) Winston Sung: SiteMatrix config: Remove deprecated language codes from the list [mediawiki-config] - https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035)
[04:54:40] <wikibugs>	 (CR) CI reject: [V: -1] SiteMatrix config: Remove deprecated language codes from the list [mediawiki-config] - https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) (owner: Winston Sung)
[04:55:30] <wikibugs>	 (PS5) Winston Sung: SiteMatrix config: Remove deprecated language codes from the list [mediawiki-config] - https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035)
[04:56:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s6 T345223
[04:56:45] <stashbot>	 T345223: Switchover s6 master (db1131 -> db1173) - https://phabricator.wikimedia.org/T345223
[04:57:09] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s6 T345223
[04:57:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1173 with weight 0 T345223', diff saved to https://phabricator.wikimedia.org/P52121 and previous config saved to /var/cache/conftool/dbconfig/20230831-045719-marostegui.json
[04:59:32] <wikibugs>	 (PS4) KartikMistry: Enable MinT translation service for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/953216 (owner: Abijeet Patro)
[05:01:24] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1173 to s6 master [puppet] - https://gerrit.wikimedia.org/r/953487 (https://phabricator.wikimedia.org/T345223) (owner: Gerrit maintenance bot)
[05:02:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: Repooling after maintenance ', diff saved to https://phabricator.wikimedia.org/P52122 and previous config saved to /var/cache/conftool/dbconfig/20230831-050250-root.json
[05:16:33] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2023-08-29-191442-production [deployment-charts] - https://gerrit.wikimedia.org/r/952568 (https://phabricator.wikimedia.org/T345170)
[05:16:35] <wikibugs>	 (PS2) Marostegui: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/953488 (https://phabricator.wikimedia.org/T345223) (owner: Gerrit maintenance bot)
[05:17:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: Repooling after maintenance ', diff saved to https://phabricator.wikimedia.org/P52123 and previous config saved to /var/cache/conftool/dbconfig/20230831-051755-root.json
[05:28:11] <marostegui>	 !log Starting s6 eqiad failover from db1131 to db1173 - T345223
[05:28:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:17] <stashbot>	 T345223: Switchover s6 master (db1131 -> db1173) - https://phabricator.wikimedia.org/T345223
[05:28:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T345223', diff saved to https://phabricator.wikimedia.org/P52124 and previous config saved to /var/cache/conftool/dbconfig/20230831-052825-marostegui.json
[05:28:29] <stashbot>	 marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[05:28:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1173 to s6 primary and set section read-write T345223', diff saved to https://phabricator.wikimedia.org/P52125 and previous config saved to /var/cache/conftool/dbconfig/20230831-052852-marostegui.json
[05:30:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131 T345223', diff saved to https://phabricator.wikimedia.org/P52126 and previous config saved to /var/cache/conftool/dbconfig/20230831-053035-root.json
[05:31:07] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1201: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/953651 (owner: Marostegui)
[05:32:57] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/953488 (https://phabricator.wikimedia.org/T345223) (owner: Gerrit maintenance bot)
[05:33:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: Repooling after maintenance ', diff saved to https://phabricator.wikimedia.org/P52127 and previous config saved to /var/cache/conftool/dbconfig/20230831-053300-root.json
[05:34:45] <wikibugs>	 (PS1) Marostegui: db1131: Upgrade to 10.6 [puppet] - https://gerrit.wikimedia.org/r/953753
[05:35:22] <wikibugs>	 (CR) Marostegui: [C: +2] db1131: Upgrade to 10.6 [puppet] - https://gerrit.wikimedia.org/r/953753 (owner: Marostegui)
[05:37:46] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[05:43:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1182 T344309', diff saved to https://phabricator.wikimedia.org/P52128 and previous config saved to /var/cache/conftool/dbconfig/20230831-054305-root.json
[05:43:13] <stashbot>	 T344309: Compile and package MariaDB 11.0.3 10.6.15, 10.4.31  - https://phabricator.wikimedia.org/T344309
[05:43:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52129 and previous config saved to /var/cache/conftool/dbconfig/20230831-054314-root.json
[05:45:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52130 and previous config saved to /var/cache/conftool/dbconfig/20230831-054542-root.json
[05:48:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: Repooling after maintenance ', diff saved to https://phabricator.wikimedia.org/P52131 and previous config saved to /var/cache/conftool/dbconfig/20230831-054805-root.json
[05:58:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52132 and previous config saved to /var/cache/conftool/dbconfig/20230831-055819-root.json
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T0600).
[06:00:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52133 and previous config saved to /var/cache/conftool/dbconfig/20230831-060047-root.json
[06:13:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52134 and previous config saved to /var/cache/conftool/dbconfig/20230831-061324-root.json
[06:15:44] <wikibugs>	 (CR) Volans: [C: -1] "LGTM, just one small missing bit and a couple of suggestions inline" [puppet] - https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney)
[06:15:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52135 and previous config saved to /var/cache/conftool/dbconfig/20230831-061551-root.json
[06:22:37] <wikibugs>	 (PS1) KartikMistry: Enable Section and Content Translation in 7 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/953756 (https://phabricator.wikimedia.org/T343211)
[06:28:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52136 and previous config saved to /var/cache/conftool/dbconfig/20230831-062829-root.json
[06:30:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52137 and previous config saved to /var/cache/conftool/dbconfig/20230831-063056-root.json
[06:33:57] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:41:07] <wikibugs>	 (CR) Filippo Giunchedi: mesh: add tracing support (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[06:42:58] <wikibugs>	 (PS4) Filippo Giunchedi: mesh: add tracing support [deployment-charts] - https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563)
[06:43:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52138 and previous config saved to /var/cache/conftool/dbconfig/20230831-064333-root.json
[06:46:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52139 and previous config saved to /var/cache/conftool/dbconfig/20230831-064601-root.json
[06:48:05] <wikibugs>	 (PS5) Filippo Giunchedi: mesh: add tracing support [deployment-charts] - https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563)
[06:48:11] <wikibugs>	 (CR) Filippo Giunchedi: mesh: add tracing support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[06:53:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp1002.wikimedia.org
[06:57:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1002.wikimedia.org
[06:58:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52140 and previous config saved to /var/cache/conftool/dbconfig/20230831-065838-root.json
[07:00:05] <jouncebot>	 Amir1, apergos, and jnuche: May I have your attention please! UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T0700)
[07:00:05] <jouncebot>	 thed and kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:15] * kart_ is here
[07:00:40] <apergos>	 morning! we have no trainees signed up today but two patches to go. kart_ I assume you are self-deploy? I don't know where thedj is, not in channel at the moment. so kart_ you'll go first if that's ok. 
[07:00:47] <Amir1>	 kart_: you can self serve?
[07:00:57] <kart_>	 Amir1: yeah
[07:01:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52141 and previous config saved to /var/cache/conftool/dbconfig/20230831-070105-root.json
[07:01:40] <Amir1>	 Have fun!
[07:01:43] <jinxer-wm>	 (SystemdUnitFailed) firing: elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:02:19] <hashar>	 o/ :]
[07:02:34] <hashar>	  apergos are the trainings always on Thursday?
[07:02:39] <apergos>	 yes they are
[07:02:49] <apergos>	 it's a fixed slot, see the deployment calendar :-)
[07:03:41] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/953216 (owner: Abijeet Patro)
[07:03:41] <hashar>	 great
[07:03:57] <apergos>	 there's a workboard to request a training, if you know someone interested
[07:03:59] <hashar>	 Tyler asked me to participate so I will join next week's session 8)
[07:04:21] <wikibugs>	 (Merged) jenkins-bot: Enable MinT translation service for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/953216 (owner: Abijeet Patro)
[07:04:27] <apergos>	 oh! you're interested :-) well yes, sounds great. make it official by making a request on that phab board if you like
[07:04:41] <apergos>	 https://phabricator.wikimedia.org/project/board/5265/
[07:05:25] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:953216|Enable MinT translation service for testwiki]]
[07:05:45] <apergos>	 ah there we go, I was wondering what was happening :-)
[07:07:00] <logmsgbot>	 !log kartik@deploy1002 abi and kartik: Backport for [[gerrit:953216|Enable MinT translation service for testwiki]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:09:50] <logmsgbot>	 !log kartik@deploy1002 abi and kartik: Continuing with sync
[07:12:48] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Revert "confd: Make confd_prometheus_metrics.py 3.4-compatible" [puppet] - https://gerrit.wikimedia.org/r/953238 (owner: Muehlenhoff)
[07:13:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52142 and previous config saved to /var/cache/conftool/dbconfig/20230831-071343-root.json
[07:15:44] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:953216|Enable MinT translation service for testwiki]] (duration: 10m 18s)
[07:16:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52143 and previous config saved to /var/cache/conftool/dbconfig/20230831-071610-root.json
[07:19:10] <kart_>	 apergos: I'm done with config change deployment.
[07:19:19] <apergos>	 great! 
[07:19:31] <apergos>	 still no thedj unfortunately
[07:19:40] <kart_>	 :/
[07:20:09] <apergos>	 if anyone has another way to reach them, I'll remain here with the window open for another 15 minutes or so
[07:20:58] <wikibugs>	 (PS10) Slyngshede: Disable user creation on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: Andrew Bogott)
[07:21:38] <wikibugs>	 (PS2) Muehlenhoff: Openstack: remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/952848
[07:23:44] <wikibugs>	 (PS3) Muehlenhoff: Openstack: remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/952848
[07:24:17] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:25:04] <wikibugs>	 (CR) Muehlenhoff: Openstack: remove support for Stretch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952848 (owner: Muehlenhoff)
[07:25:53] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43070/console" [puppet] - https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: Dduvall)
[07:27:47] <wikibugs>	 (CR) Muehlenhoff: [C: +2] local_dev: Update image [puppet] - https://gerrit.wikimedia.org/r/953205 (owner: Muehlenhoff)
[07:28:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52144 and previous config saved to /var/cache/conftool/dbconfig/20230831-072848-root.json
[07:29:28] <wikibugs>	 (CR) Muehlenhoff: Stop building stretch images and update monitoring for the docker registry (1 comment) [puppet] - https://gerrit.wikimedia.org/r/953201 (owner: Muehlenhoff)
[07:29:30] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Stop building stretch images and update monitoring for the docker registry [puppet] - https://gerrit.wikimedia.org/r/953201 (owner: Muehlenhoff)
[07:30:14] <wikibugs>	 (CR) Slyngshede: [C: +1] "Look good, links point to the right locations." [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/953673 (https://phabricator.wikimedia.org/T338008) (owner: Muehlenhoff)
[07:31:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52145 and previous config saved to /var/cache/conftool/dbconfig/20230831-073115-root.json
[07:31:29] <wikibugs>	 (CR) Muehlenhoff: [V: +2 C: +2] Update links to create an account and password reset to point to Bitu [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/953673 (https://phabricator.wikimedia.org/T338008) (owner: Muehlenhoff)
[07:32:44] <wikibugs>	 (PS3) Slyngshede: C:idm:deployment link to runbook. [puppet] - https://gerrit.wikimedia.org/r/931879 (https://phabricator.wikimedia.org/T338008)
[07:33:03] <wikibugs>	 (PS4) Slyngshede: C:idm:deployment link to runbook. [puppet] - https://gerrit.wikimedia.org/r/931879 (https://phabricator.wikimedia.org/T338008)
[07:35:06] <wikibugs>	 (CR) CI reject: [V: -1] C:idm:deployment link to runbook. [puppet] - https://gerrit.wikimedia.org/r/931879 (https://phabricator.wikimedia.org/T338008) (owner: Slyngshede)
[07:37:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[07:37:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[07:37:08] <apergos>	 it looks like something came up for thedj, so hopefully they will reschedule, I'll close the window for today 
[07:37:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T343718)', diff saved to https://phabricator.wikimedia.org/P52146 and previous config saved to /var/cache/conftool/dbconfig/20230831-073713-ladsgroup.json
[07:37:19] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[07:37:34] <apergos>	 !log UTC morning backport and config window done 
[07:37:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[07:38:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[07:39:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T343718)', diff saved to https://phabricator.wikimedia.org/P52147 and previous config saved to /var/cache/conftool/dbconfig/20230831-073921-ladsgroup.json
[07:40:22] <wikibugs>	 (PS4) Jelto: gitlab: Support loading of local gems [puppet] - https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: Dduvall)
[07:44:08] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43071/console" [puppet] - https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: Dduvall)
[07:44:19] <wikibugs>	 (PS1) Muehlenhoff: Fix comments [puppet] - https://gerrit.wikimedia.org/r/953959
[07:44:59] <wikibugs>	 (PS2) Muehlenhoff: Fix comments [puppet] - https://gerrit.wikimedia.org/r/953959
[07:48:49] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (ayounsi) FYI there is now a pending diff for: ` [edit forwarding-options dhcp-relay] +    /* T337345 */ +    forward-snooped-clients non-...
[07:49:33] <wikibugs>	 (PS5) Slyngshede: C:idm:deployment link to runbook. [puppet] - https://gerrit.wikimedia.org/r/931879 (https://phabricator.wikimedia.org/T338008)
[07:50:49] <wikibugs>	 (PS1) Muehlenhoff: networking fact: Remove check for stretch [puppet] - https://gerrit.wikimedia.org/r/953960
[07:50:54] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fix comments [puppet] - https://gerrit.wikimedia.org/r/953959 (owner: Muehlenhoff)
[07:51:08] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad
[07:51:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] networking fact: Remove check for stretch [puppet] - 10https://gerrit.wikimedia.org/r/953960 (owner: 10Muehlenhoff)
[07:52:48] <wikibugs>	 (03PS2) 10Muehlenhoff: networking fact: Remove check for stretch [puppet] - 10https://gerrit.wikimedia.org/r/953960
[07:52:54] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1009.eqiad.wmnet
[07:54:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P52148 and previous config saved to /var/cache/conftool/dbconfig/20230831-075428-ladsgroup.json
[07:56:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[07:57:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[07:57:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T343718)', diff saved to https://phabricator.wikimedia.org/P52149 and previous config saved to /var/cache/conftool/dbconfig/20230831-075709-ladsgroup.json
[07:57:15] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[07:58:39] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "Looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott)
[08:00:56] <Amir1>	 jouncebot: nowandnext
[08:00:56] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 59 minute(s)
[08:00:56] <jouncebot>	 In 1 hour(s) and 59 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1000)
[08:00:56] <jouncebot>	 In 1 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1000)
[08:03:30] <Amir1>	 slyngs: we coordinate here
[08:03:40] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Disable user creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott)
[08:03:54] <logmsgbot>	 !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host snapshot1009.eqiad.wmnet
[08:03:58] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[08:04:08] <Amir1>	 slyngs: do you know about https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage
[08:04:14] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Fix cache_upload timeouts in single-backend sites [puppet] - 10https://gerrit.wikimedia.org/r/953700 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack)
[08:04:24] <wikibugs>	 (03Merged) 10jenkins-bot: Disable user creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott)
[08:04:32] <slyngs>	 Amir1: I do not
[08:04:53] <Amir1>	 you need to install a browser extension to test it when the patch arrives on the mwdebug hosts
[08:04:56] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:952209|Disable user creation on wikitech (T345226)]]
[08:05:01] <stashbot>	 T345226: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226
[08:05:29] <taavi>	 Amir1: wikitech is not compatible with that
[08:05:42] <Amir1>	 ah, yeah, okay
[08:05:57] <Amir1>	 I keep forgetting it's a special snowflake
[08:06:13] <Amir1>	 slyngs: scratch that, it doesn't work with that
[08:06:14] <slyngs>	 Hopefully removing the signup will push us towards it not being special
[08:06:28] <wikibugs>	 (03PS4) 10Vgutierrez: trafficserver: Set active timeouts to 1h in upload [puppet] - 10https://gerrit.wikimedia.org/r/953638 (https://phabricator.wikimedia.org/T341755)
[08:06:31] <taavi>	 yep, it's one step closer. still many to go though :/
[08:06:33] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and andrew: Backport for [[gerrit:952209|Disable user creation on wikitech (T345226)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[08:06:42] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and andrew: Continuing with sync
[08:06:58] <Amir1>	 what's the next blocker?
[08:07:16] <Amir1>	 also when are we going to remove labtestwiki?
[08:07:47] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1010.eqiad.wmnet
[08:08:17] <taavi>	 very good question. we'd need a replacement for its 2fa functionality I think, not sure if there are plans for idp/idm instances against the codfw1dev ldap cluster
[08:08:38] <Amir1>	 slyngs: the design :(
[08:08:43] <taavi>	 and the next blocker is SSH key management in IDM, that would let us undeploy OSM
[08:08:57] <Amir1>	 Open Street Map?
[08:09:05] <taavi>	 openstackmanager
[08:09:12] <Amir1>	 ah, that makes more sense
[08:09:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P52150 and previous config saved to /var/cache/conftool/dbconfig/20230831-080934-ladsgroup.json
[08:09:44] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Set active timeouts to 1h in upload [puppet] - 10https://gerrit.wikimedia.org/r/953638 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez)
[08:10:54] <slyngs>	 We're already working on the feature to remove openstackmanager
[08:11:20] <slyngs>	 Not much is left really, mostly SSH key management, which has been implemented but not enabled
[08:11:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/931879 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede)
[08:12:05] <slyngs>	 I'll create a new developer account and check if it's able to log in to wikitech
[08:12:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:12:51] <wikibugs>	 (03PS1) 10Ayounsi: gNMI: remove from cloudsw, add to cr [homer/public] - 10https://gerrit.wikimedia.org/r/953961 (https://phabricator.wikimedia.org/T326322)
[08:13:02] <taavi>	 is there a log of new accounts created via idm?
[08:13:30] <Amir1>	 if you manage to allow wikitech to be integrated with the rest of the fleet, I'll buy you eight beers in the next in-person, one of each incident we had because of wikitech being a special snowflake
[08:13:47] <Amir1>	 one *for
[08:13:50] <slyngs>	 taavi: Yes, right now moritzm and I are getting an email on all signups. It's also logged on the server in the application log
[08:14:18] <Amir1>	 maybe send it to logstash too
[08:14:20] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1010.eqiad.wmnet
[08:14:27] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] gNMI: remove from cloudsw, add to cr [homer/public] - 10https://gerrit.wikimedia.org/r/953961 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi)
[08:14:31] <Amir1>	 but we probably need something more performant 
[08:14:38] <Amir1>	 *permanent 
[08:15:00] <wikibugs>	 (03Merged) 10jenkins-bot: gNMI: remove from cloudsw, add to cr [homer/public] - 10https://gerrit.wikimedia.org/r/953961 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi)
[08:15:03] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:952209|Disable user creation on wikitech (T345226)]] (duration: 10m 06s)
[08:15:09] <Amir1>	 slyngs: done ^
[08:15:11] <stashbot>	 T345226: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226
[08:15:14] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1011.eqiad.wmnet
[08:15:56] <Amir1>	 it couldn't deploy it to snapshot1010.eqiad.wmnet, mw2287.codfw.wmnet and mw2285.codfw.wmnet
[08:16:27] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ores-extension: fix arwiki likelybad threshold [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953962 (https://phabricator.wikimedia.org/T345305)
[08:16:43] <slyngs>	 It worked.... AMAZING... I created a new account, promptly forgot the username and failed to log in, then recovered the username and now I can log in
[08:17:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T343718)', diff saved to https://phabricator.wikimedia.org/P52151 and previous config saved to /var/cache/conftool/dbconfig/20230831-081705-ladsgroup.json
[08:17:11] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[08:17:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:17:36] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] varnish: Increase send_timeout in upload [puppet] - 10https://gerrit.wikimedia.org/r/953678 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez)
[08:17:44] <slyngs>	 Amir1: If possible I'd like the beers spread out over time, I can't sleep after more than two beers, to horrors of growing old
[08:17:47] <slyngs>	 the
[08:17:56] <Amir1>	 haha, sure :D
[08:18:05] <taavi>	 slyngs: the gerrit sign up link needs updating it seems
[08:18:12] <taavi>	 I also just fixed a bunch of wikitech pages
[08:18:41] <Amir1>	 we probably need to send an email to wikitech and possibly a message in engineering-all 
[08:18:46] <taavi>	 yes please
[08:18:49] <slyngs>	 I'll ping hashar about gerrit, moritzm is fixing idp... And thank you for updating the wikitech pages. 
[08:19:08] <taavi>	 Amir1: do you know if we can customize the error message on https://wikitech.wikimedia.org/wiki/Special:CreateAccount without affecting other pages?
[08:19:16] <moritzm>	 yeah, the updated CAS package with links pointing to Bitu will go out later in the day
[08:19:19] <slyngs>	 Right, I'll do that now, and include a link to the patch, in case we need to revert
[08:19:34] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr3-ulsfo
[08:19:56] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1011.eqiad.wmnet
[08:20:20] <Amir1>	 taavi: would this help? https://wikitech.wikimedia.org/wiki/Special:CreateAccount/?uselang=qqx
[08:20:42] <slyngs>	 Who to email, just sre@... we need to hit the developers as well
[08:20:52] <taavi>	 wikitech-l?
[08:20:59] <Amir1>	 yeah, wikitech-l
[08:21:01] <slyngs>	 Yes, just remembered that :-)
[08:21:11] <taavi>	 Amir1: I guess I can change MediaWiki:permissionserrorstext-withaction, but I think that would affect all of the special pages and not just that specific one
[08:21:17] <taavi>	 unless I can use a magic word to vary the message?
[08:21:27] <vgutierrez>	 !log set send_timeout to 3620s in the upload cluster via cumin to avoid a varnish restart https://gerrit.wikimedia.org/r/c/operations/puppet/+/953678 - T341755
[08:21:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:32] <Amir1>	 we could probably but let's not
[08:21:32] <stashbot>	 T341755: Cannot download large (2GB) files with 10Mbps or slower network due to ATS timeout - https://phabricator.wikimedia.org/T341755
[08:23:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka main-eqiad cluster: Reboot kafka nodes
[08:23:28] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1012.eqiad.wmnet
[08:24:12] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-ulsfo
[08:24:15] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr4-ulsfo
[08:24:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T343718)', diff saved to https://phabricator.wikimedia.org/P52152 and previous config saved to /var/cache/conftool/dbconfig/20230831-082440-ladsgroup.json
[08:24:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[08:24:44] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) @jbond from Juniper, does it make sense? > “If the customer would like to use OIDC they enter in their token for us to use and authenticate. The vast majority of users sign...
[08:24:46] <hashar>	 slyngs: hi, please file whatever request in Phabricator against #gerrit :-)
[08:24:46] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[08:24:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[08:24:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:25:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:25:02] <slyngs>	 hashar: Will do, thank you
[08:25:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T343718)', diff saved to https://phabricator.wikimedia.org/P52153 and previous config saved to /var/cache/conftool/dbconfig/20230831-082508-ladsgroup.json
[08:25:30] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1030 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:25:43] <hashar>	 the sign up link is  https://wikitech.wikimedia.org/w/index.php?title=Special:CreateAccount&returnto=Gerrit/NewUser  and it is defined somewhere in operations/puppet under modules/gerrit 
[08:26:16] <hashar>	 and that URL is probably used in various on-wiki documentation (mw:Git and subpages come to mind)
[08:26:47] <slyngs>	 Ah, okay, I can do the patch and Phabricator task then
[08:27:17] <hashar>	 and potentially we could get Gerrit to migrate to OAUTH / SAML instead of talking to LDAP directly, but that is a side track :-)
[08:27:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T343718)', diff saved to https://phabricator.wikimedia.org/P52154 and previous config saved to /var/cache/conftool/dbconfig/20230831-082717-ladsgroup.json
[08:27:33] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] ores-extension: fix arwiki likelybad threshold [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953962 (https://phabricator.wikimedia.org/T345305) (owner: 10Ilias Sarantopoulos)
[08:27:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953962 (https://phabricator.wikimedia.org/T345305) (owner: 10Ilias Sarantopoulos)
[08:28:11] <wikibugs>	 (03Merged) 10jenkins-bot: ores-extension: fix arwiki likelybad threshold [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953962 (https://phabricator.wikimedia.org/T345305) (owner: 10Ilias Sarantopoulos)
[08:28:37] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:953962|ores-extension: fix arwiki likelybad threshold (T345305)]]
[08:28:42] <stashbot>	 T345305: MWException: Default '"soft"' is invalid for preference oresDamagingPref of most users - https://phabricator.wikimedia.org/T345305
[08:28:53] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr4-ulsfo
[08:28:55] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr1-eqiad
[08:28:56] <wikibugs>	 (03CR) 10David Caro: replica_cnf_api: add envvars backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[08:30:02] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1012.eqiad.wmnet
[08:30:14] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1013.eqiad.wmnet
[08:32:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P52155 and previous config saved to /var/cache/conftool/dbconfig/20230831-083211-ladsgroup.json
[08:33:33] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-eqiad
[08:33:35] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr1-esams
[08:33:36] <Amir1>	 08:31:33 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2023-08-31-082844-publish (ran as mwdeploy@kubernetes1008.eqiad.wmnet) returned [255]: ssh: connect to host kubernetes1008.eqiad.wmnet port 22: Connection timed out
[08:36:12] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and isaranto: Backport for [[gerrit:953962|ores-extension: fix arwiki likelybad threshold (T345305)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[08:36:18] <stashbot>	 T345305: MWException: Default '"soft"' is invalid for preference oresDamagingPref of most users - https://phabricator.wikimedia.org/T345305
[08:36:27] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1013.eqiad.wmnet
[08:36:54] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and isaranto: Continuing with sync
[08:37:04] <Amir1>	 confirmed it fixes the issue
[08:38:21] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-esams
[08:38:24] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-codfw
[08:38:37] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet
[08:38:42] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[08:39:20] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1014.eqiad.wmnet
[08:40:16] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: tune knative's container concurrency settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/953578 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey)
[08:40:39] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: Support loading of local gems [puppet] - 10https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall)
[08:40:48] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001"
[08:40:53] <wikibugs>	 (03PS1) 10Ayounsi: Enable GNMI on cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/953963 (https://phabricator.wikimedia.org/T316544)
[08:41:55] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001"
[08:41:55] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:42:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P52156 and previous config saved to /var/cache/conftool/dbconfig/20230831-084224-ladsgroup.json
[08:42:31] <wikibugs>	 (03PS1) 10Ayounsi: gNMI: collect data from core routers [puppet] - 10https://gerrit.wikimedia.org/r/953964 (https://phabricator.wikimedia.org/T326322)
[08:42:58] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-codfw
[08:43:01] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-drmrs
[08:44:51] <wikibugs>	 (03PS5) 10Cathal Mooney: Modify Juniper ZTP script used during initial provision [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485)
[08:45:54] <wikibugs>	 (03PS2) 10Ayounsi: gNMI: collect data from core routers [puppet] - 10https://gerrit.wikimedia.org/r/953964 (https://phabricator.wikimedia.org/T326322)
[08:46:37] <wikibugs>	 (03PS6) 10Cathal Mooney: Modify Juniper ZTP script used during initial provision [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485)
[08:47:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P52157 and previous config saved to /var/cache/conftool/dbconfig/20230831-084717-ladsgroup.json
[08:47:20] <claime>	 Amir1: The connection timed out probably because we're rebooting the k8s servers right now. That's the pre-pull of the image on all k8s hosts failing
[08:47:40] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-drmrs
[08:47:42] <Amir1>	 noted
[08:47:42] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-eqdfw
[08:47:50] <wikibugs>	 (03PS36) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[08:48:02] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953964 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi)
[08:48:38] <wikibugs>	 (03PS7) 10Volans: Modify Juniper ZTP script used during initial provision [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[08:48:58] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] C:idm:deployment link to runbook. [puppet] - 10https://gerrit.wikimedia.org/r/931879 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede)
[08:49:12] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[08:50:36] <logmsgbot>	 !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host snapshot1014.eqiad.wmnet
[08:50:41] <wikibugs>	 (03CR) 10JMeybohm: "Nice! But I think this is not how sextant works currently. AIUI it considers minor version changes incompatible/not backwards compatible a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953553 (owner: 10Alexandros Kosiaris)
[08:51:11] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[08:51:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[08:51:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953675 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[08:51:57] <wikibugs>	 (03CR) 10Cathal Mooney: Modify Juniper ZTP script used during initial provision (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[08:52:17] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqdfw
[08:52:19] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-eqiad
[08:52:28] <wikibugs>	 (03CR) 10Muehlenhoff: firewall: move conntrack logic to firewall module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond)
[08:52:30] <wikibugs>	 (03PS37) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[08:52:35] <wikibugs>	 (03CR) 10Volans: "PCC fails with:" [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[08:52:37] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1015.eqiad.wmnet
[08:54:58] <wikibugs>	 (03CR) 10Cathal Mooney: Modify Juniper ZTP script used during initial provision (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[08:56:02] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[08:56:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[08:56:52] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqiad
[08:56:55] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-eqord
[08:57:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P52158 and previous config saved to /var/cache/conftool/dbconfig/20230831-085731-ladsgroup.json
[08:57:33] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf,ops for arnaudb - https://phabricator.wikimedia.org/T345241 (10jcrespo) 05Open→03Resolved
[08:59:12] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] jaeger: Configure ingress using istio CRD [deployment-charts] - 10https://gerrit.wikimedia.org/r/953675 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[09:00:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757 (10jbond) >>! In T158757#9132594, @nshahquinn-wmf wrote: > FYI, Urllib3 version 2, released in April 2023, [removed the fallback from serverAltName to commonName](https://...
[09:01:28] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqord
[09:01:30] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-eqsin
[09:01:41] <wikibugs>	 (03Merged) 10jenkins-bot: jaeger: Configure ingress using istio CRD [deployment-charts] - 10https://gerrit.wikimedia.org/r/953675 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm)
[09:02:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T343718)', diff saved to https://phabricator.wikimedia.org/P52159 and previous config saved to /var/cache/conftool/dbconfig/20230831-090223-ladsgroup.json
[09:02:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[09:02:31] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[09:02:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[09:02:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T343718)', diff saved to https://phabricator.wikimedia.org/P52160 and previous config saved to /var/cache/conftool/dbconfig/20230831-090244-ladsgroup.json
[09:03:55] <logmsgbot>	 !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host snapshot1015.eqiad.wmnet
[09:06:37] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqsin
[09:06:39] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-esams
[09:11:19] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-esams
[09:11:23] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr3-eqsin
[09:11:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:12:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T343718)', diff saved to https://phabricator.wikimedia.org/P52161 and previous config saved to /var/cache/conftool/dbconfig/20230831-091237-ladsgroup.json
[09:12:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[09:12:43] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[09:12:48] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1016.eqiad.wmnet
[09:12:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[09:12:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T343718)', diff saved to https://phabricator.wikimedia.org/P52162 and previous config saved to /var/cache/conftool/dbconfig/20230831-091258-ladsgroup.json
[09:13:11] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (10SLyngshede-WMF)
[09:14:17] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Depend mesh.configuration:1.4 on mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953630 (owner: 10Alexandros Kosiaris)
[09:14:57] <wikibugs>	 (03Merged) 10jenkins-bot: Depend mesh.configuration:1.4 on mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953630 (owner: 10Alexandros Kosiaris)
[09:15:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T343718)', diff saved to https://phabricator.wikimedia.org/P52163 and previous config saved to /var/cache/conftool/dbconfig/20230831-091507-ladsgroup.json
[09:15:10] <wikibugs>	 (CR) JMeybohm: [C: +1] service: add media-analytics service entry [puppet] - https://gerrit.wikimedia.org/r/951901 (https://phabricator.wikimedia.org/T336380) (owner: Hnowlan)
[09:15:45] <wikibugs>	 (CR) JMeybohm: [C: +1] cassandra-http-gateway: use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/953666 (https://phabricator.wikimedia.org/T300033) (owner: Hnowlan)
[09:16:28] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-eqsin
[09:16:30] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:17:40] <wikibugs>	 (PS1) Slyngshede: C:gerrit Link account creation to IDM. [puppet] - https://gerrit.wikimedia.org/r/953967 (https://phabricator.wikimedia.org/T345226)
[09:18:45] <wikibugs>	 (CR) Jbond: "lgtm" [puppet] - https://gerrit.wikimedia.org/r/953960 (owner: Muehlenhoff)
[09:21:58] <wikibugs>	 (PS1) Jelto: gitlab: enable local_gems in devtools test instance [puppet] - https://gerrit.wikimedia.org/r/953968 (https://phabricator.wikimedia.org/T337570)
[09:22:22] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T343718)', diff saved to https://phabricator.wikimedia.org/P52164 and previous config saved to /var/cache/conftool/dbconfig/20230831-092231-ladsgroup.json
[09:22:38] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[09:23:38] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (jbond) >>! In T306238#9132987, @ayounsi wrote: > @jbond from Juniper, does it make sens? >> “If the customer would like to use OIDC they enter in their token for us to use and authe...
[09:24:16] <logmsgbot>	 !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host snapshot1016.eqiad.wmnet
[09:24:23] <wikibugs>	 (CR) JMeybohm: [C: +1] thumbor: Update dependencies to be ready for cert manager [deployment-charts] - https://gerrit.wikimedia.org/r/953667 (https://phabricator.wikimedia.org/T300033) (owner: Hnowlan)
[09:25:24] <wikibugs>	 (Abandoned) Jbond: firewall: move conntrack logic to firewall module [puppet] - https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: Jbond)
[09:25:35] <wikibugs>	 (Abandoned) Jbond: firewall: add conntrack require on the active firewall [puppet] - https://gerrit.wikimedia.org/r/953610 (https://phabricator.wikimedia.org/T336497) (owner: Jbond)
[09:26:37] <wikibugs>	 (CR) Jelto: [C: +2] gitlab: enable local_gems in devtools test instance [puppet] - https://gerrit.wikimedia.org/r/953968 (https://phabricator.wikimedia.org/T337570) (owner: Jelto)
[09:26:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:26:54] <wikibugs>	 (PS1) Ayounsi: Prometheus: scrape gNMIc endpoint [puppet] - https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322)
[09:27:21] <wikibugs>	 (PS1) Cathal Mooney: Move Juniper temp ztp password from installserver to apt_repo [labs/private] - https://gerrit.wikimedia.org/r/953971 (https://phabricator.wikimedia.org/T336485)
[09:27:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:28:08] <wikibugs>	 (PS2) Ayounsi: Prometheus: scrape gNMIc endpoint [puppet] - https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322)
[09:28:16] <wikibugs>	 (PS13) Jbond: ferm: add ensure support to the ferm class [puppet] - https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497)
[09:28:34] <wikibugs>	 (PS3) Ayounsi: Prometheus: scrape gNMIc endpoints [puppet] - https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322)
[09:29:44] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322) (owner: Ayounsi)
[09:30:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P52165 and previous config saved to /var/cache/conftool/dbconfig/20230831-093013-ladsgroup.json
[09:30:29] <wikibugs>	 (PS14) Jbond: ferm: add ensure support to the ferm class [puppet] - https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497)
[09:30:44] <wikibugs>	 (CR) Ayounsi: [C: +2] gNMI: collect data from core routers [puppet] - https://gerrit.wikimedia.org/r/953964 (https://phabricator.wikimedia.org/T326322) (owner: Ayounsi)
[09:30:49] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[09:30:50] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:30:58] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:44] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:39] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43073/console" [puppet] - https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: Jbond)
[09:32:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:33:10] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:33:32] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1017.eqiad.wmnet
[09:35:09] <moritzm>	 !log imported cas 6.6.11+wmf11u1 to apt.wikimedia.org
[09:35:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:35:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:49] <claime>	 Checking mw-web
[09:35:49] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[09:36:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:37:34] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:953962|ores-extension: fix arwiki likelybad threshold (T345305)]] (duration: 68m 57s)
[09:37:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P52166 and previous config saved to /var/cache/conftool/dbconfig/20230831-093738-ladsgroup.json
[09:37:40] <stashbot>	 T345305: MWException: Default '"soft"' is invalid for preference oresDamagingPref of most users - https://phabricator.wikimedia.org/T345305
[09:37:46] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[09:37:50] <wikibugs>	 (PS4) Ayounsi: Prometheus: scrape gNMIc endpoints [puppet] - https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322)
[09:38:04] <wikibugs>	 (PS1) Slyngshede: Lowercase email addresses. [software/bitu] - https://gerrit.wikimedia.org/r/953972
[09:38:21] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Prometheus: scrape gNMIc endpoints [puppet] - https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322) (owner: Ayounsi)
[09:38:45] <Amir1>	 09:37:34 Finished scap: Backport for [[gerrit:953962|ores-extension: fix arwiki likelybad threshold (T345305)]] (duration: 68m 57s)
[09:38:59] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (cmooney) >>! In T345273#9132938, @ayounsi wrote: > FYI there is now a pending diff for: > ` > [edit forwarding-options dhcp-relay] > +...
[09:40:00] <jinxer-wm>	 (HelmReleaseBadStatus) firing: (2) Helm release mw-api-ext/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:40:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:40:17] <claime>	 Not sure why it fired
[09:40:43] <claime>	 Amir1: Did your scap deploy fail? Do you want me to redeploy mw-api-ext?
[09:40:49] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[09:41:00] <Amir1>	 I can re-do it if the reboots are done
[09:41:07] <claime>	 Did you see other releases fail?
[09:41:14] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:41:23] <wikibugs>	 (CR) Ayounsi: [C: +2] Prometheus: scrape gNMIc endpoints [puppet] - https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322) (owner: Ayounsi)
[09:41:30] <claime>	 Amir1: checking reboot status
[09:41:55] <Amir1>	  eqiad: Deployment of mw-api-int-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
[09:42:18] <wikibugs>	 (PS1) Ilias Sarantopoulos: ores-extension: enable lift wing for fiwiki and itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/953973 (https://phabricator.wikimedia.org/T343308)
[09:42:58] <wikibugs>	 (PS1) JMeybohm: jaeger: Fix networkpolicy (indentation) [deployment-charts] - https://gerrit.wikimedia.org/r/953974 (https://phabricator.wikimedia.org/T344253)
[09:44:03] <wikibugs>	 (CR) Jbond: "I have abandoned this and the other change and restored https://gerrit.wikimedia.org/r/c/operations/puppet/+/952889/12" [puppet] - https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: Jbond)
[09:45:10] <logmsgbot>	 !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host snapshot1017.eqiad.wmnet
[09:45:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P52167 and previous config saved to /var/cache/conftool/dbconfig/20230831-094520-ladsgroup.json
[09:45:34] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:45:45] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1001.eqiad.wmnet
[09:45:49] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[09:45:50] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:45:51] <wikibugs>	 (CR) Jbond: Modify Juniper ZTP script used during initial provision (1 comment) [puppet] - https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney)
[09:45:56] <stashbot>	 ariel@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[09:46:38] <wikibugs>	 (CR) JMeybohm: mesh: add tracing support (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[09:46:54] <wikibugs>	 (CR) JMeybohm: [C: +2] jaeger: Fix networkpolicy (indentation) [deployment-charts] - https://gerrit.wikimedia.org/r/953974 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[09:47:18] <wikibugs>	 (PS1) Cathal Mooney: Do not add DHCP exception for unconfigured ints on L3 switches [homer/public] - https://gerrit.wikimedia.org/r/953975 (https://phabricator.wikimedia.org/T345273)
[09:47:34] <claime>	 Amir1: They're still running, but as long as we don't re-run the deployment, code will be out of sync between bare metal and mw-on-k8s
[09:47:39] <wikibugs>	 (Merged) jenkins-bot: jaeger: Fix networkpolicy (indentation) [deployment-charts] - https://gerrit.wikimedia.org/r/953974 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm)
[09:47:42] <claime>	 I'll try and push it through
[09:48:19] <Amir1>	 sure
[09:48:24] <wikibugs>	 (CR) Cathal Mooney: Modify Juniper ZTP script used during initial provision (1 comment) [puppet] - https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney)
[09:49:52] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[09:49:55] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[09:50:44] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[09:50:48] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[09:50:50] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:50:53] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: sync
[09:50:59] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: sync
[09:51:11] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1001.eqiad.wmnet
[09:51:29] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1002.eqiad.wmnet
[09:51:42] <wikibugs>	 (PS1) Ayounsi: Prometheus: gnmi re-label fix [puppet] - https://gerrit.wikimedia.org/r/953976
[09:51:42] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[09:51:59] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[09:52:24] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:52:26] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:52:30] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:52:40] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 59, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:52:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P52168 and previous config saved to /var/cache/conftool/dbconfig/20230831-095244-ladsgroup.json
[09:52:46] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:52:48] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: sync
[09:52:53] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: sync
[09:53:14] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:53:46] <claime>	 Amir1: k, should be all good
[09:53:56] <Amir1>	 awesome, thanks
[09:54:10] <Amir1>	 can you let me know once the reboots are over?
[09:54:12] <wikibugs>	 (CR) Ayounsi: [C: +2] Prometheus: gnmi re-label fix [puppet] - https://gerrit.wikimedia.org/r/953976 (owner: Ayounsi)
[09:54:19] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:55:00] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: (2) Helm release mw-api-ext/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:56:47] <wikibugs>	 (CR) Hnowlan: [C: +2] cassandra-http-gateway: use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/953666 (https://phabricator.wikimedia.org/T300033) (owner: Hnowlan)
[09:56:52] <wikibugs>	 (CR) Muehlenhoff: "Looks good, two comments inline" [puppet] - https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: Jbond)
[09:57:15] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr1-codfw
[09:57:37] <wikibugs>	 (Merged) jenkins-bot: cassandra-http-gateway: use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/953666 (https://phabricator.wikimedia.org/T300033) (owner: Hnowlan)
[09:58:39] <claime>	 Amir1: sure :)
[09:59:11] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1002.eqiad.wmnet
[09:59:14] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: -1] "I think the problem I'm experiencing can be addressed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/953685 in a less invasive " [puppet] - https://gerrit.wikimedia.org/r/953595 (https://phabricator.wikimedia.org/T345240) (owner: Arturo Borrero Gonzalez)
[09:59:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 6h 1m 40s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[10:00:05] <jouncebot>	 mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1000).
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1000)
[10:00:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T343718)', diff saved to https://phabricator.wikimedia.org/P52169 and previous config saved to /var/cache/conftool/dbconfig/20230831-100026-ladsgroup.json
[10:00:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[10:00:37] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[10:00:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[10:00:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T343718)', diff saved to https://phabricator.wikimedia.org/P52170 and previous config saved to /var/cache/conftool/dbconfig/20230831-100047-ladsgroup.json
[10:00:57] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1004.eqiad.wmnet
[10:01:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:06] <wikibugs>	 (CR) Jbond: [V: +1] "updated" [puppet] - https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: Jbond)
[10:01:12] <wikibugs>	 (CR) Ayounsi: [C: +1] Do not add DHCP exception for unconfigured ints on L3 switches [homer/public] - https://gerrit.wikimedia.org/r/953975 (https://phabricator.wikimedia.org/T345273) (owner: Cathal Mooney)
[10:01:33] <wikibugs>	 (PS4) Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361)
[10:01:44] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:46] <wikibugs>	 (CR) CI reject: [V: -1] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[10:01:49] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-codfw
[10:02:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:02:49] <wikibugs>	 (CR) JMeybohm: [C: -1] hieradata: add jaeger collector to service catalog (2 comments) [puppet] - https://gerrit.wikimedia.org/r/952151 (https://phabricator.wikimedia.org/T344253) (owner: Filippo Giunchedi)
[10:02:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T343718)', diff saved to https://phabricator.wikimedia.org/P52171 and previous config saved to /var/cache/conftool/dbconfig/20230831-100256-ladsgroup.json
[10:03:10] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 6h 8m 17s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[10:05:23] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/953963 (https://phabricator.wikimedia.org/T316544) (owner: Ayounsi)
[10:06:52] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (Ladsgroup) What kind of analytics data you need access to? https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Level...
[10:07:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:07:30] <wikibugs>	 (PS5) Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361)
[10:07:44] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1004.eqiad.wmnet
[10:07:48] <wikibugs>	 (CR) Cathal Mooney: [C: +1] Only advertise local customers to external peers [homer/public] - https://gerrit.wikimedia.org/r/947993 (https://phabricator.wikimedia.org/T334530) (owner: Ayounsi)
[10:07:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T343718)', diff saved to https://phabricator.wikimedia.org/P52172 and previous config saved to /var/cache/conftool/dbconfig/20230831-100750-ladsgroup.json
[10:07:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[10:07:53] <wikibugs>	 (CR) Ladsgroup: [C: +1] "I'll deploy it once the k8s reboots are done." [mediawiki-config] - https://gerrit.wikimedia.org/r/953973 (https://phabricator.wikimedia.org/T343308) (owner: Ilias Sarantopoulos)
[10:07:59] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[10:08:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[10:08:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T343718)', diff saved to https://phabricator.wikimedia.org/P52173 and previous config saved to /var/cache/conftool/dbconfig/20230831-100811-ladsgroup.json
[10:08:15] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1005.eqiad.wmnet
[10:10:59] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply
[10:11:41] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply
[10:15:12] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1005.eqiad.wmnet
[10:15:15] <wikibugs>	 (PS2) Ladsgroup: admin: Add Mabualruz to analytics-private-data [puppet] - https://gerrit.wikimedia.org/r/953565 (https://phabricator.wikimedia.org/T342535)
[10:16:25] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply
[10:16:29] <wikibugs>	 (CR) Cathal Mooney: [C: -1] "Manual record is fine but we should remove it from Netbox if we want to do that, see comment inline probably best to leave this handled fr" [dns] - https://gerrit.wikimedia.org/r/936236 (https://phabricator.wikimedia.org/T341220) (owner: Arturo Borrero Gonzalez)
[10:16:43] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr1-drmrs
[10:17:19] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply
[10:17:50] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[10:18:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P52174 and previous config saved to /var/cache/conftool/dbconfig/20230831-101802-ladsgroup.json
[10:18:44] <wikibugs>	 (PS3) Cathal Mooney: Change hierdata parents for leaf switches eqiad row F [puppet] - https://gerrit.wikimedia.org/r/928056 (https://phabricator.wikimedia.org/T322937)
[10:19:10] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/953565 (https://phabricator.wikimedia.org/T342535) (owner: Ladsgroup)
[10:19:47] <wikibugs>	 (PS1) Clément Goubert: mw-api-ext: Raise number of canary replicas [deployment-charts] - https://gerrit.wikimedia.org/r/953979
[10:20:10] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply
[10:20:47] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply
[10:21:24] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-drmrs
[10:21:30] <stashbot>	 ayounsi@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[10:21:30] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Change hierdata parents for leaf switches eqiad row F [puppet] - 10https://gerrit.wikimedia.org/r/928056 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney)
[10:21:42] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host htmldumper1001.eqiad.wmnet
[10:22:45] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - cmooney@cumin1001"
[10:22:59] <wikibugs>	 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (10MatthewVernon)
[10:23:15] <wikibugs>	 (03PS2) 10Cathal Mooney: Adjust network prepare-upgrade cookbook to use TCP 8080 [cookbooks] - 10https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028)
[10:23:36] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply
[10:23:37] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - cmooney@cumin1001"
[10:23:37] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:23:37] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet
[10:23:55] <wikibugs>	 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (10MatthewVernon) [I spoke to @KOfori about this, and they suggested opening a phab task tagged traffic was the best next step]
[10:24:11] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply
[10:25:18] <wikibugs>	 (03PS3) 10Ladsgroup: admin: Add Mabualruz to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/953565 (https://phabricator.wikimedia.org/T342535)
[10:25:20] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka main-eqiad cluster: Reboot kafka nodes
[10:25:22] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Add Mabualruz to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/953565 (https://phabricator.wikimedia.org/T342535) (owner: 10Ladsgroup)
[10:26:22] <wikibugs>	 (03PS3) 10Cathal Mooney: Adjust network prepare-upgrade cookbook to use TCP 8080 [cookbooks] - 10https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028)
[10:26:48] <wikibugs>	 (03CR) 10Cathal Mooney: Adjust network prepare-upgrade cookbook to use TCP 8080 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney)
[10:27:40] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host htmldumper1001.eqiad.wmnet
[10:28:01] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Facter: Python version [puppet] - 10https://gerrit.wikimedia.org/r/942641 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede)
[10:28:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T343718)', diff saved to https://phabricator.wikimedia.org/P52175 and previous config saved to /var/cache/conftool/dbconfig/20230831-102813-ladsgroup.json
[10:28:19] <stashbot>	 ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[10:28:21] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[10:30:33] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43076/console" [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[10:31:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm https://puppet-compiler.wmflabs.org/output/953674/43076/" [labs/private] - 10https://gerrit.wikimedia.org/r/953971 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[10:31:11] <wikibugs>	 (03PS6) 10Cathal Mooney: Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350)
[10:31:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm https://puppet-compiler.wmflabs.org/output/953674/43076/" [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[10:31:35] <wikibugs>	 (03CR) 10Muehlenhoff: firewall: move conntrack logic to firewall module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond)
[10:33:03] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Move Juniper temp ztp password from installserver to apt_repo [labs/private] - 10https://gerrit.wikimedia.org/r/953971 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[10:33:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney)
[10:33:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P52176 and previous config saved to /var/cache/conftool/dbconfig/20230831-103308-ladsgroup.json
[10:33:17] <wikibugs>	 (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Move Juniper temp ztp password from installserver to apt_repo [labs/private] - 10https://gerrit.wikimedia.org/r/953971 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[10:33:57] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:34:21] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet
[10:34:50] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney)
[10:35:17] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[10:35:22] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[10:35:38] <wikibugs>	 (03CR) 10Jon Harald Søby: [C: 04-1] "The "Á" at the end of wikipedia-wordmark-tly.svg looks weird, like it's been squished to make the entire letter fit the height of the "V"." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953751 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx)
[10:36:05] <wikibugs>	 (03Merged) 10jenkins-bot: Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney)
[10:37:33] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply
[10:38:01] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply
[10:38:32] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:38:57] <wikibugs>	 (03CR) 10Muehlenhoff: ferm: add ensure support to the ferm class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond)
[10:39:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1017.eqiad.wmnet with reason: Maintenance
[10:39:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1017.eqiad.wmnet with reason: Maintenance
[10:39:58] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:40:18] <icinga-wm>	 RECOVERY - MariaDB memory on clouddb1017 is OK: OK Memory 0% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:40:51] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover IDP to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/953980
[10:41:46] <moritzm>	 !log installing cjose security updates
[10:41:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:57] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 04-1] "PCC not as expected: https://puppet-compiler.wmflabs.org/output/953685/43074/" [puppet] - 10https://gerrit.wikimedia.org/r/953685 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[10:42:15] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet
[10:42:38] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy
[10:43:02] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply
[10:43:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P52177 and previous config saved to /var/cache/conftool/dbconfig/20230831-104319-ladsgroup.json
[10:43:26] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply
[10:44:04] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:44:08] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Modify Juniper ZTP script used during initial provision [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[10:44:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for cjose [puppet] - 10https://gerrit.wikimedia.org/r/953981
[10:44:48] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2: https://wikitech.wikimedia.org/wiki/HAProxy
[10:45:43] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: use modern recursor setting for cloudservices1006 [puppet] - 10https://gerrit.wikimedia.org/r/953685 (https://phabricator.wikimedia.org/T345240)
[10:46:14] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: use modern recursor setting for cloudservices1006 [puppet] - 10https://gerrit.wikimedia.org/r/953685 (https://phabricator.wikimedia.org/T345240)
[10:46:27] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply
[10:46:53] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1006.eqiad.wmnet
[10:47:00] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply
[10:47:09] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953685 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[10:47:22] <logmsgbot>	 !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@90f280e]: (no justification provided)
[10:47:31] <logmsgbot>	 !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@90f280e]: (no justification provided) (duration: 00m 09s)
[10:48:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for cjose [puppet] - 10https://gerrit.wikimedia.org/r/953981 (owner: 10Muehlenhoff)
[10:48:13] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply
[10:48:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T343718)', diff saved to https://phabricator.wikimedia.org/P52178 and previous config saved to /var/cache/conftool/dbconfig/20230831-104815-ladsgroup.json
[10:48:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[10:48:22] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[10:48:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[10:48:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T343718)', diff saved to https://phabricator.wikimedia.org/P52179 and previous config saved to /var/cache/conftool/dbconfig/20230831-104836-ladsgroup.json
[10:48:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: pdns: use modern recursor setting for cloudservices1006 [puppet] - 10https://gerrit.wikimedia.org/r/953685 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez)
[10:48:54] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply
[10:49:03] <wikibugs>	 (03PS15) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497)
[10:49:51] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply
[10:50:00] <moritzm>	 !log installing flask security updates on buster
[10:50:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:20] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply
[10:50:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T343718)', diff saved to https://phabricator.wikimedia.org/P52180 and previous config saved to /var/cache/conftool/dbconfig/20230831-105044-ladsgroup.json
[10:50:45] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply
[10:50:51] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet
[10:50:52] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[10:51:19] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply
[10:51:46] <icinga-wm>	 PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:52:07] <wikibugs>	 (03PS16) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497)
[10:52:12] <wikibugs>	 (03CR) 10Jbond: "fixed" [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond)
[10:52:55] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001"
[10:53:10] <icinga-wm>	 RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:53:44] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001"
[10:53:44] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:53:47] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43077/console" [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond)
[10:54:20] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[10:54:39] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[10:54:43] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1006.eqiad.wmnet
[10:55:40] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:23] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Do not add DHCP exception for unconfigured ints on L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/953975 (https://phabricator.wikimedia.org/T345273) (owner: 10Cathal Mooney)
[10:56:54] <wikibugs>	 (03Merged) 10jenkins-bot: Do not add DHCP exception for unconfigured ints on L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/953975 (https://phabricator.wikimedia.org/T345273) (owner: 10Cathal Mooney)
[10:57:04] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:57:23] <wikibugs>	 (03PS1) 10Hnowlan: device-analytics: use global AQS configuration files [deployment-charts] - 10https://gerrit.wikimedia.org/r/953982 (https://phabricator.wikimedia.org/T320967)
[10:58:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P52181 and previous config saved to /var/cache/conftool/dbconfig/20230831-105826-ladsgroup.json
[11:01:11] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/952848 (owner: 10Muehlenhoff)
[11:01:41] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (10Vgutierrez) Happy to provide assistance and guidance if needed but caching is technically controlled by the backend services and not by the CDN. the CDN imp...
[11:01:57] <jinxer-wm>	 (SystemdUnitFailed) firing: elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:05:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P52182 and previous config saved to /var/cache/conftool/dbconfig/20230831-110551-ladsgroup.json
[11:06:59] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes1025.eqiad.wmnet
[11:06:59] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes1025.eqiad.wmnet
[11:07:43] <wikibugs>	 (03PS1) 10Jbond: run-puppet-agent: drop deprecated ignorecache switch [puppet] - 10https://gerrit.wikimedia.org/r/953985 (https://phabricator.wikimedia.org/T341496)
[11:08:00] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-a1-codfw.mgmt.codfw.wmnet
[11:08:34] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:09:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] run-puppet-agent: drop deprecated ignorecache switch [puppet] - 10https://gerrit.wikimedia.org/r/953985 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[11:13:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T343718)', diff saved to https://phabricator.wikimedia.org/P52183 and previous config saved to /var/cache/conftool/dbconfig/20230831-111332-ladsgroup.json
[11:13:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance
[11:13:38] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[11:13:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance
[11:13:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T343718)', diff saved to https://phabricator.wikimedia.org/P52184 and previous config saved to /var/cache/conftool/dbconfig/20230831-111353-ladsgroup.json
[11:15:06] <wikibugs>	 (03PS2) 10Slyngshede: Lowercase email addresses. [software/bitu] - 10https://gerrit.wikimedia.org/r/953972
[11:15:46] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:10] <wikibugs>	 (03PS2) 10Jbond: run-puppet-agent: drop deprecated ignorecache switch [puppet] - 10https://gerrit.wikimedia.org/r/953985 (https://phabricator.wikimedia.org/T341496)
[11:17:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks! PCC is also fine." [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond)
[11:19:20] <wikibugs>	 (03PS1) 10Jbond: puppet: drop deprecated ignorecache switch [software/spicerack] - 10https://gerrit.wikimedia.org/r/953990 (https://phabricator.wikimedia.org/T341496)
[11:20:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P52185 and previous config saved to /var/cache/conftool/dbconfig/20230831-112057-ladsgroup.json
[11:24:54] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] thumbor: Update dependencies to be ready for cert manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/953667 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan)
[11:25:43] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: Update dependencies to be ready for cert manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/953667 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan)
[11:27:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (10cmooney) 05Open→03Resolved
[11:27:06] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney)
[11:30:32] <wikibugs>	 10SRE-tools, 10Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10jbond) p:05Triage→03Medium
[11:31:02] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad
[11:31:35] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/output/953577/43058/" [puppet] - 10https://gerrit.wikimedia.org/r/953577 (owner: 10Majavah)
[11:31:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T343718)', diff saved to https://phabricator.wikimedia.org/P52186 and previous config saved to /var/cache/conftool/dbconfig/20230831-113136-ladsgroup.json
[11:31:42] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[11:32:25] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1001.eqiad.wmnet
[11:32:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Openstack: remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/952848 (owner: 10Muehlenhoff)
[11:33:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Failover IDP to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/953980 (owner: 10Muehlenhoff)
[11:33:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P52187 and previous config saved to /var/cache/conftool/dbconfig/20230831-113324-root.json
[11:33:27] <claime>	 Amir1: eqiad k8s reboots done, give a few minutes to jayme so he can reboot the masters and you're good to keep deploying
[11:34:29] <wikibugs>	 (03PS1) 10Kosta Harlan: Add ReportIncident extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953998 (https://phabricator.wikimedia.org/T339275)
[11:34:31] <wikibugs>	 (03PS1) 10Kosta Harlan: ReportIncident: Default deployment to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953999 (https://phabricator.wikimedia.org/T339275)
[11:35:16] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1119 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/953555 (https://phabricator.wikimedia.org/T339835) (owner: 10Marostegui)
[11:35:25] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10jbond)
[11:35:39] <marostegui>	 moritzm: ok to merge your change?
[11:36:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T343718)', diff saved to https://phabricator.wikimedia.org/P52189 and previous config saved to /var/cache/conftool/dbconfig/20230831-113603-ladsgroup.json
[11:36:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[11:36:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[11:36:08] <moritzm>	 marostegui: yes, please
[11:36:14] <marostegui>	 moritzm: done
[11:36:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T343718)', diff saved to https://phabricator.wikimedia.org/P52190 and previous config saved to /var/cache/conftool/dbconfig/20230831-113613-ladsgroup.json
[11:36:34] <moritzm>	 cheers
[11:37:14] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:45] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] mesh: new configuration version [deployment-charts] - 10https://gerrit.wikimedia.org/r/953575 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[11:38:06] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] mw-api-ext: Raise number of canary replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/953979 (owner: 10Clément Goubert)
[11:38:23] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10jbond) Also worth noting that version >= 6 are not currently working with spicerack (T328775)
[11:39:02] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-api-ext: Raise number of canary replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/953979 (owner: 10Clément Goubert)
[11:39:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T343718)', diff saved to https://phabricator.wikimedia.org/P52191 and previous config saved to /var/cache/conftool/dbconfig/20230831-113922-ladsgroup.json
[11:39:32] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[11:39:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:39:47] <wikibugs>	 (03Merged) 10jenkins-bot: mw-api-ext: Raise number of canary replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/953979 (owner: 10Clément Goubert)
[11:39:53] <wikibugs>	 (03CR) 10JMeybohm: "@jbond can you maybe take a look please?" [puppet] - 10https://gerrit.wikimedia.org/r/951124 (https://phabricator.wikimedia.org/T341669) (owner: 10JMeybohm)
[11:40:32] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 04-2] "Wait until code is present on all branches running in production." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953998 (https://phabricator.wikimedia.org/T339275) (owner: 10Kosta Harlan)
[11:40:33] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[11:40:47] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[11:43:29] <Amir1>	 claime: thanks!
[11:44:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:44:46] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kubemaster1001.eqiad.wmnet
[11:45:20] <icinga-wm>	 PROBLEM - Check systemd state on kubemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:40] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:44] <wikibugs>	 (03PS1) 10Clément Goubert: mw-api-ext, mw-web: Raise total replicas to 13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/954000 (https://phabricator.wikimedia.org/T341780)
[11:46:34] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a8-codfw.mgmt.codfw.wmnet
[11:46:36] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[11:46:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P52192 and previous config saved to /var/cache/conftool/dbconfig/20230831-114642-ladsgroup.json
[11:46:52] <icinga-wm>	 RECOVERY - Check systemd state on kubemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.mysql.clone of db1132.eqiad.wmnet onto db1119.eqiad.wmnet
[11:47:11] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1002.eqiad.wmnet
[11:48:04] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:49:13] <wikibugs>	 (PS1) Majavah: Move WMCS haproxy scrapes to WMCS prometheus instance [puppet] - https://gerrit.wikimedia.org/r/954001
[11:49:46] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] ferm: add ensure support to the ferm class [puppet] - https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: Jbond)
[11:50:03] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] ferm: add ensure support to the ferm class (2 comments) [puppet] - https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: Jbond)
[11:50:08] <wikibugs>	 (PS2) Majavah: Move WMCS haproxy scrapes to WMCS prometheus instance [puppet] - https://gerrit.wikimedia.org/r/954001 (https://phabricator.wikimedia.org/T345294)
[11:51:15] <wikibugs>	 (PS1) Clément Goubert: mw-on-k8s: Raise traffic to 4% [puppet] - https://gerrit.wikimedia.org/r/954002 (https://phabricator.wikimedia.org/T341780)
[11:51:35] <wikibugs>	 (PS2) Sohom Datta: Allow loading Edit-in-Sequence as a beta feature on Wikisources [mediawiki-config] - https://gerrit.wikimedia.org/r/952928 (https://phabricator.wikimedia.org/T308098)
[11:52:46] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:31] <wikibugs>	 (PS1) JMeybohm: service::catalog: Move k8s-ingress-aux to lvs_setuo [puppet] - https://gerrit.wikimedia.org/r/954003 (https://phabricator.wikimedia.org/T325178)
[11:54:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P52193 and previous config saved to /var/cache/conftool/dbconfig/20230831-115429-ladsgroup.json
[11:55:21] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43079/console" [puppet] - https://gerrit.wikimedia.org/r/954001 (https://phabricator.wikimedia.org/T345294) (owner: Majavah)
[11:59:02] <wikibugs>	 (PS2) JMeybohm: service::catalog: Move k8s-ingress-aux to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/954003 (https://phabricator.wikimedia.org/T325178)
[11:59:26] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kubemaster1002.eqiad.wmnet
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1200)
[12:00:20] <wikibugs>	 (PS3) Slyngshede: Lowercase email addresses. [software/bitu] - https://gerrit.wikimedia.org/r/953972
[12:00:23] <jayme>	 Amir1: gogogo
[12:00:56] <Amir1>	 :D
[12:01:01] <Amir1>	 isaranto: shall we deploy?
[12:01:16] <Amir1>	 (enabling LW in itwiki and so on)
[12:01:22] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P52194 and previous config saved to /var/cache/conftool/dbconfig/20230831-120148-ladsgroup.json
[12:02:18] <aqu>	 !log About to deploy analytics refinery (weekly train)
[12:02:20] <isaranto>	 Amir1: yes! I am here to test
[12:02:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:32] <Amir1>	 jouncebot: nowandnext
[12:02:33] <jouncebot>	 For the next 0 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1200)
[12:02:33] <jouncebot>	 In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1300)
[12:02:38] <Amir1>	 cool
[12:02:50] <wikibugs>	 (CR) Ladsgroup: [C: +2] ores-extension: enable lift wing for fiwiki and itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/953973 (https://phabricator.wikimedia.org/T343308) (owner: Ilias Sarantopoulos)
[12:02:55] <wikibugs>	 (CR) JMeybohm: [C: +1] mw-api-ext, mw-web: Raise total replicas to 13 [deployment-charts] - https://gerrit.wikimedia.org/r/954000 (https://phabricator.wikimedia.org/T341780) (owner: Clément Goubert)
[12:03:11] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/953973 (https://phabricator.wikimedia.org/T343308) (owner: Ilias Sarantopoulos)
[12:03:21] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@06203c0]: Regular analytics weekly train [analytics/refinery@06203c0]
[12:03:29] <wikibugs>	 (Merged) jenkins-bot: ores-extension: enable lift wing for fiwiki and itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/953973 (https://phabricator.wikimedia.org/T343308) (owner: Ilias Sarantopoulos)
[12:03:44] <wikibugs>	 (CR) JMeybohm: [C: +1] mw-on-k8s: Raise traffic to 4% [puppet] - https://gerrit.wikimedia.org/r/954002 (https://phabricator.wikimedia.org/T341780) (owner: Clément Goubert)
[12:03:56] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:953973|ores-extension: enable lift wing for fiwiki and itwiki (T343308)]]
[12:04:01] <stashbot>	 T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308
[12:04:04] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:04:40] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43080/console" [puppet] - https://gerrit.wikimedia.org/r/954003 (https://phabricator.wikimedia.org/T325178) (owner: JMeybohm)
[12:05:08] <wikibugs>	 (PS1) Sergio Gimeno: GrowthExperiments: enable AddLink backend for swwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/954004 (https://phabricator.wikimedia.org/T308139)
[12:05:11] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +1] service::catalog: Move k8s-ingress-aux to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/954003 (https://phabricator.wikimedia.org/T325178) (owner: JMeybohm)
[12:05:34] <logmsgbot>	 !log ladsgroup@deploy1002 isaranto and ladsgroup: Backport for [[gerrit:953973|ores-extension: enable lift wing for fiwiki and itwiki (T343308)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[12:05:54] <Amir1>	 isaranto: it's live in mwdebug
[12:06:14] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:06:52] <vgutierrez>	 jouncebot: nowandnext
[12:06:52] <jouncebot>	 For the next 0 hour(s) and 53 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1200)
[12:06:52] <jouncebot>	 In 0 hour(s) and 53 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1300)
[12:07:42] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:07:46] <Amir1>	 vgutierrez: I'm deploying :D
[12:09:19] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a1-codfw.mgmt.codfw.wmnet
[12:09:30] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a1-codfw.mgmt.codfw.wmnet
[12:09:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P52195 and previous config saved to /var/cache/conftool/dbconfig/20230831-120935-ladsgroup.json
[12:09:41] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a2-codfw.mgmt.codfw.wmnet
[12:09:49] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a3-codfw.mgmt.codfw.wmnet
[12:09:58] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a4-codfw.mgmt.codfw.wmnet
[12:10:01] <jinxer-wm>	 (DatasourceError) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[12:10:01] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a2-codfw.mgmt.codfw.wmnet
[12:10:07] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a5-codfw.mgmt.codfw.wmnet
[12:10:09] <Amir1>	 isaranto: are you testing?
[12:10:09] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a3-codfw.mgmt.codfw.wmnet
[12:10:15] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a4-codfw.mgmt.codfw.wmnet
[12:10:27] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a7-codfw.mgmt.codfw.wmnet
[12:10:29] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a6-codfw.mgmt.codfw.wmnet
[12:10:37] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a7-codfw.mgmt.codfw.wmnet
[12:10:37] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a6-codfw.mgmt.codfw.wmnet
[12:10:39] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a5-codfw.mgmt.codfw.wmnet
[12:10:41] <wikibugs>	 (PS4) Slyngshede: Lowercase email addresses. [software/bitu] - https://gerrit.wikimedia.org/r/953972
[12:10:48] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a8-codfw.mgmt.codfw.wmnet
[12:10:51] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b2-codfw.mgmt.codfw.wmnet
[12:10:53] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:10:59] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-b2-codfw.mgmt.codfw.wmnet
[12:12:56] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:12:57] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "maybe include the required firewalling changes in the same patch?" [puppet] - https://gerrit.wikimedia.org/r/954001 (https://phabricator.wikimedia.org/T345294) (owner: Majavah)
[12:13:45] <isaranto>	 Amir1: yes. I am getting a 503 when running a job for itwiki -> `Service failed to respond properly: Failed to make LiftWing request to [http://localhost:6031/v1/models/itwiki-damaging:predict], There was a problem during the HTTP request: 503 Service Unavailable`
[12:14:19] <wikibugs>	 (PS3) Majavah: Move WMCS haproxy scrapes to WMCS prometheus instance [puppet] - https://gerrit.wikimedia.org/r/954001 (https://phabricator.wikimedia.org/T345294)
[12:14:22] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:14:52] <jinxer-wm>	 (DatasourceError) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[12:15:11] <Amir1>	 that seems to be a problem from LW 
[12:15:23] <wikibugs>	 (PS2) Sergio Gimeno: GrowthExperiments: enable AddLink backend for swwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/954004 (https://phabricator.wikimedia.org/T308139)
[12:15:32] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs01_sync.service,netbox_ganeti_drmrs02_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:15:33] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet
[12:15:35] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:15:37] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@06203c0]: Regular analytics weekly train [analytics/refinery@06203c0] (duration: 12m 15s)
[12:16:06] <wikibugs>	 (PS1) KartikMistry: Update MinT to 2023-08-31-061147-production [deployment-charts] - https://gerrit.wikimedia.org/r/954005 (https://phabricator.wikimedia.org/T336683)
[12:16:15] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a1-codfw.mgmt.codfw.wmnet
[12:16:19] <isaranto>	 Amir1: checking from another host
[12:16:21] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a1-codfw.mgmt.codfw.wmnet
[12:16:30] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a4-codfw.mgmt.codfw.wmnet
[12:16:32] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:16:34] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a5-codfw.mgmt.codfw.wmnet
[12:16:36] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:16:39] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a6-codfw.mgmt.codfw.wmnet
[12:16:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T343718)', diff saved to https://phabricator.wikimedia.org/P52196 and previous config saved to /var/cache/conftool/dbconfig/20230831-121654-ladsgroup.json
[12:16:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[12:17:00] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[12:17:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[12:17:11] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a6-codfw.mgmt.codfw.wmnet
[12:17:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[12:17:13] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b5-codfw.mgmt.codfw.wmnet
[12:17:14] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b6-codfw.mgmt.codfw.wmnet
[12:17:14] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[12:17:14] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b4-codfw.mgmt.codfw.wmnet
[12:17:14] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:17:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[12:17:15] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:17:17] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b3-codfw.mgmt.codfw.wmnet
[12:17:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T343718)', diff saved to https://phabricator.wikimedia.org/P52197 and previous config saved to /var/cache/conftool/dbconfig/20230831-121721-ladsgroup.json
[12:17:31] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] Move WMCS haproxy scrapes to WMCS prometheus instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/954001 (https://phabricator.wikimedia.org/T345294) (owner: Majavah)
[12:17:43] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:17:46] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-b3-codfw.mgmt.codfw.wmnet
[12:18:12] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=93) for device lsw1-b6-codfw.mgmt.codfw.wmnet
[12:18:44] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:19:11] <wikibugs>	 (PS3) Sergio Gimeno: GrowthExperiments: enable AddLink backend for swwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/954004 (https://phabricator.wikimedia.org/T308139)
[12:19:20] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:19:26] <topranks>	 I think I may have borked netbox running all those network.provision cookbooks in parallel 
[12:19:46] <logmsgbot>	 !log cmooney@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[12:19:53] <wikibugs>	 (CR) JMeybohm: [C: +2] service::catalog: Move k8s-ingress-aux to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/954003 (https://phabricator.wikimedia.org/T325178) (owner: JMeybohm)
[12:20:01] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[12:20:08] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/953972 (owner: Slyngshede)
[12:20:15] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=93) for device ssw1-a1-codfw.mgmt.codfw.wmnet
[12:20:23] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[12:20:48] <topranks>	 It triggered 16 x sre.dns.netbox cookbook executions in parallel after which netbox started to struggle 
[12:21:10] <topranks>	 I've aborted/they've timed out, I'll do it serially instead 
[12:21:17] <topranks>	 sorry for any problems 
[12:21:40] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet
[12:21:41] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:22:54] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:23:10] <wikibugs>	 (PS1) Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669)
[12:23:12] <wikibugs>	 (CR) Ayounsi: [C: +2] Only advertise local customers to external peers [homer/public] - https://gerrit.wikimedia.org/r/947993 (https://phabricator.wikimedia.org/T334530) (owner: Ayounsi)
[12:23:20] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[12:23:27] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet
[12:23:35] <isaranto>	 Amir1: it works now, I don't know why. 
[12:23:44] <wikibugs>	 (Merged) jenkins-bot: Only advertise local customers to external peers [homer/public] - https://gerrit.wikimedia.org/r/947993 (https://phabricator.wikimedia.org/T334530) (owner: Ayounsi)
[12:23:47] <logmsgbot>	 !log ladsgroup@deploy1002 isaranto and ladsgroup: Continuing with sync
[12:23:57] <Amir1>	 I'll push it forward, we'll see
[12:24:07] <isaranto>	 There isn't an issue with LW. Perhaps it had to do with the envoy proxy
[12:24:13] <wikibugs>	 (PS3) Sergio Gimeno: GrowthExperiments: enable add a link in 12th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137)
[12:24:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T343718)', diff saved to https://phabricator.wikimedia.org/P52198 and previous config saved to /var/cache/conftool/dbconfig/20230831-122441-ladsgroup.json
[12:24:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[12:24:51] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[12:24:53] <jinxer-wm>	 (DatasourceError) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[12:24:56] <wikibugs>	 (CR) CI reject: [V: -1] confd: -prefix from confd cli to confd::file instances [puppet] - https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: Jbond)
[12:24:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[12:24:59] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] Lowercase email addresses. [software/bitu] - https://gerrit.wikimedia.org/r/953972 (owner: Slyngshede)
[12:24:59] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet
[12:25:01] <jayme>	 !log restarting pybal on lvs1020 - T325178
[12:25:01] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:25:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52199 and previous config saved to /var/cache/conftool/dbconfig/20230831-122502-ladsgroup.json
[12:25:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:31] <stashbot>	 T325178: Add ingress to aux-k8s - https://phabricator.wikimedia.org/T325178
[12:25:36] <wikibugs>	 (PS4) Sergio Gimeno: GrowthExperiments: enable add a link in 12th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137)
[12:26:17] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-a8-codfw.mgmt.codfw.wmnet
[12:26:22] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:26:44] <wikibugs>	 (PS1) Stevemunene: idp: add datahub as oidc service [puppet] - https://gerrit.wikimedia.org/r/954009 (https://phabricator.wikimedia.org/T305874)
[12:27:01] <Amir1>	 I'm not seeing anything so far in the log
[12:27:04] <Amir1>	 *logs
[12:27:14] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001"
[12:27:54] <jayme>	 !log restarting pybal on lvs1019 - T325178
[12:27:59] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001"
[12:27:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:59] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:28:26] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:28:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:28:48] <wikibugs>	 (PS2) Sergio Gimeno: GrowthExperiments: enable AddLink frontend 13th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138)
[12:29:10] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:29:16] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:29:44] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:29:47] <wikibugs>	 (CR) Sergio Gimeno: "Scheduled September 6th" [mediawiki-config] - https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) (owner: Sergio Gimeno)
[12:29:53] <wikibugs>	 (CR) Sergio Gimeno: "Scheduled September 6th" [mediawiki-config] - https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138) (owner: Sergio Gimeno)
[12:30:10] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:01] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:953973|ores-extension: enable lift wing for fiwiki and itwiki (T343308)]] (duration: 27m 05s)
[12:31:06] <wikibugs>	 (PS1) Muehlenhoff: Add testreduce1002 [puppet] - https://gerrit.wikimedia.org/r/954010 (https://phabricator.wikimedia.org/T345220)
[12:31:09] <stashbot>	 T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308
[12:31:23] <wikibugs>	 (CR) Majavah: [C: +1] replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro)
[12:31:41] <wikibugs>	 (PS4) Sergio Gimeno: GrowthExperiments: enable AddLink backend for swwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/954004 (https://phabricator.wikimedia.org/T308138)
[12:31:43] <wikibugs>	 (PS5) Sergio Gimeno: GrowthExperiments: enable add a link in 12th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137)
[12:31:45] <wikibugs>	 (PS3) Sergio Gimeno: GrowthExperiments: enable AddLink frontend 13th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138)
[12:32:01] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@06203c0] (thin): Regular analytics weekly train THIN [analytics/refinery@06203c0]
[12:32:06] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@06203c0] (thin): Regular analytics weekly train THIN [analytics/refinery@06203c0] (duration: 00m 04s)
[12:32:12] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@06203c0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@06203c0]
[12:32:13] <wikibugs>	 (CR) Peter Fischer: "Thanks, the config parameters look a lot cleaner now! I haven't understood how and where they are actually passed to the application." [deployment-charts] - https://gerrit.wikimedia.org/r/951960 (owner: Ebernhardson)
[12:32:17] <wikibugs>	 SRE, Infrastructure-Foundations, Stewards-and-global-tools, vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (Urbanecm) >>! In T344164#9130881, @MoritzMuehlenhoff wrote: > From a high level view that seems perfectly fine. We initiate non-wiki offboardings from...
[12:33:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:33:35] <wikibugs>	 (PS1) JMeybohm: service::catalog: Move k8s-ingress-aux to production [puppet] - https://gerrit.wikimedia.org/r/954011 (https://phabricator.wikimedia.org/T325178)
[12:34:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T343718)', diff saved to https://phabricator.wikimedia.org/P52200 and previous config saved to /var/cache/conftool/dbconfig/20230831-123428-ladsgroup.json
[12:34:34] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[12:35:20] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@06203c0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@06203c0] (duration: 03m 07s)
[12:35:30] <wikibugs>	 (PS2) Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669)
[12:35:55] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device ssw1-a8-codfw
[12:36:05] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-a8-codfw
[12:36:21] <wikibugs>	 (CR) Gehel: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[12:37:04] <wikibugs>	 (CR) JMeybohm: [C: +2] service::catalog: Move k8s-ingress-aux to production [puppet] - https://gerrit.wikimedia.org/r/954011 (https://phabricator.wikimedia.org/T325178) (owner: JMeybohm)
[12:37:14] <wikibugs>	 (CR) CI reject: [V: -1] confd: -prefix from confd cli to confd::file instances [puppet] - https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: Jbond)
[12:38:14] <jinxer-wm>	 (DatasourceError) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[12:38:19] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (ABran-WMF)
[12:38:57] <wikibugs>	 (PS1) Jbond: Revert "ferm: add ensure support to the ferm class" [puppet] - https://gerrit.wikimedia.org/r/953653
[12:39:16] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:39:20] <wikibugs>	 (CR) Muehlenhoff: [C: +1] Revert "ferm: add ensure support to the ferm class" [puppet] - https://gerrit.wikimedia.org/r/953653 (owner: Jbond)
[12:39:30] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2044.codfw.wmnet
[12:39:41] * Lucas_WMDE deploying now
[12:39:46] <Lucas_WMDE>	 (security fix)
[12:39:48] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1044.eqiad.wmnet
[12:40:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "ferm: add ensure support to the ferm class" [puppet] - 10https://gerrit.wikimedia.org/r/953653 (owner: 10Jbond)
[12:41:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "ferm: add ensure support to the ferm class" [puppet] - 10https://gerrit.wikimedia.org/r/953653 (owner: 10Jbond)
[12:41:38] <wikibugs>	 (03PS2) 10Jbond: Revert "ferm: add ensure support to the ferm class" [puppet] - 10https://gerrit.wikimedia.org/r/953653
[12:41:54] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10kamila)
[12:41:58] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10kamila)
[12:42:20] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-a1-codfw.mgmt.codfw.wmnet
[12:42:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52201 and previous config saved to /var/cache/conftool/dbconfig/20230831-124240-ladsgroup.json
[12:42:46] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[12:42:56] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10KOfori) This is approved. Thanks.
[12:43:12] <wikibugs>	 (03PS3) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669)
[12:43:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "ferm: add ensure support to the ferm class" [puppet] - 10https://gerrit.wikimedia.org/r/953653 (owner: 10Jbond)
[12:44:24] <wikibugs>	 (03PS1) 10Elukey: knative-serving: immediately clean up old revisions [deployment-charts] - 10https://gerrit.wikimedia.org/r/954047 (https://phabricator.wikimedia.org/T344058)
[12:45:00] <wikibugs>	 (03PS3) 10Jbond: Revert "ferm: add ensure support to the ferm class" [puppet] - 10https://gerrit.wikimedia.org/r/953653
[12:45:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond)
[12:46:56] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1044.eqiad.wmnet
[12:47:06] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1045.eqiad.wmnet
[12:47:27] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2044.codfw.wmnet
[12:47:34] <logmsgbot>	 !log lucaswerkmeister-wmde Deployed security patch for T345064
[12:47:43] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2045.codfw.wmnet
[12:48:14] <jinxer-wm>	 (DatasourceError) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[12:48:44] <Lucas_WMDE>	 (still deploying, wmf.24 now)
[12:48:59] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo)
[12:49:05] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device ssw1-a1-codfw
[12:49:14] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-a1-codfw
[12:49:28] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a1-codfw.mgmt.codfw.wmnet
[12:49:30] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:49:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P52202 and previous config saved to /var/cache/conftool/dbconfig/20230831-124934-ladsgroup.json
[12:50:29] <wikibugs>	 (03PS1) 10Arlolra: Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/954048 (https://phabricator.wikimedia.org/T339365)
[12:50:59] <wikibugs>	 (03PS1) 10Arlolra: Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365)
[12:51:24] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:51:26] <wikibugs>	 (03PS1) 10Anzx: tlywiki: Add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316)
[12:51:55] <wikibugs>	 (03PS1) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/953654 (https://phabricator.wikimedia.org/T336497)
[12:51:57] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] knative-serving: immediately clean up old revisions [deployment-charts] - 10https://gerrit.wikimedia.org/r/954047 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey)
[12:52:13] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] knative-serving: immediately clean up old revisions [deployment-charts] - 10https://gerrit.wikimedia.org/r/954047 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey)
[12:52:50] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "Oh, I forgot: You will have to add something to mesh.networkpolicy as well, allowing the pods to egress to the otel collector." [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[12:52:54] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:53:01] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo)
[12:53:16] <wikibugs>	 (03Abandoned) 10Anzx: tlywiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953751 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx)
[12:53:35] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a1-codfw - cmooney@cumin1001"
[12:54:20] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1045.eqiad.wmnet
[12:54:25] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a1-codfw - cmooney@cumin1001"
[12:54:25] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:54:53] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel)
[12:54:54] <logmsgbot>	 !log lucaswerkmeister-wmde Deployed security patch for T345064
[12:55:11] * Lucas_WMDE done
[12:55:20] <Lucas_WMDE>	 (and probably won’t be around for the backport window in a few minutes, I’m afraid)
[12:55:49] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1046.eqiad.wmnet
[12:55:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:57:31] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:57:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P52203 and previous config saved to /var/cache/conftool/dbconfig/20230831-125746-ladsgroup.json
[12:58:57] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:59:31] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1300).
[13:00:04] <jouncebot>	 gmodena, Sohom_Datta, sergi0, and arlolra: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:15] <sergi0>	 hello
[13:00:27] <gmodena>	 hey hey
[13:00:38] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[13:00:38] <gmodena>	 ^ joal 
[13:00:54] <Sohom_Datta>	 o/
[13:00:55] <wikibugs>	 (03PS4) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669)
[13:00:57] <joal>	 Ack gmodena
[13:02:19] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo)
[13:02:42] <aqu>	 !log Deployed refinery using scap, then deployed onto hdfs
[13:02:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond)
[13:03:42] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2045.codfw.wmnet
[13:04:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P52204 and previous config saved to /var/cache/conftool/dbconfig/20230831-130441-ladsgroup.json
[13:05:17] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:06:14] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[13:06:43] <wikibugs>	 (03PS5) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669)
[13:06:48] <wikibugs>	 (03CR) 10Jbond: "pcc is still failing but it's complete enough to take a look" [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond)
[13:07:16] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:45] <sergi0>	 Is any deployer around? I can deploy otherwise
[13:08:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add testreduce1002 [puppet] - 10https://gerrit.wikimedia.org/r/954010 (https://phabricator.wikimedia.org/T345220) (owner: 10Muehlenhoff)
[13:08:24] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1046.eqiad.wmnet
[13:08:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[13:08:42] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:57] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:09:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[13:09:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra)
[13:10:15] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo)
[13:10:21] <wikibugs>	 (03CR) 10Arlolra: "recheck" [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra)
[13:10:36] <sergi0>	 gmodena: do you need assistance for the backport?
[13:11:38] <gmodena>	 sergi0 I should be able to test the change once it's deployed.
[13:12:00] <sergi0>	 ok, starting with yours
[13:12:09] <gmodena>	 sergi0 awesome, thanks
[13:12:22] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:12:30] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/953960 (owner: 10Muehlenhoff)
[13:12:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P52205 and previous config saved to /var/cache/conftool/dbconfig/20230831-131252-ladsgroup.json
[13:13:05] <wikibugs>	 (03PS6) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361)
[13:13:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by sgimeno@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951929 (https://phabricator.wikimedia.org/T307959) (owner: 10Gmodena)
[13:13:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[13:13:22] <wikibugs>	 (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel)
[13:13:26] <wikibugs>	 (03PS1) 10Majavah: team-wmcs: Add CloudLB backend status checks [alerts] - 10https://gerrit.wikimedia.org/r/954052 (https://phabricator.wikimedia.org/T345294)
[13:13:58] <wikibugs>	 (03Merged) 10jenkins-bot: Remove rc1.mediawiki.page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951929 (https://phabricator.wikimedia.org/T307959) (owner: 10Gmodena)
[13:14:08] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:14:28] <logmsgbot>	 !log sgimeno@deploy1002 Started scap: Backport for [[gerrit:951929|Remove rc1.mediawiki.page_content_change stream (T307959)]]
[13:14:34] <stashbot>	 T307959: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959
[13:14:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] team-wmcs: Add CloudLB backend status checks [alerts] - 10https://gerrit.wikimedia.org/r/954052 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah)
[13:14:52] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: fix gnmi relabel [puppet] - 10https://gerrit.wikimedia.org/r/954053 (https://phabricator.wikimedia.org/T326322)
[13:15:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[13:15:29] <wikibugs>	 (03PS2) 10Majavah: team-wmcs: Add CloudLB backend status checks [alerts] - 10https://gerrit.wikimedia.org/r/954052 (https://phabricator.wikimedia.org/T345294)
[13:15:50] <wikibugs>	 10ops-codfw: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T345266 (10phaultfinder)
[13:16:03] <logmsgbot>	 !log sgimeno@deploy1002 gmodena and sgimeno: Backport for [[gerrit:951929|Remove rc1.mediawiki.page_content_change stream (T307959)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:16:29] <sergi0>	 gmodena: you can test the change in debug server
[13:16:35] <gmodena>	 sergi0 ack
[13:16:43] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] prometheus: fix gnmi relabel [puppet] - 10https://gerrit.wikimedia.org/r/954053 (https://phabricator.wikimedia.org/T326322) (owner: 10Filippo Giunchedi)
[13:16:46] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo) @joanna_borun Asking for sign up of @Arnaud for global root production access as a new member of Data Persistence Team, as you are one of the people being able to approve that. Thank you!
[13:17:14] <gmodena>	 sergi0 everything works as expected. 
[13:17:18] <wikibugs>	 10ops-codfw: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T345266 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm issue was resolved
[13:17:28] <logmsgbot>	 !log sgimeno@deploy1002 gmodena and sgimeno: Continuing with sync
[13:17:34] <sergi0>	 syncing
[13:17:34] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2046.codfw.wmnet
[13:17:44] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1047.eqiad.wmnet
[13:18:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:18:46] <sergi0>	 Sohom_Datta: your patch is next, are you around?
[13:19:05] <Sohom_Datta>	 yep yep
[13:19:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T343718)', diff saved to https://phabricator.wikimedia.org/P52206 and previous config saved to /var/cache/conftool/dbconfig/20230831-131947-ladsgroup.json
[13:19:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[13:19:53] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[13:20:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[13:20:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52207 and previous config saved to /var/cache/conftool/dbconfig/20230831-132009-ladsgroup.json
[13:20:15] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1010.eqiad.wmnet with OS bullseye
[13:20:29] <wikibugs>	 (03CR) 10Jon Harald Søby: "This is not something that's wrong with the patch per se, but the most active contributor asked us to change the logo from "Vikipediya" to" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx)
[13:20:43] <gmodena>	 sergi0 many thanks for the help
[13:21:18] <sergi0>	 gmodena: your patch is still syncing
[13:21:31] <gmodena>	 sergi0 ack
[13:22:08] <wikibugs>	 (03PS1) 10Elukey: knative-serving: increase failure-threshold for the webhook pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/954054
[13:22:15] <wikibugs>	 (03PS4) 10Anzx: tlywiki: add metanamespace , timezone, sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316)
[13:22:26] <wikibugs>	 (03PS38) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[13:22:52] <wikibugs>	 (03PS2) 10Elukey: knative-serving: increase failure-threshold for the webhook pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/954054
[13:23:04] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1047.eqiad.wmnet
[13:23:11] <wikibugs>	 (03CR) 10Anzx: tlywiki: add metanamespace , timezone, sitename (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx)
[13:23:25] <wikibugs>	 (03PS3) 10Elukey: knative-serving: increase failure-threshold for the webhook pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/954054
[13:23:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (11) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:23:52] <wikibugs>	 (03PS6) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669)
[13:24:09] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2046.codfw.wmnet
[13:24:38] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet
[13:24:42] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1048.eqiad.wmnet
[13:25:01] <logmsgbot>	 !log sgimeno@deploy1002 Finished scap: Backport for [[gerrit:951929|Remove rc1.mediawiki.page_content_change stream (T307959)]] (duration: 10m 33s)
[13:25:09] <stashbot>	 T307959: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959
[13:25:17] <sergi0>	 gmodena: the change is live
[13:25:30] <gmodena>	 sergi0 ack. All looks good.
[13:25:33] <gmodena>	 thanks again
[13:25:33] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a1-codfw.mgmt.codfw.wmnet
[13:25:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by sgimeno@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952928 (https://phabricator.wikimedia.org/T308098) (owner: 10Sohom Datta)
[13:25:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[13:26:37] <sergi0>	 you are welcome
[13:26:37] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: fix gnmi relabel [puppet] - 10https://gerrit.wikimedia.org/r/954053 (https://phabricator.wikimedia.org/T326322)
[13:26:47] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] knative-serving: increase failure-threshold for the webhook pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/954054 (owner: 10Elukey)
[13:27:15] <wikibugs>	 (03Merged) 10jenkins-bot: Allow loading Edit-in-Sequence as a beta feature on Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952928 (https://phabricator.wikimedia.org/T308098) (owner: 10Sohom Datta)
[13:27:33] <logmsgbot>	 !log sgimeno@deploy1002 Started scap: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]]
[13:27:39] <stashbot>	 T308098: Integrate edit-in-sequence inside ProofreadPage - https://phabricator.wikimedia.org/T308098
[13:27:55] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 13 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43086/console" [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond)
[13:27:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52208 and previous config saved to /var/cache/conftool/dbconfig/20230831-132759-ladsgroup.json
[13:28:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[13:28:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1132.eqiad.wmnet onto db1119.eqiad.wmnet
[13:28:05] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[13:28:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[13:28:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1224 (T343718)', diff saved to https://phabricator.wikimedia.org/P52209 and previous config saved to /var/cache/conftool/dbconfig/20230831-132820-ladsgroup.json
[13:28:23] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 #page on db1132 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6039.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:28:34] <sukhe>	 hello
[13:28:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (11) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:28:41] <moritzm>	 around
[13:28:45] * Emperor here
[13:28:46] <sukhe>	 depool?
[13:28:57] <claime>	 here too
[13:29:02] <Amir1>	 I'm around now
[13:29:04] <Amir1>	 let me check
[13:29:06] <taavi>	 expired downtime? https://sal.toolforge.org/production?p=0&q=db1132&d=
[13:29:06] <sukhe>	 Amir1: thanks
[13:29:11] <logmsgbot>	 !log sgimeno@deploy1002 sgimeno and soda: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:29:19] <Amir1>	 first depool
[13:29:26] <sukhe>	 doing
[13:29:28] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:29:36] <taavi>	 it's not pooled in the first place
[13:29:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:29:58] <sukhe>	 from sal?
[13:30:02] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet
[13:30:14] <taavi>	 at least it's not visible on https://noc.wikimedia.org/db.php
[13:30:22] <taavi>	 and SAL shows marostegui doing maintenance on it earlier today
[13:30:28] <sergi0>	 Sohom_Datta: you can test
[13:30:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T343718)', diff saved to https://phabricator.wikimedia.org/P52210 and previous config saved to /var/cache/conftool/dbconfig/20230831-133029-ladsgroup.json
[13:30:37] <sukhe>	 nothing to commit
[13:30:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix gnmi relabel [puppet] - 10https://gerrit.wikimedia.org/r/954053 (https://phabricator.wikimedia.org/T326322) (owner: 10Filippo Giunchedi)
[13:30:44] <Amir1>	 good, it's not pooled
[13:30:50] <wikibugs>	 (03PS7) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669)
[13:30:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[13:31:04] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device ssw1-e1-eqiad
[13:31:15] <Amir1>	 the downtime should be for 24 hours
[13:31:27] <Sohom_Datta>	 On it
[13:31:40] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1010.eqiad.wmnet with reason: host reimage
[13:31:44] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:11] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host doh2002.wikimedia.org with OS bookworm
[13:32:18] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:21] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host doh2002.wikimedia.org with OS bookworm
[13:32:42] <Amir1>	 sigh, I hate this thing with cookbooks
[13:32:45] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:32:50] <Amir1>	 it downtimed the host for 48 hours
[13:32:58] <Amir1>	 but removed the downtime once the clone was done
[13:33:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond)
[13:33:07] <Amir1>	 it has happened before
[13:33:19] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-e1-eqiad
[13:33:27] <volans>	 Amir1: which cookbook?
[13:33:33] <Amir1>	 (in another cookbook)
[13:33:40] <Amir1>	 clone cookbook and upgrade
[13:33:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[13:33:52] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[13:33:56] <Amir1>	 with self.alerting_hosts(hosts_to_downtime).downtimed(self.admin_reason, duration=timedelta(hours=48)):
[13:34:10] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:34:19] <Sohom_Datta>	 sergi0: Looks good :)
[13:34:26] <volans>	 Amir1: you can wait for icinga being optimal
[13:34:27] <Sohom_Datta>	 Tested on enwikisource
[13:34:28] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device ssw1-f1-eqiad
[13:34:49] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[13:34:56] <sergi0>	 syncing
[13:35:00] <logmsgbot>	 !log sgimeno@deploy1002 sgimeno and soda: Continuing with sync
[13:35:02] <volans>	 or add any other check before exiting the context manager
[13:35:31] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.tls (exit_code=97) for network device ssw1-f1-eqiad
[13:35:42] <jbond>	 !log swap puppetdb-api and puppetdb-api-next gerrit:940384
[13:35:43] <wikibugs>	 (03PS8) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669)
[13:35:48] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:35:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb-api: swap the production and next environments [puppet] - 10https://gerrit.wikimedia.org/r/940384 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[13:35:54] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wdqs1010.eqiad.wmnet with reason: host reimage
[13:35:57] <Amir1>	 volans: any way to tell it not to remove the downtime?
[13:36:04] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:12] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:36:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:31] <volans>	 what's the problem? you can 1) check for icinga optimal before exiting the context manager so that when it exits icinga is all green
[13:36:33] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1048.eqiad.wmnet
[13:36:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (11) High Kubernetes API latency (PATCH configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:36:50] <volans>	 2) don't use the context manager and just set the downtime, paying the price it will be downtimed for longer
[13:36:50] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:37:03] <volans>	 3) add any custom check to ensure your host is happy before removing the downtime
[13:37:41] <volans>	 unfortunately there is no concept of "all optimal" in the alertmanager world
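The three options volans lists can be sketched as follows. This is a minimal illustration, not the cookbook's actual code: `FakeMonitoring`, `wait_until`, the host name `db-example`, and the lag values are all hypothetical stand-ins and do not reflect the real spicerack API. It only shows that a downtime context manager removes the downtime on exit, so any health check has to run before leaving the `with` block.

```python
from contextlib import contextmanager
from datetime import timedelta
import time


class FakeMonitoring:
    """Hypothetical stand-in for a monitoring API (not the real spicerack interface)."""

    def __init__(self):
        self.downtimed = set()

    def set_downtime(self, host, duration):
        self.downtimed.add(host)

    def remove_downtime(self, host):
        self.downtimed.discard(host)

    @contextmanager
    def downtime(self, host, duration):
        # Downtime is set on entry and always removed on exit.
        self.set_downtime(host, duration)
        try:
            yield
        finally:
            self.remove_downtime(host)


def wait_until(check, attempts=5, delay=0.0):
    """Poll check() until it returns True; raise if it never does."""
    for _ in range(attempts):
        if check():
            return
        time.sleep(delay)
    raise RuntimeError("host did not become healthy in time")


monitoring = FakeMonitoring()
lag_samples = iter([120.0, 30.0, 0.5])  # made-up replication lag readings, in seconds

# Options 1 and 3: block inside the context manager until a check passes,
# so the downtime is only removed once the host is healthy.
with monitoring.downtime("db-example", timedelta(hours=48)):
    wait_until(lambda: next(lag_samples) < 1.0)
    still_downtimed_during_check = "db-example" in monitoring.downtimed
removed_after_exit = "db-example" not in monitoring.downtimed

# Option 2: set the downtime without the context manager and let it expire by itself.
monitoring.set_downtime("db-example", timedelta(hours=48))
kept_after_option_2 = "db-example" in monitoring.downtimed
```

The trade-off matches the discussion: waiting inside the block keeps the cookbook running for as long as the check takes, while option 2 returns immediately but leaves the host downtimed for the full duration.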
[13:37:46] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) resolved: (2) Device cr1-codfw.wikimedia.org recovered from Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[13:37:55] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: drop 'cluster' for gnmi job [puppet] - 10https://gerrit.wikimedia.org/r/954055 (https://phabricator.wikimedia.org/T326322)
[13:38:12] <wikibugs>	 (03CR) 10AOkoth: vrts: apply role and setup hiera values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[13:38:16] <Amir1>	 I'll go with the second option
[13:38:23] <volans>	 why not 1?
[13:38:29] <volans>	 it's 2 lines of code
[13:38:34] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2048.codfw.wmnet
[13:38:38] <wikibugs>	 (03PS9) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669)
[13:38:38] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1049.eqiad.wmnet
[13:38:44] <volans>	 if the alert comes from icinga
[13:38:53] <jinxer-wm>	 (KubernetesAPILatency) resolved: (12) High Kubernetes API latency (PATCH configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:39:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52211 and previous config saved to /var/cache/conftool/dbconfig/20230831-133905-ladsgroup.json
[13:39:08] <Amir1>	 because it could take even a day for the replica to catch up
[13:39:11] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[13:39:43] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] "\o/ working on tools also:" [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[13:39:55] <marostegui>	 Downtime expired 
[13:40:20] <volans>	 ack
[13:40:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: drop 'cluster' for gnmi job [puppet] - 10https://gerrit.wikimedia.org/r/954055 (https://phabricator.wikimedia.org/T326322) (owner: 10Filippo Giunchedi)
[13:40:39] <wikibugs>	 (03CR) 10David Caro: "Hmm... those errors seem unrelated :/" [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[13:40:48] <wikibugs>	 (03PS1) 10FNegri: [openstack] upgrade codfw1dev to Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/954056 (https://phabricator.wikimedia.org/T341285)
[13:40:49] <godog>	 jbond: merging your change too
[13:40:57] <marostegui>	 actually I downtimed this host for 24h
[13:40:59] <marostegui>	 Why did it page?
[13:41:17] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 #page on db1132 is OK: OK slave_sql_lag Replication lag: 26.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:41:35] <Amir1>	 marostegui: because the cookbook removes the downtime
[13:41:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52212 and previous config saved to /var/cache/conftool/dbconfig/20230831-134136-root.json
[13:41:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:41:46] <marostegui>	 Amir1: aaaah ok!
[13:41:56] <sergi0>	 does anyone know whether any action needs to be taken, and which, when 1 proxy had sync errors during scap?
[13:42:34] <logmsgbot>	 !log sgimeno@deploy1002 Finished scap: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] (duration: 15m 00s)
[13:42:39] <stashbot>	 T308098: Integrate edit-in-sequence inside ProofreadPage - https://phabricator.wikimedia.org/T308098
[13:42:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Jclark-ctr) @wiki_willy  @Marostegui  @RobH    can we get some clarification on racking. ticket list Speed:1G Vlan. but came with 10g cards and on procurement doc  list 10g....
[13:43:53] <jinxer-wm>	 (KubernetesAPILatency) resolved: (13) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:44:09] <jbond>	 godog: ack thanks
[13:44:10] <claime>	 sergi0: That issue probably is bad timing between the appserver reboots we're doing and deployment, was it a codfw host?
[13:44:18] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 13 CORE_DIFF 6 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43089/console" [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond)
[13:44:34] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2048.codfw.wmnet
[13:44:36] <wikibugs>	 (03CR) 10Jbond: "Latest pcc i think looks good, it has some differences but i think that's the change you want" [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond)
[13:44:37] <sergi0>	 claime: mw2259.codfw.wmnet indeed
[13:44:46] <wikibugs>	 (03PS1) 10Arnaudb: adding arnaudb to proper groups [puppet] - 10https://gerrit.wikimedia.org/r/953491
[13:44:52] <claime>	 sergi0: yeah, it's just been rebooted
[13:45:15] <sergi0>	 claime: the backport process ended up failing though, how do I proceed with this?
[13:45:20] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1049.eqiad.wmnet
[13:45:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] adding arnaudb to proper groups [puppet] - 10https://gerrit.wikimedia.org/r/953491 (owner: 10Arnaudb)
[13:45:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P52213 and previous config saved to /var/cache/conftool/dbconfig/20230831-134535-ladsgroup.json
[13:46:01] <claime>	 Well it's back up now, so I guess you can redo the backport, but I'd like someone that knows more about the deployment process than me to weigh in, Amir1 ?
[13:46:06] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2049.codfw.wmnet
[13:46:10] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1050.eqiad.wmnet
[13:46:22] <Amir1>	 yeah, just redo the backport
[13:46:41] <sergi0>	 alright, thanks
[13:46:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:47:05] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh2002.wikimedia.org with reason: host reimage
[13:47:08] <logmsgbot>	 !log sgimeno@deploy1002 Started scap: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]]
[13:47:19] <gehel>	 inflatador, ryankemper: would you have time to look into the readahead failure above? ^^^
[13:48:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report
[13:48:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[13:48:41] <wikibugs>	 (03PS2) 10EoghanGaffney: gitlab: Remove swift configs and return gitlab1003 to restore group [puppet] - 10https://gerrit.wikimedia.org/r/953193
[13:48:46] <logmsgbot>	 !log sgimeno@deploy1002 soda and sgimeno: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:48:51] <stashbot>	 T308098: Integrate edit-in-sequence inside ProofreadPage - https://phabricator.wikimedia.org/T308098
[13:49:08] <logmsgbot>	 !log sgimeno@deploy1002 soda and sgimeno: Continuing with sync
[13:49:09] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 173
[13:49:35] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 173
[13:50:04] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh2002.wikimedia.org with reason: host reimage
[13:50:17] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43090/console" [puppet] - 10https://gerrit.wikimedia.org/r/953193 (owner: 10EoghanGaffney)
[13:51:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:51:56] <inflatador>	 gehel :eyes
[13:52:37] <wikibugs>	 (03PS1) 10Ladsgroup: mysql: Stop removing the downtime after clone is done [cookbooks] - 10https://gerrit.wikimedia.org/r/954059
[13:52:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:52:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Marostegui) We order them with 10G just in case, but we only use the 1G one.
[13:53:21] <wikibugs>	 (03PS3) 10EoghanGaffney: gitlab: Remove swift configs and return gitlab1003 to restore group [puppet] - 10https://gerrit.wikimedia.org/r/953193
[13:53:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testreduce1002.eqiad.wmnet
[13:53:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:53:55] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2049.codfw.wmnet
[13:54:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[13:54:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P52214 and previous config saved to /var/cache/conftool/dbconfig/20230831-135411-ladsgroup.json
[13:54:27] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet
[13:54:52] <wikibugs>	 (03PS4) 10EoghanGaffney: gitlab: Remove swift configs and return gitlab1003 to restore group [puppet] - 10https://gerrit.wikimedia.org/r/953193
[13:55:18] <wikibugs>	 (03CR) 10Muehlenhoff: adding arnaudb to proper groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953491 (owner: 10Arnaudb)
[13:56:01] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43091/console" [puppet] - 10https://gerrit.wikimedia.org/r/953193 (owner: 10EoghanGaffney)
[13:56:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testreduce1002.eqiad.wmnet - jmm@cumin2002"
[13:56:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (10ayounsi) We have data https://grafana.wikimedia.org/d/iUATvNzSz/network-queues ! And a doc: https://wikitech.wikimedia.org/wiki/Netwo...
[13:56:30] <wikibugs>	 (03PS2) 10Arnaudb: admin [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343)
[13:56:42] <logmsgbot>	 !log sgimeno@deploy1002 Finished scap: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] (duration: 09m 33s)
[13:56:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 3%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52215 and previous config saved to /var/cache/conftool/dbconfig/20230831-135641-root.json
[13:56:47] <stashbot>	 T308098: Integrate edit-in-sequence inside ProofreadPage - https://phabricator.wikimedia.org/T308098
[13:56:49] <wikibugs>	 10ops-codfw: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T345356 (10phaultfinder)
[13:56:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testreduce1002.eqiad.wmnet - jmm@cumin2002"
[13:56:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:56:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache testreduce1002.eqiad.wmnet on all recursors
[13:56:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testreduce1002.eqiad.wmnet on all recursors
[13:57:02] <sergi0>	 Amir1: 2 (other) hosts had scap-cdb-rebuild errors. Should I redo the backport again? Or on the contrary pause until hosts are back
[13:57:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testreduce1002.eqiad.wmnet - jmm@cumin2002"
[13:57:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb)
[13:58:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testreduce1002.eqiad.wmnet - jmm@cumin2002"
[13:58:13] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mysql: Stop removing the downtime after clone is done [cookbooks] - 10https://gerrit.wikimedia.org/r/954059 (owner: 10Ladsgroup)
[13:58:15] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1050.eqiad.wmnet
[13:58:25] <wikibugs>	 (03PS3) 10Arnaudb: admin: Add arnaudb to root user group As part of his onboarding we have arnaudb doing the modifications and asked him to remove his modifications [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343)
[13:58:52] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1051.eqiad.wmnet
[13:58:57] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:59:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:59:10] <sergi0>	 retrying
[13:59:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin: Add arnaudb to root user group As part of his onboarding we have arnaudb doing the modifications and asked him to remove his modifications [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb)
[13:59:41] <logmsgbot>	 !log sgimeno@deploy1002 Started scap: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]]
[13:59:48] <wikibugs>	 (03PS1) 10Gehel: java: introduce a standard list of GC logging options for Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/954060 (https://phabricator.wikimedia.org/T345355)
[13:59:50] <wikibugs>	 (03PS1) 10Gehel: query_service: use the standard GC logging options [puppet] - 10https://gerrit.wikimedia.org/r/954061 (https://phabricator.wikimedia.org/T345355)
[14:00:01] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:00:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testreduce1002.eqiad.wmnet with OS bookworm
[14:00:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] java: introduce a standard list of GC logging options for Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/954060 (https://phabricator.wikimedia.org/T345355) (owner: 10Gehel)
[14:00:35] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet
[14:00:41] <wikibugs>	 (03PS4) 10Arnaudb: admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343)
[14:00:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P52216 and previous config saved to /var/cache/conftool/dbconfig/20230831-140041-ladsgroup.json
[14:01:09] <wikibugs>	 (03CR) 10Arnaudb: admin: Add arnaudb to root user group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb)
[14:01:10] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:01:17] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet
[14:01:19] <logmsgbot>	 !log sgimeno@deploy1002 sgimeno and soda: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[14:01:24] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:24] <logmsgbot>	 !log sgimeno@deploy1002 sgimeno and soda: Continuing with sync
[14:01:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb)
[14:02:06] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[14:03:02] <wikibugs>	 (03PS1) 10Hashar: gitlab: add project_features parameter to profile [puppet] - 10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231)
[14:03:04] <wikibugs>	 (03PS1) 10Hashar: gitlab: disable issue tracker by default on devtools [puppet] - 10https://gerrit.wikimedia.org/r/954064 (https://phabricator.wikimedia.org/T264231)
[14:03:06] <wikibugs>	 (03PS1) 10Hashar: gitlab: disable issue tracker by default on production [puppet] - 10https://gerrit.wikimedia.org/r/954065 (https://phabricator.wikimedia.org/T264231)
[14:04:20] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:04:44] <icinga-wm>	 RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:06] <wikibugs>	 (03PS5) 10Arnaudb: admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343)
[14:05:43] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin1001"
[14:05:44] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1051.eqiad.wmnet
[14:06:22] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I have cherry picked it on the devtools Puppet master." [puppet] - 10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar)
[14:06:36] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1052.eqiad.wmnet
[14:06:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "LGTM, but I would like to see a PCC for cloudgw at least." [puppet] - 10https://gerrit.wikimedia.org/r/953654 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond)
[14:06:43] <jinxer-wm>	 (SystemdUnitFailed) resolved: (3) elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:06:57] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2051.codfw.wmnet
[14:07:18] <logmsgbot>	 !log sgimeno@deploy1002 Finished scap: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] (duration: 07m 36s)
[14:07:23] <stashbot>	 T308098: Integrate edit-in-sequence inside ProofreadPage - https://phabricator.wikimedia.org/T308098
[14:07:31] <Sohom_Datta>	 yay!
[14:07:32] <sergi0>	 Sohom_Datta: your change is finally live
[14:07:35] <sergi0>	 :)
[14:07:35] <jinxer-wm>	 (KubernetesAPILatency) firing: (8) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:07:40] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2052.codfw.wmnet
[14:07:41] <sergi0>	 going for mine
[14:07:45] <wikibugs>	 (03CR) 10Jon Harald Søby: "The wordmark still looks weird. I uploaded a new version of it to Commons now; could you update that one here as well?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx)
[14:07:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by sgimeno@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954004 (https://phabricator.wikimedia.org/T308138) (owner: 10Sergio Gimeno)
[14:08:37] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: enable AddLink backend for swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954004 (https://phabricator.wikimedia.org/T308138) (owner: 10Sergio Gimeno)
[14:08:42] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "I have miss read how the template is expanded. It requires all features to be listed in order to enable them when they are enabled by defa" [puppet] - 10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar)
[14:08:57] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:09:02] <logmsgbot>	 !log sgimeno@deploy1002 Started scap: Backport for [[gerrit:954004|GrowthExperiments: enable AddLink backend for swwiki (T308138 T308139)]]
[14:09:10] <stashbot>	 T308138: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138
[14:09:10] <stashbot>	 T308139: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139
[14:09:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P52217 and previous config saved to /var/cache/conftool/dbconfig/20230831-140917-ladsgroup.json
[14:10:14] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/953990 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[14:10:43] <logmsgbot>	 !log sgimeno@deploy1002 sgimeno: Backport for [[gerrit:954004|GrowthExperiments: enable AddLink backend for swwiki (T308138 T308139)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[14:10:58] <logmsgbot>	 !log sgimeno@deploy1002 sgimeno: Continuing with sync
[14:11:25] <wikibugs>	 (03PS39) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[14:11:27] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/954052 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah)
[14:11:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52219 and previous config saved to /var/cache/conftool/dbconfig/20230831-141146-root.json
[14:11:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testreduce1002.eqiad.wmnet with reason: host reimage
[14:12:16] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] service: add media-analytics service entry [puppet] - 10https://gerrit.wikimedia.org/r/951901 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan)
[14:12:35] <jinxer-wm>	 (KubernetesAPILatency) resolved: (8) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:12:39] <wikibugs>	 (03PS1) 10Ayounsi: gNMIc: add interface description as metrics tag [puppet] - 10https://gerrit.wikimedia.org/r/954066 (https://phabricator.wikimedia.org/T326322)
[14:13:05] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1052.eqiad.wmnet
[14:13:35] <wikibugs>	 (03PS6) 10Arnaudb: admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343)
[14:14:35] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2052.codfw.wmnet
[14:14:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[14:15:41] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2053.codfw.wmnet
[14:15:44] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:15:45] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1053.eqiad.wmnet
[14:15:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T343718)', diff saved to https://phabricator.wikimedia.org/P52220 and previous config saved to /var/cache/conftool/dbconfig/20230831-141547-ladsgroup.json
[14:15:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[14:15:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[14:15:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] networking fact: Remove check for stretch [puppet] - 10https://gerrit.wikimedia.org/r/953960 (owner: 10Muehlenhoff)
[14:16:02] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[14:16:22] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:16:22] <wikibugs>	 (03PS7) 10Arnaudb: admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343)
[14:16:37] <logmsgbot>	 !log sgimeno@deploy1002 Finished scap: Backport for [[gerrit:954004|GrowthExperiments: enable AddLink backend for swwiki (T308138 T308139)]] (duration: 07m 34s)
[14:16:40] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:16:44] <stashbot>	 T308138: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138
[14:16:44] <stashbot>	 T308139: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139
[14:16:46] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on testreduce1002.eqiad.wmnet with reason: host reimage
[14:17:36] <wikibugs>	 (PS6) Sergio Gimeno: GrowthExperiments: enable add a link in 12th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137)
[14:17:43] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh2002.wikimedia.org with OS bookworm
[14:17:53] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host doh2002.wikimedia.org with OS bookworm completed: - doh2002 (**PASS**)   - Downtimed on Icinga/Al...
[14:17:55] <wikibugs>	 (CR) Jcrespo: [C: +1] admin: Add arnaudb to root user group [puppet] - https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: Arnaudb)
[14:18:05] <wikibugs>	 (PS4) Sergio Gimeno: GrowthExperiments: enable AddLink frontend 13th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138)
[14:18:42] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ssingh)
[14:18:48] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:18:57] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:19:02] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (jcrespo) p:Triage→High a:joanna_borun
[14:19:19] <wikibugs>	 (CR) Arnaudb: [C: +2] admin: Add arnaudb to root user group [puppet] - https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: Arnaudb)
[14:19:52] <wikibugs>	 (PS1) Cwhite: alertmanager: emit helpful info for DatasourceError alerts [puppet] - https://gerrit.wikimedia.org/r/953492 (https://phabricator.wikimedia.org/T345358)
[14:19:54] <wikibugs>	 (PS1) Cwhite: logstash: send generatorURL to labels [puppet] - https://gerrit.wikimedia.org/r/953493 (https://phabricator.wikimedia.org/T345358)
[14:20:07] <wikibugs>	 (CR) Arnaudb: [C: +1] admin: Add arnaudb to root user group [puppet] - https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: Arnaudb)
[14:20:14] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:20:24] <wikibugs>	 (PS2) Cwhite: logstash: send generatorURL to labels [puppet] - https://gerrit.wikimedia.org/r/953493 (https://phabricator.wikimedia.org/T345358)
[14:20:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (VRiley-WMF) pc1015 - A 6. U 33. Port 32. Cableid: 2839
[14:20:44] <wikibugs>	 (CR) Jcrespo: [C: +1] "I have verified everything, LGTM, only missing an additional +1 and Foundations or Director approval." [puppet] - https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: Arnaudb)
[14:21:12] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] gNMIc: add interface description as metrics tag [puppet] - https://gerrit.wikimedia.org/r/954066 (https://phabricator.wikimedia.org/T326322) (owner: Ayounsi)
[14:22:11] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1053.eqiad.wmnet
[14:22:42] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:50] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1054.eqiad.wmnet
[14:23:05] <wikibugs>	 (CR) Cwhite: "I'm not sure I like gating on alertname, but this class of alerts (along with DatasourceNoData) are "special" in a sense. Please let me kn" [puppet] - https://gerrit.wikimedia.org/r/953492 (https://phabricator.wikimedia.org/T345358) (owner: Cwhite)
[14:23:58] <wikibugs>	 (CR) Ayounsi: [C: +2] gNMIc: add interface description as metrics tag [puppet] - https://gerrit.wikimedia.org/r/954066 (https://phabricator.wikimedia.org/T326322) (owner: Ayounsi)
[14:24:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52221 and previous config saved to /var/cache/conftool/dbconfig/20230831-142424-ladsgroup.json
[14:24:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[14:24:31] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[14:24:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[14:24:40] <wikibugs>	 (CR) Urbanecm: [C: +1] GrowthExperiments: enable add a link in 12th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) (owner: Sergio Gimeno)
[14:24:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52222 and previous config saved to /var/cache/conftool/dbconfig/20230831-142445-ladsgroup.json
[14:24:53] <wikibugs>	 (CR) Urbanecm: [C: +1] GrowthExperiments: enable AddLink frontend 13th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138) (owner: Sergio Gimeno)
[14:24:55] <wikibugs>	 (PS1) Hnowlan: service: move geo-analytics and media-analytics to production [puppet] - https://gerrit.wikimedia.org/r/954067 (https://phabricator.wikimedia.org/T336380)
[14:25:01] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a2-codfw.mgmt.codfw.wmnet
[14:25:02] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[14:25:28] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52223 and previous config saved to /var/cache/conftool/dbconfig/20230831-142651-root.json
[14:26:59] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm now, thank you!" [puppet] - https://gerrit.wikimedia.org/r/953193 (owner: EoghanGaffney)
[14:27:01] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2053.codfw.wmnet
[14:27:43] <wikibugs>	 (PS1) Muehlenhoff: package_builder: Clean up lintian setup [puppet] - https://gerrit.wikimedia.org/r/954068
[14:27:55] <wikibugs>	 (PS1) Jelto: miscweb/microsites: move monitoring of research pages to monitoring profile [puppet] - https://gerrit.wikimedia.org/r/954069 (https://phabricator.wikimedia.org/T334511)
[14:27:57] <wikibugs>	 (PS1) Jelto: miscweb/microsites: remove wikiworkshop and research resources [puppet] - https://gerrit.wikimedia.org/r/954070 (https://phabricator.wikimedia.org/T334511)
[14:28:17] <wikibugs>	 (CR) CI reject: [V: -1] package_builder: Clean up lintian setup [puppet] - https://gerrit.wikimedia.org/r/954068 (owner: Muehlenhoff)
[14:28:22] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:28:36] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) resolved: Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[14:28:39] <wikibugs>	 (PS2) Anzx: tlywiki: Add logos [mediawiki-config] - https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316)
[14:29:00] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1054.eqiad.wmnet
[14:29:19] <wikibugs>	 (PS1) Cathal Mooney: Remove parents for spine switches Eqiad row E/F [puppet] - https://gerrit.wikimedia.org/r/954071 (https://phabricator.wikimedia.org/T329272)
[14:29:36] <wikibugs>	 (CR) David Caro: replica_cnf_api: add envvars backend (1 comment) [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro)
[14:29:54] <wikibugs>	 (CR) Anzx: tlywiki: Add logos (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316) (owner: Anzx)
[14:31:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[14:31:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[14:31:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testreduce1002.eqiad.wmnet with OS bookworm
[14:31:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testreduce1002.eqiad.wmnet
[14:32:22] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Remove parents for spine switches Eqiad row E/F [puppet] - https://gerrit.wikimedia.org/r/954071 (https://phabricator.wikimedia.org/T329272) (owner: Cathal Mooney)
[14:32:44] <wikibugs>	 (CR) Ayounsi: [C: +1] Remove parents for spine switches Eqiad row E/F [puppet] - https://gerrit.wikimedia.org/r/954071 (https://phabricator.wikimedia.org/T329272) (owner: Cathal Mooney)
[14:34:26] <wikibugs>	 (CR) Jon Harald Søby: tlywiki: add metanamespace , timezone, sitename (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) (owner: Anzx)
[14:34:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet
[14:35:17] <wikibugs>	 (CR) Majavah: [C: +2] team-wmcs: Add CloudLB backend status checks [alerts] - https://gerrit.wikimedia.org/r/954052 (https://phabricator.wikimedia.org/T345294) (owner: Majavah)
[14:36:31] <wikibugs>	 (Merged) jenkins-bot: team-wmcs: Add CloudLB backend status checks [alerts] - https://gerrit.wikimedia.org/r/954052 (https://phabricator.wikimedia.org/T345294) (owner: Majavah)
[14:38:08] <wikibugs>	 (CR) Jon Harald Søby: [C: +1] "LGTM, thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316) (owner: Anzx)
[14:39:04] <wikibugs>	 (CR) JMeybohm: [C: +1] service: move geo-analytics and media-analytics to production [puppet] - https://gerrit.wikimedia.org/r/954067 (https://phabricator.wikimedia.org/T336380) (owner: Hnowlan)
[14:39:23] <wikibugs>	 (CR) Jelto: [C: +2] miscweb/microsites: move monitoring of research pages to monitoring profile [puppet] - https://gerrit.wikimedia.org/r/954069 (https://phabricator.wikimedia.org/T334511) (owner: Jelto)
[14:40:42] <wikibugs>	 (CR) Anzx: tlywiki: add metanamespace , timezone, sitename (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) (owner: Anzx)
[14:41:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52224 and previous config saved to /var/cache/conftool/dbconfig/20230831-144155-root.json
[14:42:19] <wikibugs>	 (CR) Jcrespo: [C: +1] "https://puppet-compiler.wmflabs.org/output/953491/43094/" [puppet] - https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: Arnaudb)
[14:42:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (VRiley-WMF) kubernetes1031 - A 6. U 37. port 30 Cableid 4017 kubernetes1030 - A 6. U 36. port 31 Cableid 1917 kubernetes1029 - A 6. U 35. port24 Cableid: 1947
[14:43:44] <wikibugs>	 (PS2) Hashar: gitlab: add default_project_features parameter to profile [puppet] - https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231)
[14:43:46] <wikibugs>	 (PS2) Hashar: gitlab: disable issue tracker by default on devtools [puppet] - https://gerrit.wikimedia.org/r/954064 (https://phabricator.wikimedia.org/T264231)
[14:43:48] <wikibugs>	 (PS2) Hashar: gitlab: disable issue tracker by default on production [puppet] - https://gerrit.wikimedia.org/r/954065 (https://phabricator.wikimedia.org/T264231)
[14:43:51] <wikibugs>	 (PS1) Hashar: gitlab: project_features > default_projects_features [puppet] - https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231)
[14:44:21] <wikibugs>	 (Abandoned) Hashar: gitlab: disable issue tracker by default on production [puppet] - https://gerrit.wikimedia.org/r/954065 (https://phabricator.wikimedia.org/T264231) (owner: Hashar)
[14:44:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52225 and previous config saved to /var/cache/conftool/dbconfig/20230831-144425-ladsgroup.json
[14:44:31] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[14:44:31] <wikibugs>	 (CR) CI reject: [V: -1] gitlab: project_features > default_projects_features [puppet] - https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231) (owner: Hashar)
[14:44:59] <wikibugs>	 (PS40) David Caro: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[14:45:01] <wikibugs>	 (PS1) David Caro: sonofagridengine: pin openstacksdk to <1.5.0 [puppet] - https://gerrit.wikimedia.org/r/954073
[14:45:36] <wikibugs>	 (Abandoned) Hashar: gitlab: disable issue tracker by default on devtools [puppet] - https://gerrit.wikimedia.org/r/954064 (https://phabricator.wikimedia.org/T264231) (owner: Hashar)
[14:46:06] <wikibugs>	 (CR) CI reject: [V: -1] gitlab: add default_project_features parameter to profile [puppet] - https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) (owner: Hashar)
[14:46:26] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] sonofagridengine: pin openstacksdk to <1.5.0 [puppet] - https://gerrit.wikimedia.org/r/954073 (owner: David Caro)
[14:46:28] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices1006.eqiad.wmnet with OS bullseye
[14:46:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet
[14:47:12] <wikibugs>	 (CR) David Caro: team-wmcs: Add Galera checks (1 comment) [alerts] - https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: Majavah)
[14:47:22] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:47:25] <wikibugs>	 SRE, Bitu, Infrastructure-Foundations, wikitech.wikimedia.org, Patch-For-Review: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (JJMC89)
[14:47:28] <wikibugs>	 (PS2) Hashar: gitlab: project_features > default_projects_features [puppet] - https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231)
[14:47:30] <wikibugs>	 (PS3) Hashar: gitlab: add default_project_features parameter to profile [puppet] - https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231)
[14:48:11] <wikibugs>	 (CR) David Caro: [C: +2] replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro)
[14:48:32] <wikibugs>	 (CR) David Caro: [C: +2] sonofagridengine: pin openstacksdk to <1.5.0 [puppet] - https://gerrit.wikimedia.org/r/954073 (owner: David Caro)
[14:48:48] <wikibugs>	 (CR) David Caro: [C: +2] replica_cnf_api: add envvars backend (1 comment) [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro)
[14:49:19] <wikibugs>	 (CR) Hashar: "Our parameter use singular form `project_features` whereas upstream it is `default_projects_features` (with plural form for project).  Ali" [puppet] - https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231) (owner: Hashar)
[14:50:16] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:51:36] <wikibugs>	 (PS1) Majavah: team-wmcs: Move response time alert to correct prometheus instance [alerts] - https://gerrit.wikimedia.org/r/954074
[14:52:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet
[14:52:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet
[14:53:11] <wikibugs>	 SRE, Bitu, Infrastructure-Foundations, wikitech.wikimedia.org, Patch-For-Review: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (MoritzMuehlenhoff)
[14:54:15] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: send generatorURL to labels [puppet] - https://gerrit.wikimedia.org/r/953493 (https://phabricator.wikimedia.org/T345358) (owner: Cwhite)
[14:54:56] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: send generatorURL to labels [puppet] - https://gerrit.wikimedia.org/r/953493 (https://phabricator.wikimedia.org/T345358) (owner: Cwhite)
[14:55:11] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] alertmanager: emit helpful info for DatasourceError alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/953492 (https://phabricator.wikimedia.org/T345358) (owner: Cwhite)
[14:55:16] <wikibugs>	 (PS2) Muehlenhoff: package_builder: Clean up lintian setup [puppet] - https://gerrit.wikimedia.org/r/954068
[14:56:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet
[14:56:59] <cwhite>	 dcaro: are your puppet changes ready for deploy?
[14:57:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52226 and previous config saved to /var/cache/conftool/dbconfig/20230831-145700-root.json
[14:57:41] <dcaro>	 cwhite: yes thanks!
[14:57:59] <cwhite>	 done
[14:58:04] <dcaro>	 thanks :)
[14:59:28] <wikibugs>	 SRE, Bitu, Infrastructure-Foundations, wikitech.wikimedia.org, Patch-For-Review: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (MoritzMuehlenhoff) I've rolled out CAS 6.6.11 with an additional patch which points to Bitu for password resets and signups.
[14:59:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P52227 and previous config saved to /var/cache/conftool/dbconfig/20230831-145931-ladsgroup.json
[14:59:56] <wikibugs>	 (Abandoned) Muehlenhoff: Point IDP login page to IDM for signup [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/927661 (https://phabricator.wikimedia.org/T338008) (owner: Muehlenhoff)
[14:59:58] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2054.codfw.wmnet
[15:00:07] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1055.eqiad.wmnet
[15:02:11] <wikibugs>	 (PS1) Majavah: wikitech: Disable password resets [mediawiki-config] - https://gerrit.wikimedia.org/r/954076 (https://phabricator.wikimedia.org/T345226)
[15:02:34] <wikibugs>	 (CR) Majavah: "Is IDM ready for this yet?" [mediawiki-config] - https://gerrit.wikimedia.org/r/954076 (https://phabricator.wikimedia.org/T345226) (owner: Majavah)
[15:05:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet
[15:05:12] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "Agree with David about having more specific runbooks, otherwise lgtm" [alerts] - https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: Majavah)
[15:06:03] <wikibugs>	 (PS1) DDesouza: Pre-deploy Campaigns Event Discovery survey [mediawiki-config] - https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158)
[15:06:31] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[15:06:43] <wikibugs>	 (PS2) DDesouza: Undeploy Research Incentive survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/950046 (https://phabricator.wikimedia.org/T336092)
[15:10:36] <wikibugs>	 (PS2) DDesouza: Pre-deploy Campaigns Event Discovery survey [mediawiki-config] - https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158)
[15:11:26] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1055.eqiad.wmnet
[15:11:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet
[15:11:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet
[15:12:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52228 and previous config saved to /var/cache/conftool/dbconfig/20230831-151205-root.json
[15:12:37] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin1001"
[15:12:38] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1010.eqiad.wmnet with OS bullseye
[15:12:57] <wikibugs>	 (CR) Marostegui: [C: +1] admin: Add arnaudb to root user group [puppet] - https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: Arnaudb)
[15:13:26] <wikibugs>	 (CR) David Caro: team-wmcs: Add Galera checks (1 comment) [alerts] - https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: Majavah)
[15:13:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] mesh: new configuration version [deployment-charts] - https://gerrit.wikimedia.org/r/953575 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[15:14:23] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P52229 and previous config saved to /var/cache/conftool/dbconfig/20230831-151437-ladsgroup.json
[15:14:47] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1056.eqiad.wmnet
[15:15:31] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[15:16:28] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] team-wmcs: Move response time alert to correct prometheus instance [alerts] - https://gerrit.wikimedia.org/r/954074 (owner: Majavah)
[15:16:41] <wikibugs>	 (CR) Majavah: [C: +2] team-wmcs: Move response time alert to correct prometheus instance [alerts] - https://gerrit.wikimedia.org/r/954074 (owner: Majavah)
[15:17:43] <wikibugs>	 (PS5) Anzx: tlywiki: add metanamespace , timezone, sitename [mediawiki-config] - https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316)
[15:17:53] <wikibugs>	 (Merged) jenkins-bot: team-wmcs: Move response time alert to correct prometheus instance [alerts] - https://gerrit.wikimedia.org/r/954074 (owner: Majavah)
[15:21:03] <wikibugs>	 (CR) Jon Harald Søby: [C: +1] tlywiki: add metanamespace , timezone, sitename [mediawiki-config] - https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) (owner: Anzx)
[15:21:15] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:21:26] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1056.eqiad.wmnet
[15:22:17] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2054.codfw.wmnet
[15:22:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet
[15:24:50] <jynus>	 !log extend backup1009 lv by additional 10TiB
[15:24:52] <wikibugs>	 (PS1) Volans: puppetdb: drop support for deprecated API v3 [software/cumin] - https://gerrit.wikimedia.org/r/954081
[15:24:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52230 and previous config saved to /var/cache/conftool/dbconfig/20230831-152710-root.json
[15:27:25] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:28:17] <wikibugs>	 SRE-tools, Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (Fabfur)
[15:28:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet
[15:29:06] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2055.codfw.wmnet
[15:29:10] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1057.eqiad.wmnet
[15:29:16] <wikibugs>	 (CR) Ladsgroup: [C: +1] "Thanks! Do you want me to merge it?" [puppet] - https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: Arnaudb)
[15:29:29] <wikibugs>	 (CR) Ladsgroup: [C: +2] mysql: Stop removing the downtime after clone is done [cookbooks] - https://gerrit.wikimedia.org/r/954059 (owner: Ladsgroup)
[15:29:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52231 and previous config saved to /var/cache/conftool/dbconfig/20230831-152943-ladsgroup.json
[15:29:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[15:29:49] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[15:29:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[15:30:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T343718)', diff saved to https://phabricator.wikimedia.org/P52232 and previous config saved to /var/cache/conftool/dbconfig/20230831-153005-ladsgroup.json
[15:30:20] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (VRiley-WMF) db1227 - A 7. U 24.
[15:31:01] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:49] <wikibugs>	 (CR) CI reject: [V: -1] puppetdb: drop support for deprecated API v3 [software/cumin] - https://gerrit.wikimedia.org/r/954081 (owner: Volans)
[15:32:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T343718)', diff saved to https://phabricator.wikimedia.org/P52233 and previous config saved to /var/cache/conftool/dbconfig/20230831-153217-ladsgroup.json
[15:32:21] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:29] <wikibugs>	 (Merged) jenkins-bot: mysql: Stop removing the downtime after clone is done [cookbooks] - https://gerrit.wikimedia.org/r/954059 (owner: Ladsgroup)
[15:35:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet
[15:35:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet
[15:35:56] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2055.codfw.wmnet
[15:36:26] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1057.eqiad.wmnet
[15:37:22] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) a:03Trizek-WMF @kamila, thank you for asking for our support.   We have a message ready for commu...
[15:37:40] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) p:05Triage→03High
[15:39:01] <moritzm>	 !log failover ganeti master in ulsfo to ganeti4005
[15:39:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:12] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10VRiley-WMF) db1235 - A 3. U 40. port 34 Cableid 1903
[15:39:21] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:39:21] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] service: move geo-analytics and media-analytics to production [puppet] - 10https://gerrit.wikimedia.org/r/954067 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan)
[15:39:52] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "The patch itself is ready, but we are waiting on ticket for Jobo's ok (to follow procedure)." [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb)
[15:40:19] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2056.codfw.wmnet
[15:40:32] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1058.eqiad.wmnet
[15:42:09] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/954068 (owner: 10Muehlenhoff)
[15:42:11] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:42:57] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[15:44:40] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T336380)
[15:44:46] <stashbot>	 T336380: AQS 2.0: Media Analytics Service - Deploy to staging and production - https://phabricator.wikimedia.org/T336380
[15:45:04] <wikibugs>	 (03PS2) 10Volans: puppetdb: drop support for deprecated API v3 [software/cumin] - 10https://gerrit.wikimedia.org/r/954081
[15:45:06] <wikibugs>	 (03PS1) 10Volans: tox.ini: add compatibility with newer Sphinx [software/cumin] - 10https://gerrit.wikimedia.org/r/954087
[15:45:08] <wikibugs>	 (03PS1) 10Volans: puppetdb: ignore bandit false positive B113 [software/cumin] - 10https://gerrit.wikimedia.org/r/954088
[15:45:40] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+2] "Adding Editing folks for visibility into this cherry-pick." [extensions/VisualEditor] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/954048 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra)
[15:45:57] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T336380)
[15:46:10] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2056.codfw.wmnet
[15:46:48] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+2] Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra)
[15:47:17] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Cookbooks could be more verbose in listing the completed/missing steps - https://phabricator.wikimedia.org/T345375 (10Fabfur)
[15:47:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P52234 and previous config saved to /var/cache/conftool/dbconfig/20230831-154724-ladsgroup.json
[15:48:12] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T336380)
[15:49:06] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet
[15:49:07] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T336380)
[15:49:26] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 04-2] "oops I shouldn't be +2ing backports." [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra)
[15:49:34] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 04-2] Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/954048 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra)
[15:49:50] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 04-2] "brain fart and loss of focus .. I shouldn't have been +2ing these." [extensions/VisualEditor] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/954048 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra)
[15:51:17] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:51:58] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (10Volans) For context some cookbooks that deem what they are doing dangerous already do that, for example the aforementioned `sre.hosts.reimage`...
[15:52:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetdb: ignore bandit false positive B113 [software/cumin] - 10https://gerrit.wikimedia.org/r/954088 (owner: 10Volans)
[15:52:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetdb: drop support for deprecated API v3 [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 (owner: 10Volans)
[15:53:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] tox.ini: add compatibility with newer Sphinx [software/cumin] - 10https://gerrit.wikimedia.org/r/954087 (owner: 10Volans)
[15:53:03] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:54:07] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:54:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:54:57] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1058.eqiad.wmnet
[15:55:17] <wikibugs>	 (03CR) 10Volans: "CI failures are due to https://github.com/pyparsing/pyparsing/issues/501 for which I've sent https://github.com/pyparsing/pyparsing/pull/5" [software/cumin] - 10https://gerrit.wikimedia.org/r/954087 (owner: 10Volans)
[15:55:52] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on cloudservices1006.eqiad.wmnet with reason: service bootstrap
[15:56:06] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on cloudservices1006.eqiad.wmnet with reason: service bootstrap
[15:57:09] <wikibugs>	 (03PS3) 10Volans: puppetdb: drop support for deprecated API v3 [software/cumin] - 10https://gerrit.wikimedia.org/r/954081
[15:57:25] <wikibugs>	 (03Abandoned) 10Volans: puppetdb: ignore bandit false positive B113 [software/cumin] - 10https://gerrit.wikimedia.org/r/954088 (owner: 10Volans)
[15:57:51] <wikibugs>	 (03PS2) 10Majavah: team-wmcs: Add Galera checks [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294)
[15:58:18] <wikibugs>	 (03CR) 10Majavah: team-wmcs: Add Galera checks (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah)
[15:58:20] <wikibugs>	 (03CR) 10Volans: "CI failures are due to https://github.com/pyparsing/pyparsing/issues/501 for which I've sent https://github.com/pyparsing/pyparsing/pull/5" [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 (owner: 10Volans)
[15:59:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:00:05] <jouncebot>	 jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:36] <wikibugs>	 (03CR) 10Herron: "opening up for feedback to get the ball rolling.  as-is it is broad in terms of affected hosts, so in addition to feedback on the patch it" [puppet] - 10https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) (owner: 10Herron)
[16:00:54] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:47] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Cookbooks could be more verbose in listing the completed/missing steps - https://phabricator.wikimedia.org/T345375 (10Volans) Improving the cookbook outputs and their readability is surely always a great idea. I'm not sure though what you are proposing as actionable....
[16:02:03] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet
[16:02:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet: drop deprecated ignorecache switch [software/spicerack] - 10https://gerrit.wikimedia.org/r/953990 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[16:02:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P52235 and previous config saved to /var/cache/conftool/dbconfig/20230831-160230-ladsgroup.json
[16:03:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetdb: drop support for deprecated API v3 [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 (owner: 10Volans)
[16:04:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/cumin] - 10https://gerrit.wikimedia.org/r/954087 (owner: 10Volans)
[16:04:56] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[16:05:40] <wikibugs>	 (03CR) 10David Caro: team-wmcs: Add Galera checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah)
[16:06:09] <wikibugs>	 (03Merged) 10jenkins-bot: puppet: drop deprecated ignorecache switch [software/spicerack] - 10https://gerrit.wikimedia.org/r/953990 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[16:06:52] <wikibugs>	 (03CR) 10Volans: [V: 03+2 C: 03+2] tox.ini: add compatibility with newer Sphinx [software/cumin] - 10https://gerrit.wikimedia.org/r/954087 (owner: 10Volans)
[16:09:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 (owner: 10Volans)
[16:09:52] <wikibugs>	 (03PS1) 10Bking: wdqs: re-enable alerts on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/954093 (https://phabricator.wikimedia.org/T344518)
[16:13:54] <wikibugs>	 (03PS3) 10Majavah: team-wmcs: Add Galera checks [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294)
[16:14:12] <wikibugs>	 (03CR) 10Majavah: team-wmcs: Add Galera checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah)
[16:17:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T343718)', diff saved to https://phabricator.wikimedia.org/P52236 and previous config saved to /var/cache/conftool/dbconfig/20230831-161736-ladsgroup.json
[16:17:44] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[16:19:34] <icinga-wm>	 PROBLEM - Host cp5018 is DOWN: CRITICAL - Time to live exceeded (10.132.0.18)
[16:19:34] <icinga-wm>	 PROBLEM - Host cp5019 is DOWN: CRITICAL - Time to live exceeded (10.132.0.19)
[16:19:34] <icinga-wm>	 PROBLEM - Host cp5023 is DOWN: CRITICAL - Time to live exceeded (10.132.0.34)
[16:19:34] <icinga-wm>	 PROBLEM - Host cp5028 is DOWN: CRITICAL - Time to live exceeded (10.132.0.25)
[16:19:34] <icinga-wm>	 PROBLEM - Host cp5025 is DOWN: CRITICAL - Time to live exceeded (10.132.0.36)
[16:19:34] <icinga-wm>	 PROBLEM - Host cp5030 is DOWN: CRITICAL - Time to live exceeded (10.132.0.27)
[16:19:35] <icinga-wm>	 PROBLEM - Host asw2-ulsfo is DOWN: CRITICAL - Time to live exceeded (10.128.128.7)
[16:19:39] <icinga-wm>	 PROBLEM - Host pfw3-codfw #page is DOWN: CRITICAL - Time to live exceeded (208.80.153.197)
[16:19:50] <icinga-wm>	 RECOVERY - Host cp5019 is UP: PING OK - Packet loss = 0%, RTA = 249.02 ms
[16:19:50] <icinga-wm>	 RECOVERY - Host cp5023 is UP: PING OK - Packet loss = 0%, RTA = 235.25 ms
[16:19:50] <icinga-wm>	 RECOVERY - Host cp5025 is UP: PING OK - Packet loss = 0%, RTA = 243.01 ms
[16:19:50] <icinga-wm>	 RECOVERY - Host cp5028 is UP: PING OK - Packet loss = 0%, RTA = 303.26 ms
[16:19:50] <icinga-wm>	 RECOVERY - Host cp5030 is UP: PING OK - Packet loss = 0%, RTA = 242.93 ms
[16:19:50] <icinga-wm>	 RECOVERY - Host cp5018 is UP: PING OK - Packet loss = 0%, RTA = 330.15 ms
[16:19:52] <icinga-wm>	 RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.54 ms
[16:19:55] <icinga-wm>	 RECOVERY - Host pfw3-codfw #page is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms
[16:20:13] <sukhe>	 hello
[16:21:16] <sukhe>	 hmm recovered so quickly that it didn't page on victorops but that's fine
[16:21:29] <sukhe>	 something did happen here
[16:25:27] <sukhe>	 XioNoX: topranks: ^ sorry for the late ping but this might be worth a look
[16:26:45] <sukhe>	 are we aware of any scheduled maintenance?
[16:27:53] <rzl>	 nothing's on the calendar AFAICT
[16:28:07] <sukhe>	 yeah, nothing to noc@ as well as I can see
[16:29:09] <sukhe>	 the closest thing I see is the revert https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/ad0775e516cae00163e4eb0bdf0da1077162d425%5E%21/#F0 but I am not sure how this can be related given that it was already reverted
[16:29:28] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:29:50] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1030.eqiad.wmnet
[16:30:08] <sukhe>	 208.80.153.220           Down      xe-1/1/1:3.0   6.000     2.000        3   
[16:31:02] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10RZamora-WMF) Thanks for claiming this Phab task 👍
[16:34:21] <wikibugs>	 (03PS2) 10Jelto: miscweb/microsites: remove wikiworkshop and research resources [puppet] - 10https://gerrit.wikimedia.org/r/954070 (https://phabricator.wikimedia.org/T334511)
[16:36:01] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] miscweb/microsites: remove wikiworkshop and research resources [puppet] - 10https://gerrit.wikimedia.org/r/954070 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto)
[16:36:44] <topranks>	 sukhe: hey just looking, not aware of anything no 
[16:36:58] <topranks>	 TTL exceeded suggests some routing issue though hmmm 
[16:37:13] <wikibugs>	 (03PS1) 10Majavah: openstack: Remove a bunch of Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/954102 (https://phabricator.wikimedia.org/T345294)
[16:37:17] <sukhe>	 yeah... which is I guess what makes me worried, even though it was a flap 
[16:37:35] <topranks>	 I'm only catching up with your later messages, was a transport link flapping?
[16:39:28] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:39:30] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43097/console" [puppet] - 10https://gerrit.wikimedia.org/r/954102 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah)
[16:39:39] <topranks>	 ^^^ this was after manual clearing of bfd session
[16:40:08] <sukhe>	 topranks: yeah that was .220 above or xe-4-2-0.cr1-eqiad.wikimedia.org but I am not sure if that's related (that's eqiad -> codfw though?)
[16:40:16] <sukhe>	 the cp5* are eqsin
[16:40:30] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a2-codfw - cmooney@cumin1001"
[16:41:18] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a2-codfw - cmooney@cumin1001"
[16:41:18] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:41:33] <topranks>	 sukhe: it could possibly be if traffic was getting sent eqiad->codfw and back again due to link flapping 
[16:41:44] <topranks>	 but I've no reason to suspect that for sure 
[16:42:03] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host restbase1030.eqiad.wmnet
[16:42:33] <topranks>	 that link was flapping up/down like mad since 16:11 alright 
[16:42:51] <sukhe>	 ah
[16:43:39] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10kamila) Thank you @Trizek-WMF ! The message looks good. Maybe I'd suggest replacing the word "first" with "prim...
[16:45:27] <topranks>	 sukhe: I'm gonna assume it was that alright.  Packets seeing best path via that link, then not, then seeing it again 
[16:45:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) Thanks, I submitted the on-boarding form, let's see what happens now.
[16:45:39] <topranks>	 we have the same for asw1-ulsfo there as well, so not just eqsin affected 
[16:45:57] <topranks>	 TTL exceeded essentially means packet was in a routing loop 
[16:46:14] <topranks>	 and likely reason for that is link flapping 
[16:47:43] <wikibugs>	 (03PS1) 10Majavah: icinga: Don't tie wikitech-static alerts to cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954104
[16:47:48] <sukhe>	 topranks: thanks for checking and confirming
[16:48:01] <topranks>	 np, I'll keep an eye on it, seems stable right now anyway
[16:48:02] <sukhe>	 since it was a flap, do you still think it merits a task? I can file one
[16:48:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] icinga: Don't tie wikitech-static alerts to cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954104 (owner: 10Majavah)
[16:48:17] <topranks>	 yeah I'm just doing one here 
[16:48:21] <sukhe>	 <3
[16:48:45] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] team-wmcs: Add Galera checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah)
[16:49:04] <wikibugs>	 (03CR) 10Ori: [C: 03+1] "I can merge this if you like." [puppet] - 10https://gerrit.wikimedia.org/r/952488 (https://phabricator.wikimedia.org/T321099) (owner: 10Jforrester)
[16:49:10] <wikibugs>	 (03PS2) 10Majavah: icinga: Don't tie wikitech-static alerts to cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954104
[16:49:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] icinga: Don't tie wikitech-static alerts to cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954104 (owner: 10Majavah)
[16:49:53] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] alertmanager: emit helpful info for DatasourceError alerts [puppet] - 10https://gerrit.wikimedia.org/r/953492 (https://phabricator.wikimedia.org/T345358) (owner: 10Cwhite)
[16:49:54] <icinga-wm>	 PROBLEM - Host restbase1030 is DOWN: PING CRITICAL - Packet loss = 100%
[16:51:20] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T345380 (10phaultfinder)
[16:51:55] <sukhe>	 uh oh, one more
[16:52:45] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] team-wmcs: Add Galera checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah)
[16:53:29] <wikibugs>	 (03PS3) 10Majavah: icinga: Don't tie wikitech-static alerts to cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954104
[16:53:57] <wikibugs>	 (03Merged) 10jenkins-bot: team-wmcs: Add Galera checks [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah)
[16:53:58] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[16:55:21] <sukhe>	 urandom: SAL suggests you were working on restbase1030
[16:55:46] <topranks>	 it's not TTL exceeded like the previous batch anyway (plus in eqiad)
[16:55:56] <urandom>	 sukhe: yes
[16:56:04] <sukhe>	 topranks: yeah, this one is definitely unrelated! 
[16:56:14] <urandom>	 sukhe: why, is it alerting?  I thought I downtimed it.
[16:56:31] <sukhe>	 urandom: no idea, just thought I should let you know in case some action is required :)
[16:56:32] <topranks>	 host down alert above yeah
[16:56:34] <topranks>	 no big deal 
[16:56:45] <sukhe>	 urandom: want me to downtime it again?
[16:57:03] <wikibugs>	 (03PS4) 10Majavah: icinga: Don't tie wikitech-static alerts to cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954104
[16:57:12] <urandom>	 sukhe: I just did :/
[16:58:23] <sukhe>	 even the bots are giving up today
[16:58:40] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump container to 2023-08-28-113303-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/954110
[16:59:10] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43100/console" [puppet] - 10https://gerrit.wikimedia.org/r/954104 (owner: 10Majavah)
[16:59:52] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2023-08-28-113303-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/954110 (owner: 10BryanDavis)
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1700)
[17:00:38] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container to 2023-08-28-113303-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/954110 (owner: 10BryanDavis)
[17:01:29] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:01:58] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:02:04] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:02:39] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:02:54] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:03:28] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:07:22] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[17:11:50] <cwhite>	 huh, figured jinxer would rejoin automagically but it appears not configured to do so.  probably will rejoin when an alert fires?
[17:12:21] <RhinosF1|Away>	 cwhite: it did that in #wikimedia-cloud-feed
[17:12:34] <cwhite>	 good to know, thanks :)
[17:12:42] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a2-codfw.mgmt.codfw.wmnet
[17:16:43] <wikibugs>	 (03PS1) 10Cwhite: alertmanager: add link to DatasourceError runbook [puppet] - 10https://gerrit.wikimedia.org/r/953495 (https://phabricator.wikimedia.org/T345358)
[17:18:07] <wikibugs>	 (03CR) 10Cwhite: "Any concerns about the overall message length?" [puppet] - 10https://gerrit.wikimedia.org/r/953495 (https://phabricator.wikimedia.org/T345358) (owner: 10Cwhite)
[17:32:26] <wikibugs>	 (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/954114
[17:43:53] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "Adam -- this is prep work for the upcoming OpenStack upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri)
[17:44:09] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "Adam -- this is prep work for the upcoming OpenStack upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/953252 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri)
[18:00:05] <jouncebot>	 jeena and dduvall: May I have your attention please! MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1800)
[18:04:32] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954117 (https://phabricator.wikimedia.org/T343726)
[18:04:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954117 (https://phabricator.wikimedia.org/T343726) (owner: 10TrainBranchBot)
[18:05:18] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954117 (https://phabricator.wikimedia.org/T343726) (owner: 10TrainBranchBot)
[18:12:01] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.24  refs T343726
[18:12:07] <stashbot>	 T343726: 1.41.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T343726
[18:20:01] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:39:55] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: use proper sparql endpoint [puppet] - 10https://gerrit.wikimedia.org/r/954119 (https://phabricator.wikimedia.org/T337296)
[18:40:19] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] wdqs: use proper sparql endpoint [puppet] - 10https://gerrit.wikimedia.org/r/954119 (https://phabricator.wikimedia.org/T337296) (owner: 10Ryan Kemper)
[18:40:26] <wikibugs>	 (03CR) 10Bking: [C: 03+1] wdqs: use proper sparql endpoint [puppet] - 10https://gerrit.wikimedia.org/r/954119 (https://phabricator.wikimedia.org/T337296) (owner: 10Ryan Kemper)
[18:40:32] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: use proper sparql endpoint [puppet] - 10https://gerrit.wikimedia.org/r/954119 (https://phabricator.wikimedia.org/T337296) (owner: 10Ryan Kemper)
[18:44:40] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[18:44:40] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[18:44:50] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[18:44:50] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[18:46:07] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[18:46:26] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:46:27] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[18:48:30] <wikibugs>	 (03Abandoned) 10Arlolra: Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/954048 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra)
[18:49:18] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:51:19] <wikibugs>	 (03PS1) 10Cathal Mooney: Correct sysctl value for net.ipv4.tcp_min_snd_mss [puppet] - 10https://gerrit.wikimedia.org/r/954120 (https://phabricator.wikimedia.org/T344829)
[18:54:54] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) (owner: 10Herron)
[18:56:55] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[18:57:32] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+1] Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra)
[19:03:21] <ryankemper>	 !log T344198 Temporarily disabling puppet on all `wdqs*` hosts in preparation for `wdqs.discovery.wmnet` certificate revocation
[19:03:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:33] <stashbot>	 T344198: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198
[19:03:41] <ryankemper>	 !log T344198 on `ryankemper@cumin1001`: `sudo -E cumin 'A:wdqs-all' 'sudo disable-puppet "revoking old cert and generating new one with new alt_names - T344198"'`
[19:03:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:31] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: WatchlistManager: Do not require watchlist rights for clearing talk page notification [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953660 (https://phabricator.wikimedia.org/T345031)
[19:07:41] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[19:09:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) wcqs-updater.service Failed on wcqs1001:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:14:05] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a3-codfw.mgmt.codfw.wmnet
[19:14:07] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[19:14:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: (4) wcqs-updater.service Failed on wcqs1001:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:14:47] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: new wqds.discovery cert [puppet] - 10https://gerrit.wikimedia.org/r/954123 (https://phabricator.wikimedia.org/T344198)
[19:16:03] <wikibugs>	 (03CR) 10Bking: [C: 03+1] wdqs: new wqds.discovery cert [puppet] - 10https://gerrit.wikimedia.org/r/954123 (https://phabricator.wikimedia.org/T344198) (owner: 10Ryan Kemper)
[19:17:23] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] wdqs: new wqds.discovery cert [puppet] - 10https://gerrit.wikimedia.org/r/954123 (https://phabricator.wikimedia.org/T344198) (owner: 10Ryan Kemper)
[19:21:04] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] WatchlistManager: Do not require watchlist rights for clearing talk page notification [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953660 (https://phabricator.wikimedia.org/T345031) (owner: 10Bartosz Dziewoński)
[19:28:33] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts wdqs1005.eqiad.wmnet
[19:30:18] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[19:30:57] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a3-codfw - cmooney@cumin1001"
[19:33:20] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox
[19:37:43] <wikibugs>	 (03PS7) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960
[19:37:45] <wikibugs>	 (03CR) 10Ebernhardson: Draft: cirrus streaming updater service (2 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson)
[19:41:06] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] Correct sysctl value for net.ipv4.tcp_min_snd_mss [puppet] - 10https://gerrit.wikimedia.org/r/954120 (https://phabricator.wikimedia.org/T344829) (owner: 10Cathal Mooney)
[19:44:35] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001"
[19:45:41] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host doh6002.wikimedia.org with OS bookworm
[19:45:51] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host doh6002.wikimedia.org with OS bookworm
[19:48:02] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Umherirrender) There is a (small) spike in grafana...
[19:48:31] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[19:49:58] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:50:24] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:51:33] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye
[19:51:41] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye
[19:53:11] <wikibugs>	 (03CR) 10Dr0ptp4kt: [openstack] remove deprecated option (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953252 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri)
[19:53:57] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:59:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2039.codfw.wmnet with OS bullseye
[19:59:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2038.codfw.wmnet with OS bullseye
[20:00:04] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye
[20:00:05] <jouncebot>	 brennen and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T2000).
[20:00:05] <jouncebot>	 arlolra, danisztls, and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:07] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2039.codfw.wmnet with OS bullseye
[20:00:07] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2038.codfw.wmnet with OS bullseye
[20:00:08] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye
[20:00:11] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2037.codfw.wmnet with OS bullseye
[20:00:14] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[20:00:18] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[20:00:24] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2037.codfw.wmnet with OS bullseye
[20:00:46] <MatmaRex>	 hi
[20:01:12] <TheresNoTime>	 I'm unable to deploy (cc brennen)
[20:01:18] <jeena>	 I can deploy
[20:01:19] <thcipriani>	 I can deploy
[20:01:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2038.codfw.wmnet with OS bullseye
[20:01:44] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye
[20:01:46] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2038.codfw.wmnet with OS bullseye
[20:01:47] <thcipriani>	 jeena: we're doing deployment training if'n you're interested in joining :)
[20:01:52] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[20:01:53] <jeena>	 okay sure
[20:02:10] <wikibugs>	 (03CR) 10Dr0ptp4kt: New files/templates for OpenStack Antelope (2023.1) (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri)
[20:03:11] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] gitlab: Remove swift configs and return gitlab1003 to restore group [puppet] - 10https://gerrit.wikimedia.org/r/953193 (owner: 10EoghanGaffney)
[20:05:36] <jeena>	 MatmaRex: I can start with yours if you're ready
[20:05:42] <MatmaRex>	 sure
[20:06:02] <jeena>	 arlolra: danisztls hi 
[20:06:07] <arlolra>	 hello
[20:06:12] <danisztls>	 hi
[20:06:32] <jeena>	 I'll continue with your patches after doing MatmaRex's
[20:06:40] <danisztls>	 ok
[20:07:02] <arlolra>	 thanks
[20:07:25] <jeena>	 actually I'll do the config patches first, sorry about that
[20:07:45] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye
[20:07:52] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors: -...
[20:07:59] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh6002.wikimedia.org with reason: host reimage
[20:08:39] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] WatchlistManager: Do not require watchlist rights for clearing talk page notification [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953660 (https://phabricator.wikimedia.org/T345031) (owner: 10Bartosz Dziewoński)
[20:09:14] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030.eqiad.wmnet']
[20:09:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950046 (https://phabricator.wikimedia.org/T336092) (owner: 10DDesouza)
[20:09:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158) (owner: 10DDesouza)
[20:10:18] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy Research Incentive survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950046 (https://phabricator.wikimedia.org/T336092) (owner: 10DDesouza)
[20:10:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Pre-deploy Campaigns Event Discovery survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158) (owner: 10DDesouza)
[20:10:51] <danisztls>	 I will have to rebase my other change
[20:10:54] <jeena>	 thanks
[20:11:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2039.codfw.wmnet with OS bullseye
[20:11:13] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye
[20:11:14] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2039.codfw.wmnet with OS bullseye
[20:11:21] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[20:11:34] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh6002.wikimedia.org with reason: host reimage
[20:11:40] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001"
[20:11:40] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:11:40] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wdqs1005.eqiad.wmnet
[20:13:19] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel)
[20:13:29] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2039.codfw.wmnet with OS bullseye
[20:13:33] <wikibugs>	 (03PS3) 10DDesouza: Pre-deploy Campaigns Event Discovery survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158)
[20:13:37] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye
[20:13:39] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2039.codfw.wmnet with OS bullseye
[20:13:46] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[20:14:20] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T345391 (10RKemper)
[20:14:39] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T345391 (10RKemper)
[20:16:24] <inflatador>	 !log 'bking@wdqs1004 depool wdqs1004 to test script changes T342361'
[20:16:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:30] <stashbot>	 T342361: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361
[20:17:45] <jeena>	 danisztls: you still want to deploy 954079 in this window, right?
[20:17:57] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase1030.eqiad.wmnet']
[20:18:05] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030.eqiad.wmnet']
[20:18:06] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission wdqs1005 - https://phabricator.wikimedia.org/T345391 (10RKemper)
[20:18:08] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission wdqs1005 - https://phabricator.wikimedia.org/T345391 (10RKemper)
[20:18:32] <danisztls>	 jeena: yes, if possible
[20:18:41] <danisztls>	 already rebased it
[20:18:43] <jeena>	 ok, just making sure
[20:18:47] <jeena>	 great
[20:19:01] <jeena>	 sorry I didn't notice!
[20:19:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158) (owner: 10DDesouza)
[20:19:18] <danisztls>	 np
[20:19:54] <wikibugs>	 (03Merged) 10jenkins-bot: Pre-deploy Campaigns Event Discovery survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158) (owner: 10DDesouza)
[20:20:09] <logmsgbot>	 !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:950046|Undeploy Research Incentive survey on enwiki (T336092)]], [[gerrit:954079|Pre-deploy Campaigns Event Discovery survey (T345158)]]
[20:20:16] <stashbot>	 T336092: Deploy Research Incentive Survey on English Wikipedia - https://phabricator.wikimedia.org/T336092
[20:20:16] <stashbot>	 T345158: Deploy QuickSurvey for Campaigns Event Discovery project - https://phabricator.wikimedia.org/T345158
[20:20:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2038.codfw.wmnet with OS bullseye
[20:20:48] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye
[20:20:50] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2038.codfw.wmnet with OS bullseye
[20:20:56] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[20:21:04] <wikibugs>	 (03PS1) 10Ebernhardson: Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126
[20:21:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126 (owner: 10Ebernhardson)
[20:21:50] <logmsgbot>	 !log jhuneidi@deploy1002 jhuneidi and dani: Backport for [[gerrit:950046|Undeploy Research Incentive survey on enwiki (T336092)]], [[gerrit:954079|Pre-deploy Campaigns Event Discovery survey (T345158)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:22:21] <jeena>	 danisztls: ready for you to do any checks on mwdebug before syncing
[20:22:21] <wikibugs>	 (03Merged) 10jenkins-bot: WatchlistManager: Do not require watchlist rights for clearing talk page notification [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953660 (https://phabricator.wikimedia.org/T345031) (owner: 10Bartosz Dziewoński)
[20:23:07] <wikibugs>	 (03PS1) 10Bking: Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - 10https://gerrit.wikimedia.org/r/953661
[20:23:12] <MatmaRex>	 jeena: i don't really want to test this in production with my IP address, i tested locally earlier though
[20:23:37] <danisztls>	 jeena: first change looks good
[20:23:42] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - 10https://gerrit.wikimedia.org/r/953661 (owner: 10Bking)
[20:23:54] <wikibugs>	 (03CR) 10Bking: [V: 03+2 C: 03+2] Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - 10https://gerrit.wikimedia.org/r/953661 (owner: 10Bking)
[20:23:58] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:23:59] <jeena>	 MatmaRex: 👍
[20:24:45] <wikibugs>	 (03PS2) 10Ebernhardson: Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126
[20:25:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126 (owner: 10Ebernhardson)
[20:25:40] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase1030.eqiad.wmnet']
[20:26:07] <jeena>	 danisztls: how about the second one?
[20:26:25] <wikibugs>	 (03PS3) 10Ebernhardson: Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126
[20:26:27] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030.eqiad.wmnet']
[20:26:35] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:26:40] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase1030.eqiad.wmnet']
[20:27:06] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030.eqiad.wmnet']
[20:27:14] <danisztls>	 jeena: not
[20:27:15] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase1030.eqiad.wmnet']
[20:27:41] <danisztls>	 possibly because messages haven't been created yet
[20:27:50] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[20:28:02] <jeena>	 is it okay to sync?
[20:28:03] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030.eqiad.wmnet']
[20:28:09] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:28:17] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase1030.eqiad.wmnet']
[20:28:48] <danisztls>	 yeah, coverage is 0
[20:28:56] <jeena>	 okay thanks
[20:29:07] <logmsgbot>	 !log jhuneidi@deploy1002 jhuneidi and dani: Continuing with sync
[20:29:41] <wikibugs>	 (03PS4) 10Ebernhardson: Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126
[20:32:11] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh6002.wikimedia.org with OS bookworm
[20:32:29] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host doh6002.wikimedia.org with OS bookworm completed: - doh6002 (**WARN**)   - Downtimed on Icinga/Al...
[20:34:29] <logmsgbot>	 !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:950046|Undeploy Research Incentive survey on enwiki (T336092)]], [[gerrit:954079|Pre-deploy Campaigns Event Discovery survey (T345158)]] (duration: 14m 19s)
[20:34:35] <stashbot>	 T336092: Deploy Research Incentive Survey on English Wikipedia - https://phabricator.wikimedia.org/T336092
[20:34:36] <stashbot>	 T345158: Deploy QuickSurvey for Campaigns Event Discovery project - https://phabricator.wikimedia.org/T345158
[20:34:43] <wikibugs>	 (03PS1) 10Andrew Bogott: cinder backups: move paws to cloudbackup2002; backup life to 10 days [puppet] - 10https://gerrit.wikimedia.org/r/954130
[20:34:45] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs-backup: support removal of unhandled image backups [puppet] - 10https://gerrit.wikimedia.org/r/954131
[20:35:06] <jeena>	 danisztls: all synced
[20:35:14] <jeena>	 MatmaRex: starting yours now
[20:35:27] <danisztls>	 jeena: thanks!
[20:36:12] <logmsgbot>	 !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:953660|WatchlistManager: Do not require watchlist rights for clearing talk page notification (T345031)]]
[20:36:18] <stashbot>	 T345031: New messages notification cannot be dismissed by unregistered users - https://phabricator.wikimedia.org/T345031
[20:36:21] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra)
[20:37:38] <logmsgbot>	 !log jhuneidi@deploy1002 jhuneidi and matmarex: Backport for [[gerrit:953660|WatchlistManager: Do not require watchlist rights for clearing talk page notification (T345031)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:37:55] <logmsgbot>	 !log jhuneidi@deploy1002 jhuneidi and matmarex: Continuing with sync
[20:39:21] <wikibugs>	 (03PS1) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874)
[20:42:17] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:43:14] <logmsgbot>	 !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:953660|WatchlistManager: Do not require watchlist rights for clearing talk page notification (T345031)]] (duration: 07m 01s)
[20:43:19] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts flink-zk2001.codfw.wmnet
[20:43:20] <stashbot>	 T345031: New messages notification cannot be dismissed by unregistered users - https://phabricator.wikimedia.org/T345031
[20:43:23] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:43:37] <jeena>	 MatmaRex: synced
[20:43:39] <MatmaRex>	 thanks jeena
[20:43:48] <wikibugs>	 (03CR) 10Ebernhardson: "PCC is failing for a real problem in the existing common.yaml. The problem is new zookeeper instances are being added and they have been d" [puppet] - 10https://gerrit.wikimedia.org/r/954126 (owner: 10Ebernhardson)
[20:44:12] <jeena>	 arlolra: still there?
[20:44:17] <arlolra>	 yup
[20:44:26] <arlolra>	 😅
[20:44:31] <jeena>	 👍 starting yours now
[20:44:37] <arlolra>	 thanks
[20:45:05] <jeena>	 I submitted it already to speed up a little
[20:45:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra)
[20:45:42] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for doh6002.wikimedia.org
[20:45:43] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for doh6002.wikimedia.org
[20:46:31] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall)
[20:46:56] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host doh5002.wikimedia.org with OS bookworm
[20:47:06] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host doh5002.wikimedia.org with OS bookworm
[20:47:28] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[20:50:38] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001"
[20:50:59] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:51:01] <wikibugs>	 (03Merged) 10jenkins-bot: Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra)
[20:51:14] <logmsgbot>	 !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:954049|Use metrics from SiteConfig to restore the Parsoid prefix (T339365)]]
[20:51:20] <stashbot>	 T339365: Fix Parsoid metrics - https://phabricator.wikimedia.org/T339365
[20:51:44] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001"
[20:51:44] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:51:45] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts flink-zk2001.codfw.wmnet
[20:52:07] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:52:38] <logmsgbot>	 !log jhuneidi@deploy1002 arlolra and jhuneidi: Backport for [[gerrit:954049|Use metrics from SiteConfig to restore the Parsoid prefix (T339365)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:53:02] <jeena>	 arlolra: ready for you to check on mwdebug
[20:53:10] <arlolra>	 alrighty
[20:53:21] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:54:47] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:55:40] <arlolra>	 jeena: looks good
[20:55:50] <jeena>	 👍
[20:55:55] <logmsgbot>	 !log jhuneidi@deploy1002 arlolra and jhuneidi: Continuing with sync
[20:57:21] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye
[20:57:27] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye
[21:00:24] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts flink-zk2003.codfw.wmnet
[21:01:18] <logmsgbot>	 !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:954049|Use metrics from SiteConfig to restore the Parsoid prefix (T339365)]] (duration: 10m 03s)
[21:01:25] <stashbot>	 T339365: Fix Parsoid metrics - https://phabricator.wikimedia.org/T339365
[21:01:52] <arlolra>	 thank you jeena 
[21:02:03] <jeena>	 you're welcome!
[21:02:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:02:55] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757 (nshahquinn-wmf) >>! In T158757#9133055, @jbond wrote: > Its worth noting that once services have been migrated to the new puppet7 infrastructure then agent certificates...
[21:04:42] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[21:06:13] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[21:07:17] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001"
[21:07:22] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1016 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:07:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:07:47] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cinder backups: move paws to cloudbackup2002; backup life to 10 days [puppet] - https://gerrit.wikimedia.org/r/954130 (owner: Andrew Bogott)
[21:08:18] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001"
[21:08:18] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:08:19] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts flink-zk2003.codfw.wmnet
[21:09:50] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:10:45] <wikibugs>	 (PS1) Bking: flink-zk: Move codfw hosts back to insetup [puppet] - https://gerrit.wikimedia.org/r/954134 (https://phabricator.wikimedia.org/T341792)
[21:11:00] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:13:29] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2037.codfw.wmnet with OS bullseye
[21:13:36] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2037.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[21:25:27] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye
[21:25:34] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors: -...
[21:35:07] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh5002.wikimedia.org with reason: host reimage
[21:38:24] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh5002.wikimedia.org with reason: host reimage
[22:13:10] <wikibugs>	 (PS2) Caenus: Deleting Ns:104 in itwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/952817 (https://phabricator.wikimedia.org/T298315)
[22:15:23] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:15:47] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:17:05] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh5002.wikimedia.org with OS bookworm
[22:17:16] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host doh5002.wikimedia.org with OS bookworm completed: - doh5002 (**PASS**)   - Downtimed on Icinga/Al...
[22:17:45] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[22:58:59] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (278306) = 12.7% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[23:04:42] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:10:54] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye
[23:11:01] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye
[23:15:15] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[23:15:15] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[23:16:26] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[23:20:47] <wikibugs>	 (CR) Cwhite: "Thanks for putting this together!" [puppet] - https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) (owner: Herron)
[23:25:35] <wikibugs>	 (PS1) Andrea Denisse: librenms: Add PHP version for Debian Bookworm hosts [puppet] - https://gerrit.wikimedia.org/r/954143 (https://phabricator.wikimedia.org/T344136)
[23:46:10] <wikibugs>	 (PS3) Tim Starling: Raise LoginNotify minimum log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/952564 (https://phabricator.wikimedia.org/T174200)
[23:53:08] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye
[23:53:15] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors: -...
[23:54:40] <wikibugs>	 (CR) Tim Starling: [C: +2] Raise LoginNotify minimum log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/952564 (https://phabricator.wikimedia.org/T174200) (owner: Tim Starling)
[23:55:22] <wikibugs>	 (Merged) jenkins-bot: Raise LoginNotify minimum log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/952564 (https://phabricator.wikimedia.org/T174200) (owner: Tim Starling)