[00:10:43] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reimage for host centrallog1002.eqiad.wmnet with OS bullseye
[00:10:58] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host centrallog1002.eqiad.wmnet with OS bullseye
[00:29:31] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: remove ecs 1.11.0-2 template [puppet] - https://gerrit.wikimedia.org/r/882781 (https://phabricator.wikimedia.org/T325806) (owner: Cwhite)
[00:31:37] <wikibugs>	 (CR) MSantos: [C: +1] maps: Add missing index script on import [puppet] - https://gerrit.wikimedia.org/r/883197 (owner: Jgiannelos)
[00:33:49] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: Add centrallog1002 as logsource for logstash tests [puppet] - https://gerrit.wikimedia.org/r/882762 (https://phabricator.wikimedia.org/T318778) (owner: Andrea Denisse)
[00:34:06] <wikibugs>	 (PS3) Acamicamacaraca: Add sandbox link to Serbo-Croatian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/883221 (https://phabricator.wikimedia.org/T327833)
[00:59:27] <wikibugs>	 (CR) Cwhite: [V: -1 C: -1] "See inline." [puppet] - https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[01:17:07] <legoktm>	 !log adjusting Gerrit group "Campaigns Team" so it is not recursively a member of itself
[01:17:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:47] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:39] <wikibugs>	 SRE, DNS, Traffic-Icebox, Wikimedia-Apache-configuration, Patch-For-Review: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (Sotiale) Hi. Langcom is discussing this and is wondering how we can respond to the existing interwiki content...
[01:55:15] <wikibugs>	 SRE, DNS, Traffic-Icebox, Wikimedia-Apache-configuration, Patch-For-Review: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (Ladsgroup) You can use global search: https://global-search.toolforge.org/?q=%22%5B%5B%3Aminnan%3A%22&namespa...
[02:07:47] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:47] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:43:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[03:51:24] <wikibugs>	 (PS3) Samtar: deployment-prep: update prometheus host to prometheus05 [puppet] - https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782)
[04:55:34] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:36] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:43:57] <wikibugs>	 (PS11) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[05:44:18] <wikibugs>	 (CR) CI reject: [V: -1] Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[05:53:00] <wikibugs>	 (PS12) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[05:56:55] <wikibugs>	 (CR) Stevemunene: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39240/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[06:00:47] <wikibugs>	 (PS1) Gergő Tisza: Enable WelcomeSurvey at viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/883301 (https://phabricator.wikimedia.org/T325376)
[06:10:44] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:17:47] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:10] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:25:18] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T0700)
[07:00:20] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:19:15] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (Marostegui) Thanks John. I am going to reclone this host.
[07:20:19] <wikibugs>	 (PS1) Marostegui: db1166: Disable notifcations [puppet] - https://gerrit.wikimedia.org/r/883304
[07:20:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166 to clone db1198', diff saved to https://phabricator.wikimedia.org/P43320 and previous config saved to /var/cache/conftool/dbconfig/20230125-072033-marostegui.json
[07:20:55] <wikibugs>	 (CR) Marostegui: [C: +2] db1166: Disable notifcations [puppet] - https://gerrit.wikimedia.org/r/883304 (owner: Marostegui)
[07:26:20] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:31:41] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 33
[07:31:56] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 33
[07:34:15] <logmsgbot>	 !log phedenskog@deploy1002 Started deploy [performance/navtiming@bfff15d]: (no justification provided)
[07:34:21] <logmsgbot>	 !log phedenskog@deploy1002 Finished deploy [performance/navtiming@bfff15d]: (no justification provided) (duration: 00m 05s)
[07:45:36] <wikibugs>	 (PS1) Marostegui: db1206,db1196: Switch sanitarium master [puppet] - https://gerrit.wikimedia.org/r/883496 (https://phabricator.wikimedia.org/T327859)
[07:46:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1206 to clone db1196 T327859', diff saved to https://phabricator.wikimedia.org/P43322 and previous config saved to /var/cache/conftool/dbconfig/20230125-074601-marostegui.json
[07:46:05] <stashbot>	 T327859: Switch s1 sanitarium master from db1206 to db1196 - https://phabricator.wikimedia.org/T327859
[07:46:18] <wikibugs>	 (CR) Marostegui: [C: +2] db1206,db1196: Switch sanitarium master [puppet] - https://gerrit.wikimedia.org/r/883496 (https://phabricator.wikimedia.org/T327859) (owner: Marostegui)
[07:48:38] <wikibugs>	 (PS1) Ayounsi: Disable Telemetry on eqsin switches [homer/public] - https://gerrit.wikimedia.org/r/883497 (https://phabricator.wikimedia.org/T316532)
[07:49:09] <wikibugs>	 (CR) CI reject: [V: -1] Disable Telemetry on eqsin switches [homer/public] - https://gerrit.wikimedia.org/r/883497 (https://phabricator.wikimedia.org/T316532) (owner: Ayounsi)
[07:49:18] <wikibugs>	 (PS1) Marostegui: db1196: Future sanitarium master [puppet] - https://gerrit.wikimedia.org/r/883498 (https://phabricator.wikimedia.org/T327859)
[07:49:35] <marostegui>	 !log Cloning db1196 from db1206 (lag will appear on s1 wiki replicas) T327859
[07:49:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:45] <wikibugs>	 (CR) Marostegui: [C: +2] db1196: Future sanitarium master [puppet] - https://gerrit.wikimedia.org/r/883498 (https://phabricator.wikimedia.org/T327859) (owner: Marostegui)
[07:50:12] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:50:24] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:51:22] <wikibugs>	 (PS2) Ayounsi: Disable Telemetry on eqsin switches [homer/public] - https://gerrit.wikimedia.org/r/883497 (https://phabricator.wikimedia.org/T316532)
[07:54:41] <wikibugs>	 (PS1) Marostegui: db1176: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883499 (https://phabricator.wikimedia.org/T327800)
[07:55:03] <wikibugs>	 (CR) Marostegui: [C: +2] db1176: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883499 (https://phabricator.wikimedia.org/T327800) (owner: Marostegui)
[07:59:57] <wikibugs>	 SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[08:00:03] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (ayounsi)
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T0800).
[08:00:05] <jouncebot>	 Aca: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:13] <Aca>	 Hello! Confirming my presence here
[08:01:47] <Amir1>	 let me check
[08:02:22] <wikibugs>	 (CR) Ladsgroup: [C: +2] Add sandbox link to Serbo-Croatian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/883221 (https://phabricator.wikimedia.org/T327833) (owner: Acamicamacaraca)
[08:02:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/883221 (https://phabricator.wikimedia.org/T327833) (owner: Acamicamacaraca)
[08:03:07] <wikibugs>	 (Merged) jenkins-bot: Add sandbox link to Serbo-Croatian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/883221 (https://phabricator.wikimedia.org/T327833) (owner: Acamicamacaraca)
[08:03:41] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:883221|Add sandbox link to Serbo-Croatian Wikipedia (T327833)]]
[08:03:45] <stashbot>	 T327833: Add sandbox link to Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T327833
[08:03:52] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1103 to x1 master [puppet] - https://gerrit.wikimedia.org/r/882785 (https://phabricator.wikimedia.org/T327861)
[08:03:57] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/883506 (https://phabricator.wikimedia.org/T327861)
[08:03:57] <Aca>	 Should I open the Debug tool now?
[08:04:47] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (ayounsi)
[08:05:34] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and aleksandar: Backport for [[gerrit:883221|Add sandbox link to Serbo-Croatian Wikipedia (T327833)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[08:07:10] <Aca>	 Working as expected. Link is shown in the user menu now.
[08:07:57] <Amir1>	 awesome moving forward
[08:10:36] <wikibugs>	 (PS1) Muehlenhoff: Extend group difference list for new mwdebuggers group [puppet] - https://gerrit.wikimedia.org/r/883500
[08:13:10] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the switchover date" [puppet] - https://gerrit.wikimedia.org/r/882785 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[08:13:13] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the switchover date" [dns] - https://gerrit.wikimedia.org/r/883506 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[08:13:55] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:883221|Add sandbox link to Serbo-Croatian Wikipedia (T327833)]] (duration: 10m 13s)
[08:13:59] <stashbot>	 T327833: Add sandbox link to Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T327833
[08:14:17] <wikibugs>	 (CR) Marostegui: [C: +1] "I am removing the grants from the DB now. So feel free to merge this as you wish" [puppet] - https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[08:17:16] <Amir1>	 Aca: It's done
[08:17:17] <Aca>	 The change is live now. Thanks, Amir1!
[08:20:06] <wikibugs>	 (CR) Elukey: [C: +1] changeprop: use wmf-certificates instead of puppet_ca_crt (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/883240 (owner: Hnowlan)
[08:26:01] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for rpcbind [puppet] - https://gerrit.wikimedia.org/r/881386 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:31:52] <wikibugs>	 (PS2) Muehlenhoff: Enable profile::auto_restarts::service for nfs-idmapd [puppet] - https://gerrit.wikimedia.org/r/881393 (https://phabricator.wikimedia.org/T135991)
[08:34:47] <wikibugs>	 SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[08:34:55] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for nfs-idmapd [puppet] - https://gerrit.wikimedia.org/r/881393 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:39:21] <wikibugs>	 (PS2) Muehlenhoff: Enable profile::auto_restarts::service for blkmapd [puppet] - https://gerrit.wikimedia.org/r/881399 (https://phabricator.wikimedia.org/T135991)
[08:40:25] <XioNoX>	 !log bump SGIX max prefix limit
[08:40:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:49] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for blkmapd [puppet] - https://gerrit.wikimedia.org/r/881399 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:44:58] <wikibugs>	 (PS2) Muehlenhoff: Enable profile::auto_restarts::service for nfs.mountd [puppet] - https://gerrit.wikimedia.org/r/881413 (https://phabricator.wikimedia.org/T135991)
[08:51:22] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Use mgmt_junos on all network devices - https://phabricator.wikimedia.org/T327862 (Peachey88)
[08:53:25] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for nfs.mountd [puppet] - https://gerrit.wikimedia.org/r/881413 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:59:17] <wikibugs>	 SRE, Infrastructure-Foundations: MIgrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (MoritzMuehlenhoff)
[08:59:19] <wikibugs>	 (PS1) Giuseppe Lavagetto: sre-mediawiki: add mean latency alerts [alerts] - https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544)
[08:59:57] <wikibugs>	 (PS13) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[08:59:58] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (Peachey88)
[09:00:04] <jouncebot>	 brennen and jnuche: gettimeofday() says it's time for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T0900)
[09:00:50] <wikibugs>	 (CR) Ayounsi: [C: +2] "Awesome! Thanks" [puppet] - https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: Filippo Giunchedi)
[09:01:08] <wikibugs>	 (CR) CI reject: [V: -1] sre-mediawiki: add mean latency alerts [alerts] - https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) (owner: Giuseppe Lavagetto)
[09:10:11] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:10:11] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:11:26] <wikibugs>	 (PS14) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[09:14:51] <wikibugs>	 (CR) Stevemunene: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39242/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[09:15:38] <wikibugs>	 (CR) Deni: [C: +1] "Looks good to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/883222 (https://phabricator.wikimedia.org/T327864) (owner: Acamicamacaraca)
[09:30:29] <Emperor>	 !log rolling depool & update of thanos front-ends T327871
[09:30:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:20] <wikibugs>	 (PS2) Muehlenhoff: puppetdb: No longer use the component on booworm [puppet] - https://gerrit.wikimedia.org/r/881598
[09:35:39] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db1196 to sanitarium master in s1 [puppet] - https://gerrit.wikimedia.org/r/883527 (https://phabricator.wikimedia.org/T327859)
[09:35:57] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:06] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1196 to sanitarium master in s1 [puppet] - https://gerrit.wikimedia.org/r/883527 (https://phabricator.wikimedia.org/T327859) (owner: Marostegui)
[09:38:48] <wikibugs>	 (PS1) Marostegui: db1206: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883528
[09:39:17] <wikibugs>	 (CR) Marostegui: [C: +2] db1206: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883528 (owner: Marostegui)
[09:39:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: After recloning', diff saved to https://phabricator.wikimedia.org/P43325 and previous config saved to /var/cache/conftool/dbconfig/20230125-093918-root.json
[09:40:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 1%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43326 and previous config saved to /var/cache/conftool/dbconfig/20230125-094029-root.json
[09:42:14] <wikibugs>	 (PS1) Marostegui: db1166: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883529
[09:42:33] <wikibugs>	 (CR) Marostegui: [C: +2] db1166: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883529 (owner: Marostegui)
[09:46:12] <wikibugs>	 (CR) Muehlenhoff: [C: +2] puppetdb: No longer use the component on booworm [puppet] - https://gerrit.wikimedia.org/r/881598 (owner: Muehlenhoff)
[09:47:07] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:47:16] <wikibugs>	 (PS1) Filippo Giunchedi: Revert "wdqs: add recording rule for req success ratio" [puppet] - https://gerrit.wikimedia.org/r/883223
[09:49:19] <icinga-wm>	 PROBLEM - DPKG on thanos-fe1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:49:29] <wikibugs>	 (CR) CI reject: [V: -1] Revert "wdqs: add recording rule for req success ratio" [puppet] - https://gerrit.wikimedia.org/r/883223 (owner: Filippo Giunchedi)
[09:49:51] <icinga-wm>	 PROBLEM - Thanos swift https on thanos-fe1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Thanos
[09:49:59] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] Revert "wdqs: add recording rule for req success ratio" [puppet] - https://gerrit.wikimedia.org/r/883223 (owner: Filippo Giunchedi)
[09:51:09] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:23] <godog>	 ryankemper herron FYI ^ reverted the wdqs slo rules, due to invalid syntax, sorry ATM I don't have time to look into it further
[09:52:45] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:52:57] <wikibugs>	 (PS1) Marostegui: db1196: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883535 (https://phabricator.wikimedia.org/T327859)
[09:53:07] <icinga-wm>	 RECOVERY - Thanos swift https on thanos-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Thanos
[09:53:24] <wikibugs>	 (CR) Marostegui: [C: +2] db1196: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883535 (https://phabricator.wikimedia.org/T327859) (owner: Marostegui)
[09:54:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 5%: After recloning', diff saved to https://phabricator.wikimedia.org/P43327 and previous config saved to /var/cache/conftool/dbconfig/20230125-095400-root.json
[09:54:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: After recloning', diff saved to https://phabricator.wikimedia.org/P43328 and previous config saved to /var/cache/conftool/dbconfig/20230125-095423-root.json
[09:55:03] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/883249 (owner: Slyngshede)
[09:55:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 5%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43329 and previous config saved to /var/cache/conftool/dbconfig/20230125-095534-root.json
[09:58:17] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/pywmflib] - https://gerrit.wikimedia.org/r/883247 (owner: Volans)
[09:58:51] <wikibugs>	 (CR) Volans: [C: +2] setup.py: force a newer sphinx_rtd_theme [software/pywmflib] - https://gerrit.wikimedia.org/r/883247 (owner: Volans)
[09:59:36] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:59:38] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:00:42] <icinga-wm>	 PROBLEM - Thanos swift https on thanos-fe1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.046 second response time https://wikitech.wikimedia.org/wiki/Thanos
[10:03:05] <wikibugs>	 (Merged) jenkins-bot: setup.py: force a newer sphinx_rtd_theme [software/pywmflib] - https://gerrit.wikimedia.org/r/883247 (owner: Volans)
[10:03:08] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1003 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:10] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:48] <icinga-wm>	 RECOVERY - Thanos swift https on thanos-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Thanos
[10:07:51] <wikibugs>	 (PS1) Volans: setup.py: force a newer sphinx_rtd_theme [software/spicerack] - https://gerrit.wikimedia.org/r/883538
[10:08:14] <wikibugs>	 (PS1) Jelto: KubernetesAPIErrorRate: make alert v1.23 compatible [alerts] - https://gerrit.wikimedia.org/r/883539 (https://phabricator.wikimedia.org/T322919)
[10:08:27] <wikibugs>	 (CR) Hnowlan: [C: +2] changeprop: use wmf-certificates instead of puppet_ca_crt [deployment-charts] - https://gerrit.wikimedia.org/r/883240 (owner: Hnowlan)
[10:09:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: After recloning', diff saved to https://phabricator.wikimedia.org/P43330 and previous config saved to /var/cache/conftool/dbconfig/20230125-100904-root.json
[10:09:10] <wikibugs>	 (PS1) Volans: setup.py: force a newer sphinx_rtd_theme [software/cumin] - https://gerrit.wikimedia.org/r/883540
[10:09:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: After recloning', diff saved to https://phabricator.wikimedia.org/P43331 and previous config saved to /var/cache/conftool/dbconfig/20230125-100928-root.json
[10:10:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43332 and previous config saved to /var/cache/conftool/dbconfig/20230125-101039-root.json
[10:14:43] <wikibugs>	 (Merged) jenkins-bot: changeprop: use wmf-certificates instead of puppet_ca_crt [deployment-charts] - https://gerrit.wikimedia.org/r/883240 (owner: Hnowlan)
[10:18:27] <elukey>	 hnowlan: \o/
[10:18:34] <wikibugs>	 ops-codfw, ops-eqiad: Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (ayounsi)
[10:19:38] <icinga-wm>	 RECOVERY - DPKG on thanos-fe1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[10:21:17] <wikibugs>	 ops-codfw, ops-eqiad: Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (ayounsi) p:Triage→Low
[10:22:01] <wikibugs>	 (CR) Jbond: [C: +1] D:apereo_cas::service: Map memberOf to OIDC (1 comment) [puppet] - https://gerrit.wikimedia.org/r/883249 (owner: Slyngshede)
[10:24:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: After recloning', diff saved to https://phabricator.wikimedia.org/P43333 and previous config saved to /var/cache/conftool/dbconfig/20230125-102409-root.json
[10:24:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: After recloning', diff saved to https://phabricator.wikimedia.org/P43334 and previous config saved to /var/cache/conftool/dbconfig/20230125-102433-root.json
[10:25:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43335 and previous config saved to /var/cache/conftool/dbconfig/20230125-102544-root.json
[10:36:41] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:37:49] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:38:41] <wikibugs>	 (CR) Elukey: "John I changed the name of the CAs to reflect more that liftwing == ml-serve k8s clusters (after chatting with Janis). Lemme know if the +" [puppet] - https://gerrit.wikimedia.org/r/883167 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[10:39:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: After recloning', diff saved to https://phabricator.wikimedia.org/P43336 and previous config saved to /var/cache/conftool/dbconfig/20230125-103914-root.json
[10:39:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: After recloning', diff saved to https://phabricator.wikimedia.org/P43337 and previous config saved to /var/cache/conftool/dbconfig/20230125-103938-root.json
[10:39:56] <wikibugs>	 SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Marostegui) Talked to John about it, we'll try to get it done this week :)
[10:40:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43338 and previous config saved to /var/cache/conftool/dbconfig/20230125-104049-root.json
[10:41:36] <wikibugs>	 SRE, Patch-For-Review, Performance-Team (Radar): unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet - https://phabricator.wikimedia.org/T239862 (Clement_Goubert) >>! In T239862#8504801, @LSobanski wrote: > @Joe's patch mentioned above has been merged in Feb 2021 and the hardcoded IP conf...
[10:43:43] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply
[10:47:21] <wikibugs>	 SRE, DBA, Data-Persistence, Data-Persistence-Backup, and 2 others: Data check es2020 after replication broke - https://phabricator.wikimedia.org/T327770 (jcrespo) Open→In progress
[10:47:33] <wikibugs>	 SRE, DBA, Data-Persistence, Data-Persistence-Backup, and 2 others: Data check es2020 after replication broke - https://phabricator.wikimedia.org/T327770 (jcrespo) I "documented" how I did it in case it is useful and for sanity check: {P43339}
[10:48:40] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[10:48:55] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply
[10:49:33] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[10:49:43] <wikibugs>	 SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (hnowlan)
[10:51:45] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:54:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: After recloning', diff saved to https://phabricator.wikimedia.org/P43340 and previous config saved to /var/cache/conftool/dbconfig/20230125-105419-root.json
[10:54:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: After recloning', diff saved to https://phabricator.wikimedia.org/P43341 and previous config saved to /var/cache/conftool/dbconfig/20230125-105443-root.json
[10:54:46] <hnowlan>	 !log restarting lvs on lvs2010 for thumbor healthcheck change 
[10:54:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43342 and previous config saved to /var/cache/conftool/dbconfig/20230125-105554-root.json
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T1100)
[11:00:41] <hnowlan>	 !log restarting lvs on lvs1010 for thumbor healthcheck change 
[11:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:52] <hnowlan>	 !log restarting lvs on lvs1020 for thumbor healthcheck change 
[11:00:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:57] <hnowlan>	 sigh
[11:02:50] <wikibugs>	 (PS1) Sergio Gimeno: User impact: ammend incorrect  parameter for the single day streak text [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/883547 (https://phabricator.wikimedia.org/T327824)
[11:08:24] <hnowlan>	 !log restarting lvs on lvs2009 for thumbor healthcheck change 
[11:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: After recloning', diff saved to https://phabricator.wikimedia.org/P43343 and previous config saved to /var/cache/conftool/dbconfig/20230125-110924-root.json
[11:10:05] <wikibugs>	 (PS1) Sergio Gimeno: User impact: ammend incorrect  parameter for the single day streak text [extensions/GrowthExperiments] (wmf/1.40.0-wmf.19) - https://gerrit.wikimedia.org/r/883548 (https://phabricator.wikimedia.org/T327824)
[11:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[11:11:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43344 and previous config saved to /var/cache/conftool/dbconfig/20230125-111059-root.json
[11:11:48] <wikibugs>	 (Abandoned) Sergio Gimeno: User impact: ammend incorrect  parameter for the single day streak text [extensions/GrowthExperiments] (wmf/1.40.0-wmf.19) - https://gerrit.wikimedia.org/r/883548 (https://phabricator.wikimedia.org/T327824) (owner: Sergio Gimeno)
[11:12:09] <hnowlan>	 !log restarting lvs on lvs1019 for thumbor healthcheck change 
[11:12:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:54] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[11:19:45] <wikibugs>	 (PS2) Kosta Harlan: User impact: amend incorrect parameter for the single day streak text [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/883547 (https://phabricator.wikimedia.org/T327824) (owner: Sergio Gimeno)
[11:20:14] <Lucas_WMDE>	 jouncebot: now
[11:20:14] <jouncebot>	 For the next 0 hour(s) and 39 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T1100)
[11:20:36] <Lucas_WMDE>	 I’d like to run a maintenance script for Wikidata – shout if I shouldn’t, otherwise I’ll go ahead in 5 minutes or so :)
[11:21:23] <wikibugs>	 (PS1) Jakob: REST: Use error log level for unexpected errors [extensions/Wikibase] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490)
[11:22:06] <wikibugs>	 (CR) Silvan Heintze: [C: +1] "thanks!" [extensions/Wikibase] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: Jakob)
[11:25:59] <Lucas_WMDE>	 ok, I’m starting the maintenance script for T325942
[11:25:59] <stashbot>	 T325942: Christmas 2022 wbs_propertypairs table update on Wikidata - https://phabricator.wikimedia.org/T325942
[11:26:06] <Lucas_WMDE>	 shouldn’t take too long, a few minutes according to the docs
[11:26:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Jclark-ctr)
[11:27:18] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[11:29:16] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops, User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (Jclark-ctr) @Papaul is there a day that you can assist me with making network changes to complete this?
[11:30:47] <wikibugs>	 (PS5) Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] - https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799)
[11:31:19] <wikibugs>	 (PS6) Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] - https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799)
[11:31:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host scandium.eqiad.wmnet
[11:32:54] <wikibugs>	 (CR) Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster (2 comments) [dns] - https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) (owner: Btullis)
[11:32:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Jclark-ctr) Open→Resolved This is complete @BTullis
[11:33:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (BTullis) Great. Many thanks.
[11:34:45] <Lucas_WMDE>	 !log Updated the Wikidata property suggester with data from 20230102's JSON dump (T325942)
[11:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:49] <stashbot>	 T325942: Christmas 2022 wbs_propertypairs table update on Wikidata - https://phabricator.wikimedia.org/T325942
[11:34:49] * Lucas_WMDE done
[11:37:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host scandium.eqiad.wmnet
[11:38:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testreduce1001.eqiad.wmnet
[11:41:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1001.eqiad.wmnet
[11:53:08] <wikibugs>	 (CR) Clément Goubert: [V: +1] "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/883552 (https://phabricator.wikimedia.org/T327756) (owner: Clément Goubert)
[11:56:07] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, just make sure that the generated data is indeed what expected when deploying it." [dns] - https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) (owner: Btullis)
[11:58:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install1004.wikimedia.org
[11:58:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:05:30] <wikibugs>	 (CR) Volans: "couple of comments inline" [dns] - https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756) (owner: Clément Goubert)
[12:10:28] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): REST: Use error log level for unexpected errors (1 comment) [extensions/Wikibase] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: Jakob)
[12:12:01] <moritzm>	 !log installing libtasn security updates on buster
[12:12:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install1004.wikimedia.org - jmm@cumin2002"
[12:13:19] <wikibugs>	 (CR) Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster (1 comment) [dns] - https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) (owner: Btullis)
[12:13:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install1004.wikimedia.org - jmm@cumin2002"
[12:13:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:13:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install1004.wikimedia.org on all recursors
[12:13:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install1004.wikimedia.org on all recursors
[12:14:38] <wikibugs>	 (Abandoned) Muehlenhoff: Split Swift cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/883227 (https://phabricator.wikimedia.org/T327783) (owner: Muehlenhoff)
[12:15:00] <wikibugs>	 SRE, SRE-Access-Requests: Add new SSH key for Santhosh Thottingal for production access - https://phabricator.wikimedia.org/T327891 (santhosh)
[12:15:04] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye
[12:15:10] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye
[12:19:42] <wikibugs>	 (PS2) Clément Goubert: wmnet: Rename aux-k8s-ingress service to k8s-ingress-aux [dns] - https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756)
[12:20:08] <wikibugs>	 (CR) Clément Goubert: wmnet: Rename aux-k8s-ingress service to k8s-ingress-aux (2 comments) [dns] - https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756) (owner: Clément Goubert)
[12:23:03] <wikibugs>	 SRE, SRE-Access-Requests: Add new SSH key for Santhosh Thottingal for production access - https://phabricator.wikimedia.org/T327891 (Clement_Goubert) Open→In progress p:Triage→Medium a:Clement_Goubert
[12:27:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install1004.wikimedia.org
[12:32:30] <wikibugs>	 (PS2) Muehlenhoff: Split Swift cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783)
[12:33:36] <wikibugs>	 (PS9) Btullis: Rename ceph profiles to cloudceph [puppet] - https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945)
[12:33:52] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: update staging image with new nltk dependency [deployment-charts] - https://gerrit.wikimedia.org/r/883559
[12:34:20] <wikibugs>	 (CR) CI reject: [V: -1] Split Swift cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) (owner: Muehlenhoff)
[12:35:00] <wikibugs>	 (CR) Stevemunene: [V: +1] Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (2 comments) [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[12:37:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install2004.wikimedia.org
[12:37:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:41:50] <moritzm>	 !log restarting slapd on r/w servers to pick up new libtasn
[12:41:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:51] <wikibugs>	 (PS10) Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. [software/spicerack] - https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277)
[12:42:54] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Clement_Goubert) FYI this created warnings in `cross-validate-accounts`, CR incoming.
[12:42:57] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: service=thanos-web,name=thanos-fe1002.eqiad.wmnet
[12:43:04] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: service=thanos-web,name=thanos-fe1003.eqiad.wmnet
[12:43:46] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: service=thanos-web,name=thanos-fe2002.codfw.wmnet
[12:43:50] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: service=thanos-web,name=thanos-fe2003.codfw.wmnet
[12:44:10] <wikibugs>	 (CR) Clément Goubert: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/883562 (https://phabricator.wikimedia.org/T305979) (owner: Clément Goubert)
[12:44:14] <godog>	 FWIW I did the above because thanos-web is behind SSO and only one host can be pooled at a time (SSO sessions are not shared)
[12:44:17] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (MoritzMuehlenhoff) >>! In T305979#8557028, @Clement_Goubert wrote: > FYI this created warnings in `cross-valid...
[12:44:20] <godog>	 bummer I know :(
[12:45:14] <moritzm>	 !log restarting Exim on MXes to pick up new libtasn
[12:45:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:21] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:47:59] <wikibugs>	 (CR) Clément Goubert: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/883561 (https://phabricator.wikimedia.org/T327891) (owner: Clément Goubert)
[12:49:02] <wikibugs>	 (PS3) Muehlenhoff: Split Swift cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783)
[12:50:01] <wikibugs>	 (CR) Clément Goubert: [C: +1] Extend group difference list for new mwdebuggers group [puppet] - https://gerrit.wikimedia.org/r/883500 (owner: Muehlenhoff)
[12:50:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:50:32] <wikibugs>	 (Abandoned) Clément Goubert: openldap: Add mwdebuggers to cross-validate-accounts [puppet] - https://gerrit.wikimedia.org/r/883562 (https://phabricator.wikimedia.org/T305979) (owner: Clément Goubert)
[12:50:54] <wikibugs>	 (CR) CI reject: [V: -1] Split Swift cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) (owner: Muehlenhoff)
[12:51:51] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Clement_Goubert) >>! In T305979#8557029, @MoritzMuehlenhoff wrote: >>>! In T305979#8557028, @Clement_Goubert w...
[12:53:07] <wikibugs>	 (CR) Jbond: [C: +1] profile::pki::root_ca: add new intermediates for liftwing (1 comment) [puppet] - https://gerrit.wikimedia.org/r/883167 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[12:53:10] <wikibugs>	 (CR) Herron: "Thx for addressing this, will take a closer look" [puppet] - https://gerrit.wikimedia.org/r/883223 (owner: Filippo Giunchedi)
[12:54:01] <logmsgbot>	 !log jnuche@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9
[12:54:23] <logmsgbot>	 !log jnuche@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 00m 21s)
[12:54:59] <jbond>	 !log disable puppet fleet wide to deploy gerrit:883233
[12:55:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:02] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Extend group difference list for new mwdebuggers group [puppet] - https://gerrit.wikimedia.org/r/883500 (owner: Muehlenhoff)
[12:57:02] <wikibugs>	 (PS1) Hnowlan: imagemagick: use JSON output from exiftool [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/883564 (https://phabricator.wikimedia.org/T327887)
[12:59:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install2004.wikimedia.org - jmm@cumin2002"
[12:59:33] <wikibugs>	 (CR) Jbond: [C: +2] Puppetfile: order puppet file and add some addtional notes [puppet] - https://gerrit.wikimedia.org/r/883232 (owner: Jbond)
[12:59:37] <wikibugs>	 (CR) Jbond: [C: +2] augeas_core: add augeas core module to the vendor modules [puppet] - https://gerrit.wikimedia.org/r/883233 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[13:00:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install2004.wikimedia.org - jmm@cumin2002"
[13:00:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:00:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install2004.wikimedia.org on all recursors
[13:00:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install2004.wikimedia.org on all recursors
[13:04:35] <jbond>	 !log enable puppet fleet wide to post deploy gerrit:883233
[13:04:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:55] <jbond>	 !log puppet now using vendored version of augeas-core https://gerrit.wikimedia.org/r/c/operations/puppet/+/883233
[13:04:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:40] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4037.ulsfo.wmnet with OS bullseye
[13:11:45] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye executed with errors: - cp4037 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[13:14:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install2004.wikimedia.org
[13:17:11] <wikibugs>	 (CR) DCausse: [C: +1] dse-k8s: add rdf-streaming-updater namespace [puppet] - https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: Bking)
[13:17:16] <wikibugs>	 SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Jclark-ctr) Pulled drive, will advise when it can be reinserted
[13:18:21] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:18:29] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:19:19] <wikibugs>	 (PS1) Ayounsi: Stop using profile::contact [puppet] - https://gerrit.wikimedia.org/r/883565
[13:20:54] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] openstack::haproxy::site: don't provision backend FW rules [puppet] - https://gerrit.wikimedia.org/r/868070 (owner: Majavah)
[13:21:42] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/883565 (owner: Ayounsi)
[13:22:26] <wikibugs>	 (PS1) Muehlenhoff: prospector: Allow longer variable names [cookbooks] - https://gerrit.wikimedia.org/r/883566
[13:26:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install3002.wikimedia.org
[13:26:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:27:41] <icinga-wm>	 PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[13:27:44] <volans>	 hashar: any maintenance ongoing on gerrit? seems down for several of us
[13:27:52] <volans>	 hello icinga-wm, just in time
[13:27:58] <volans>	 cc slyngs, _joe_ 
[13:28:49] <icinga-wm>	 PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[13:29:08] <slyngs>	 volans: Looking
[13:30:17] <icinga-wm>	 RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 67102 bytes in 0.043 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[13:30:23] <wikibugs>	 (CR) Ayounsi: "The error doesn't seem to be related to this change, but to the ping1002 decom." [puppet] - https://gerrit.wikimedia.org/r/883565 (owner: Ayounsi)
[13:30:47] <icinga-wm>	 RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 01 Mar 2023 09:47:05 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[13:30:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install3002.wikimedia.org - jmm@cumin2002"
[13:31:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install3002.wikimedia.org - jmm@cumin2002"
[13:31:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:31:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install3002.wikimedia.org on all recursors
[13:31:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install3002.wikimedia.org on all recursors
[13:32:16] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/883565 (owner: Ayounsi)
[13:32:39] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/883565 (owner: Ayounsi)
[13:32:47] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:34:21] <wikibugs>	 (CR) Volans: [C: -1] "That's not the correct regex" [cookbooks] - https://gerrit.wikimedia.org/r/883566 (owner: Muehlenhoff)
[13:34:36] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on db1206 is CRITICAL: communication 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T327902 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[13:34:40] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T327902 (ops-monitoring-bot)
[13:35:02] <wikibugs>	 (CR) Volans: "thanks for the fix!" [dns] - https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756) (owner: Clément Goubert)
[13:35:24] <wikibugs>	 (CR) Ayounsi: [C: +2] Stop using profile::contact [puppet] - https://gerrit.wikimedia.org/r/883565 (owner: Ayounsi)
[13:36:09] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T327902 (Marostegui)
[13:36:10] <wikibugs>	 SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Marostegui)
[13:36:13] <wikibugs>	 (PS2) Muehlenhoff: prospector: Allow longer variable names [cookbooks] - https://gerrit.wikimedia.org/r/883566
[13:36:16] <wikibugs>	 (CR) Muehlenhoff: prospector: Allow longer variable names (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/883566 (owner: Muehlenhoff)
[13:38:15] <wikibugs>	 (CR) Muehlenhoff: "Shouldn't we also remove profile::contact itself?" [puppet] - https://gerrit.wikimedia.org/r/883565 (owner: Ayounsi)
[13:38:43] <wikibugs>	 SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Marostegui) @Volans @MoritzMuehlenhoff so the task about the degraded RAID gets created correctly (T327902). It would be nice to get the usual output...
[13:38:47] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T327902 (Marostegui) Open→Declined This is part of a test T325046
[13:38:49] <wikibugs>	 SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Marostegui)
[13:39:10] <wikibugs>	 (CR) Jakob: REST: Use error log level for unexpected errors (1 comment) [extensions/Wikibase] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: Jakob)
[13:39:14] <wikibugs>	 SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (MoritzMuehlenhoff) >>! In T325046#8557233, @Marostegui wrote: > @Volans @MoritzMuehlenhoff so the task about the degraded RAID gets created correctly...
[13:42:30] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: nova: metadata: allow haproxy backend connections [puppet] - https://gerrit.wikimedia.org/r/883571
[13:46:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install3002.wikimedia.org
[13:48:16] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops, User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (Papaul) @Jclark-ctr does this Friday 9:30 am CT work for you? Also, before that, can you update the task with the exact rack location where the servers are moving?
[13:49:36] <wikibugs>	 SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Marostegui) Thanks Moritz, do you need the disk to be left out?
[13:49:44] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/883571/39244/" [puppet] - https://gerrit.wikimedia.org/r/883571 (owner: Arturo Borrero Gonzalez)
[13:50:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install4002.wikimedia.org
[13:50:17] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host install4002.wikimedia.org
[13:51:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install4002.wikimedia.org
[13:51:10] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host install4002.wikimedia.org
[13:51:43] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "(good to deploy as far as I’m concerned)" [extensions/Wikibase] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: 10Jakob)
[13:52:57] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openstack: nova: metadata: allow haproxy backend connections [puppet] - 10https://gerrit.wikimedia.org/r/883571 (owner: 10Arturo Borrero Gonzalez)
[13:56:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) We chatted on IRC and we are leaving the disk in a failed state for now until @MoritzMuehlenhoff is done with his tests.
[13:56:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10MoritzMuehlenhoff) >>! In T325046#8557292, @Marostegui wrote: > Thanks Moritz, do you need the disk to be left out?   Yeah, let's keep it for a few d...
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T1400).
[14:00:05] <jouncebot>	 jakob_WMDE, MichaelG_WMDE, Aca, and Sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:08] <Aca>	 Confirming my presence. Hello once again! :)
[14:00:09] * MichaelG_WMDE waves 👋
[14:00:13] <urbanecm>	 I can deploy today!
[14:00:15] <sergi0_>	 hi
[14:00:16] <urbanecm>	 hello everyone
[14:00:21] <jakob_WMDE>	 hi!
[14:00:22] <urbanecm>	 hey sergi0_
[14:00:29] <urbanecm>	 hi jakob_WMDE!
[14:00:36] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] REST: Use error log level for unexpected errors [extensions/Wikibase] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: 10Jakob)
[14:00:39] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] User impact: amend incorrect parameter for the single day streak text [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883547 (https://phabricator.wikimedia.org/T327824) (owner: 10Sergio Gimeno)
[14:01:04] <wikibugs>	 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10TheDJ) The above patch should cause native lazy loading of images by the browser. This w...
[14:01:23] <urbanecm>	 MichaelG_WMDE: hi, are you around? :) or will jakob_WMDE handle the config patch too?
[14:01:36] <MichaelG_WMDE>	 I'm around :)
[14:01:45] <MichaelG_WMDE>	 And I think we both are able to do so
[14:01:54] <MichaelG_WMDE>	 but we should do jakob's change first
[14:02:09] <Lucas_WMDE>	 I’m busy for now, can deploy later if needed
[14:02:17] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "id is free (https://sh.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces), otherwise LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883222 (https://phabricator.wikimedia.org/T327864) (owner: 10Acamicamacaraca)
[14:02:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883222 (https://phabricator.wikimedia.org/T327864) (owner: 10Acamicamacaraca)
[14:02:49] <urbanecm>	 MichaelG_WMDE: so, the config change depends on the backport?
[14:03:08] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Draft namespace on Serbo-Croatian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883222 (https://phabricator.wikimedia.org/T327864) (owner: 10Acamicamacaraca)
[14:03:08] <MichaelG_WMDE>	 technically no
[14:03:31] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:883222|Enable Draft namespace on Serbo-Croatian Wikipedia (T327864)]]
[14:03:34] <urbanecm>	 okay
[14:03:35] <stashbot>	 T327864: Enable Draft namespace on Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T327864
[14:03:36] <MichaelG_WMDE>	 the backport raises error logging-level for the API endpoint that is enabled by the config change
[14:03:46] <urbanecm>	 gotcha
[14:04:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install5002.wikimedia.org
[14:05:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:05:20] <logmsgbot>	 !log urbanecm@deploy1002 aleksandar and urbanecm: Backport for [[gerrit:883222|Enable Draft namespace on Serbo-Croatian Wikipedia (T327864)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[14:05:34] <urbanecm>	 Aca: your change is available for testing at mwdebug1001. can you check?
[14:05:40] <Aca>	 Yep. On it
[14:07:34] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye
[14:07:40] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye
[14:08:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5002.wikimedia.org - jmm@cumin2002"
[14:08:54] <Aca>	 Working as expected. Draft namespace is now identified as a separate namespace in the lists and on the page info.
[14:09:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5002.wikimedia.org - jmm@cumin2002"
[14:09:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:09:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install5002.wikimedia.org on all recursors
[14:09:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install5002.wikimedia.org on all recursors
[14:10:08] <urbanecm>	 Aca: thanks, syncing
[14:10:50] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update staging image with new nltk dependency [deployment-charts] - 10https://gerrit.wikimedia.org/r/883559 (owner: 10Ilias Sarantopoulos)
[14:16:15] <wikibugs>	 (03CR) 10Ottomata: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[14:16:21] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:16:30] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:883222|Enable Draft namespace on Serbo-Croatian Wikipedia (T327864)]] (duration: 12m 59s)
[14:16:34] <stashbot>	 T327864: Enable Draft namespace on Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T327864
[14:16:38] <urbanecm>	 Aca: your change should be live
[14:16:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: 10Jakob)
[14:16:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883547 (https://phabricator.wikimedia.org/T327824) (owner: 10Sergio Gimeno)
[14:17:13] <Aca>	 Yeah, it is. Thank you! Have a nice day y'all!
[14:17:33] <urbanecm>	 you too!
[14:18:00] <wikibugs>	 (03Merged) 10jenkins-bot: REST: Use error log level for unexpected errors [extensions/Wikibase] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: 10Jakob)
[14:20:43] <wikibugs>	 (03Merged) 10jenkins-bot: User impact: amend incorrect parameter for the single day streak text [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883547 (https://phabricator.wikimedia.org/T327824) (owner: 10Sergio Gimeno)
[14:21:10] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:883224|REST: Use error log level for unexpected errors (T327490)]], [[gerrit:883547|User impact: amend incorrect parameter for the single day streak text (T327824)]]
[14:21:15] <stashbot>	 T327490: Create an easy way to observe/monitor Wikibase REST API errors happening on Wikidata - https://phabricator.wikimedia.org/T327490
[14:21:16] <stashbot>	 T327824: [wmf-20] testwiki - ext-growthExperiments-ScoreCards__scorecard__info for one day streak shows $3  - https://phabricator.wikimedia.org/T327824
[14:21:17] <urbanecm>	 finally :)
[14:21:50] <jakob_WMDE>	 urbanecm: fyi as MichaelG_WMDE said the wikibase backport is a trivial patch to raise the log level for REST API errors. not possible to verify on its own, but we'll see whether everything works once we flip the config switch
[14:21:58] <urbanecm>	 gotcha
[14:22:10] <urbanecm>	 I'll just deploy it together with sergi0_'s backport
[14:22:14] <wikibugs>	 (03PS3) 10BBlack: Possibly mitigate ATS bug with semicolon in Path [puppet] - 10https://gerrit.wikimedia.org/r/882663 (https://phabricator.wikimedia.org/T238285)
[14:23:15] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:23:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install5002.wikimedia.org
[14:24:07] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad: Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (10RobH) Please note all 5 of these hosts are old HP ProLiants.  I'm not sure where this setting is on these hosts, but I'm assuming in the bios and each of these will require downtime/reboot to disable....
[14:24:55] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad, 10Data-Persistence, 10cloud-services-team (Hardware): Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (10RobH)
[14:25:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install6002.wikimedia.org
[14:25:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:27:05] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Possibly mitigate ATS bug with semicolon in Path [puppet] - 10https://gerrit.wikimedia.org/r/882663 (https://phabricator.wikimedia.org/T238285) (owner: 10BBlack)
[14:28:39] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage
[14:29:00] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[14:29:08] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad, 10Data-Persistence, 10cloud-services-team (Hardware): Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (10ayounsi) Note that it's on their IPMI/ILO interfaces, not sure they need to go down.
[14:29:10] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[14:29:20] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[14:29:27] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[14:29:37] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[14:29:47] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[14:29:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install6002.wikimedia.org - jmm@cumin2002"
[14:29:55] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[14:30:01] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS bullseye
[14:30:07] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye
[14:30:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install6002.wikimedia.org - jmm@cumin2002"
[14:30:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:30:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install6002.wikimedia.org on all recursors
[14:30:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install6002.wikimedia.org on all recursors
[14:32:38] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage
[14:32:44] <wikibugs>	 (03PS1) 10Jbond: idp - standalone: Add local hacks [puppet] - 10https://gerrit.wikimedia.org/r/883580
[14:34:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] idp - standalone: Add local hacks [puppet] - 10https://gerrit.wikimedia.org/r/883580 (owner: 10Jbond)
[14:35:16] <urbanecm>	 scap takes an eternity...
[14:36:29] <wikibugs>	 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) I did a second rclone run on 24 Jan, hoping that entries in that list that weren't in the 23 Jan list would be enlightening. Extracting the copy list as before, then: ` join...
[14:39:58] <logmsgbot>	 !log urbanecm@deploy1002 jakob and sgimeno and urbanecm: Backport for [[gerrit:883224|REST: Use error log level for unexpected errors (T327490)]], [[gerrit:883547|User impact: amend incorrect parameter for the single day streak text (T327824)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[14:40:03] <urbanecm>	 finally
[14:40:04] <stashbot>	 T327490: Create an easy way to observe/monitor Wikibase REST API errors happening on Wikidata - https://phabricator.wikimedia.org/T327490
[14:40:04] <stashbot>	 T327824: [wmf-20] testwiki - ext-growthExperiments-ScoreCards__scorecard__info for one day streak shows $3  - https://phabricator.wikimedia.org/T327824
[14:40:13] <urbanecm>	 sergi0_: can you check at mwdebug1001?
[14:40:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080, an-worker1084, an-worker1086 - https://phabricator.wikimedia.org/T326127 (10BTullis) @Jclark-ctr - sorry to trouble you, but do you know when you might be able to replace the batteries in these three hosts?...
[14:41:00] <sergi0_>	 sure
[14:42:50] <sergi0_>	 idk what did not go well but I can't see the text change :(
[14:44:14] <urbanecm>	 ah, it's an i18n change...
[14:44:17] <sergi0_>	 anything specific to check for an i18n change like this one?
[14:44:19] <urbanecm>	 ...that i'll need to do a full scap sync
[14:44:22] <urbanecm>	 which takes an hour
[14:44:39] <urbanecm>	 well, what can i do :)
[14:44:48] <urbanecm>	 i'll proceed now, and then do a full scap afterwards to put it into effect
[14:44:50] <sergi0_>	 Sorry, should have warned you before
[14:44:58] <urbanecm>	 no worries, i should've noticed
[14:45:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install6002.wikimedia.org
[14:45:28] <wikibugs>	 (03CR) 10Michael Große: [C: 03+1] Enable the Wikibase REST API on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882615 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große)
[14:47:40] <MichaelG_WMDE>	 I only just now noticed that the change still had my -1 from Monday. We had the required meeting yesterday where it was green-lit and so it is good to go :)
[14:50:40] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4045.ulsfo.wmnet with reason: host reimage
[14:52:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Add new install servers [puppet] - 10https://gerrit.wikimedia.org/r/883581 (https://phabricator.wikimedia.org/T327867)
[14:53:06] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4045.ulsfo.wmnet with reason: host reimage
[14:53:31] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:883224|REST: Use error log level for unexpected errors (T327490)]], [[gerrit:883547|User impact: amend incorrect parameter for the single day streak text (T327824)]] (duration: 32m 21s)
[14:53:36] <stashbot>	 T327490: Create an easy way to observe/monitor Wikibase REST API errors happening on Wikidata - https://phabricator.wikimedia.org/T327490
[14:53:36] <stashbot>	 T327824: [wmf-20] testwiki - ext-growthExperiments-ScoreCards__scorecard__info for one day streak shows $3  - https://phabricator.wikimedia.org/T327824
[14:53:36] <urbanecm>	 finally!
[14:53:37] <wikibugs>	 (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/883582
[14:53:39] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster
[14:53:44] <urbanecm>	 my longest scap backport ever so far
[14:54:01] <wikibugs>	 (03PS2) 10Urbanecm: Enable the Wikibase REST API on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882615 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große)
[14:54:11] <urbanecm>	 doing the config patch now
[14:54:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882615 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große)
[14:54:22] * MichaelG_WMDE is here for it :)
[14:55:02] <wikibugs>	 (03Merged) 10jenkins-bot: Enable the Wikibase REST API on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882615 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große)
[14:55:27] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:882615|Enable the Wikibase REST API on Wikidata (T324999)]]
[14:55:32] <stashbot>	 T324999: configure Wikibase REST API on Wikidata - https://phabricator.wikimedia.org/T324999
[14:57:19] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4037.ulsfo.wmnet with OS bullseye
[14:57:20] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and migr: Backport for [[gerrit:882615|Enable the Wikibase REST API on Wikidata (T324999)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[14:57:25] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye completed: - cp4037 (**PASS**)   - Removed from Puppet and PuppetDB if present   -...
[14:57:30] <urbanecm>	 MichaelG_WMDE: your patch is at mwdebug1001, can you test?
[14:57:46] <MichaelG_WMDE>	 I'll have a look!
[14:58:08] <jakob_WMDE>	 working for me! :)
[14:58:20] <MichaelG_WMDE>	 works for me, too!
[14:58:29] <urbanecm>	 so, let's sync?
[14:58:35] <jakob_WMDE>	 yes \o/
[14:58:41] <urbanecm>	 doing!
[14:58:41] <MichaelG_WMDE>	 yep, let's go
[14:59:31] <wikibugs>	 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Configure cloudsw-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) 05Open→03In progress p:05Triage→03Medium
[14:59:58] <wikibugs>	 (03PS1) 10Stevemunene: Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783)
[15:00:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[15:01:01] <urbanecm>	 !log Overrunning B&C window
[15:01:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:20] <wikibugs>	 (03PS2) 10Stevemunene: Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783)
[15:02:17] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet,service=cdn
[15:02:17] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet,service=ats-be
[15:04:11] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:882615|Enable the Wikibase REST API on Wikidata (T324999)]] (duration: 08m 43s)
[15:04:14] <wikibugs>	 (03CR) 10Btullis: "I think that you also need to make similar changes in hieradata/role/common/analytics_cluster/coordinator.yaml for an-coord1001" [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[15:04:15] <stashbot>	 T324999: configure Wikibase REST API on Wikidata - https://phabricator.wikimedia.org/T324999
[15:04:19] <urbanecm>	 MichaelG_WMDE: and it's live :)
[15:04:24] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: triggering i18n refresh for T327824
[15:04:24] <wikibugs>	 (03PS2) 10Jbond: idp - standalone: Add local hacks [puppet] - 10https://gerrit.wikimedia.org/r/883580
[15:04:27] <stashbot>	 T327824: [wmf-20] testwiki - ext-growthExperiments-ScoreCards__scorecard__info for one day streak shows $3  - https://phabricator.wikimedia.org/T327824
[15:04:31] <urbanecm>	 doing the i18n rebuild now
[15:04:56] <MichaelG_WMDE>	 yay!
[15:05:22] <urbanecm>	 fyi sergi0_ ^^
[15:05:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp - standalone: Add local hacks [puppet] - 10https://gerrit.wikimedia.org/r/883580 (owner: 10Jbond)
[15:05:31] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] idp - standalone: Add local hacks [puppet] - 10https://gerrit.wikimedia.org/r/883580 (owner: 10Jbond)
[15:05:51] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh)
[15:06:09] <wikibugs>	 (03CR) 10Btullis: Tuning Presto Config for scaling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[15:07:12] <wikibugs>	 (03PS2) 10Muehlenhoff: Add new install servers [puppet] - 10https://gerrit.wikimedia.org/r/883581 (https://phabricator.wikimedia.org/T327867)
[15:07:25] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2031.codfw.wmnet with OS bullseye
[15:07:32] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2031.codfw.wmnet with OS bullseye
[15:07:45] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39245/console" [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[15:08:04] <sergi0_>	 thank you urbanecm!
[15:08:11] <urbanecm>	 no problem
[15:08:21] <urbanecm>	 it's quicker than i'd expect it to be so far
[15:09:02] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad, 10Data-Persistence, 10cloud-services-team (Hardware): Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (10RobH) I've disabled the multicast discovery on the ilom interface for db1140 as a test to see if it stops the netbios port broadcasts from the...
[15:09:29] <wikibugs>	 (03PS1) 10Muehlenhoff: Rename installserver role [puppet] - 10https://gerrit.wikimedia.org/r/883587
[15:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[15:10:37] <wikibugs>	 (03CR) 10Stevemunene: Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[15:10:44] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[15:11:10] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[15:11:34] <wikibugs>	 (03CR) 10Stevemunene: Tuning Presto Config for scaling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[15:12:22] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: triggering i18n refresh for T327824 (duration: 07m 57s)
[15:12:25] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:12:26] <stashbot>	 T327824: [wmf-20] testwiki - ext-growthExperiments-ScoreCards__scorecard__info for one day streak shows $3  - https://phabricator.wikimedia.org/T327824
[15:12:35] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff)
[15:12:57] <urbanecm>	 sergi0_: and it's live now. can you test please (in prod)?
[15:13:06] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[15:13:06] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
[15:13:50] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1003.eqiad.wmnet
[15:14:06] <sergi0_>	 doing
[15:14:22] <sergi0_>	 all good
[15:14:57] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4045.ulsfo.wmnet with OS bullseye
[15:14:57] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:15:08] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye completed: - cp4045 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[15:16:20] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[15:17:28] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4045.ulsfo.wmnet,service=cdn
[15:17:28] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4045.ulsfo.wmnet,service=ats-be
[15:18:27] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh)
[15:18:33] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[15:19:10] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1003.eqiad.wmnet
[15:20:07] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:21:25] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye
[15:21:34] <wikibugs>	 (03PS1) 10Jgreen: Switch payments.wikimedia.org to codfw [dns] - 10https://gerrit.wikimedia.org/r/883590
[15:21:39] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye
[15:23:11] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster
[15:25:23] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) p:05Triage→03Medium
[15:25:39] <wikibugs>	 (03PS2) 10Muehlenhoff: Rename installserver role [puppet] - 10https://gerrit.wikimedia.org/r/883587
[15:25:51] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Switch payments.wikimedia.org to codfw [dns] - 10https://gerrit.wikimedia.org/r/883590 (owner: 10Jgreen)
[15:27:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff)
[15:28:10] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] admin: New SSH key for santhosh [puppet] - 10https://gerrit.wikimedia.org/r/883561 (https://phabricator.wikimedia.org/T327891) (owner: 10Clément Goubert)
[15:28:12] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff)
[15:28:50] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add new SSH key for Santhosh Thottingal for production access - https://phabricator.wikimedia.org/T327891 (10Clement_Goubert) Got out of band confirmation through slack, proceeding.
[15:28:56] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] admin: New SSH key for santhosh [puppet] - 10https://gerrit.wikimedia.org/r/883561 (https://phabricator.wikimedia.org/T327891) (owner: 10Clément Goubert)
[15:29:06] <wikibugs>	 (03CR) 10Btullis: Tuning Presto Config for scaling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[15:29:50] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1001.eqiad.wmnet
[15:30:39] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add new SSH key for Santhosh Thottingal for production access - https://phabricator.wikimedia.org/T327891 (10Clement_Goubert) The new key has been pushed, please allow for 30 minutes from this post for it to be deployed. Feel free to reopen the task if you ex...
[15:30:51] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add new SSH key for Santhosh Thottingal for production access - https://phabricator.wikimedia.org/T327891 (10Clement_Goubert) 05In progress→03Resolved
[15:33:03] <wikibugs>	 (03PS3) 10Clément Goubert: wmnet: Rename aux-k8s-ingress service to k8s-ingress-aux [dns] - 10https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756)
[15:33:25] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2031.codfw.wmnet with OS bullseye
[15:33:32] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2031.codfw.wmnet with OS bullseye executed with errors: - cp2031 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[15:33:50] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2031.codfw.wmnet with OS bullseye
[15:33:55] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2031.codfw.wmnet with OS bullseye
[15:33:59] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1001.eqiad.wmnet
[15:36:17] <wikibugs>	 (03CR) 10Stevemunene: Tuning Presto Config for scaling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[15:36:24] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/883566 (owner: 10Muehlenhoff)
[15:37:41] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) (owner: 10Btullis)
[15:38:27] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10cmooney) Folks I was considering doing these upgrades on the following dates:  cloudsw1-c8-eqiad - Monday February...
[15:38:52] <wikibugs>	 (03PS1) 10Jelto: sre.gitlab.upgrade: use all=True parameter to disable pagination [cookbooks] - 10https://gerrit.wikimedia.org/r/883593 (https://phabricator.wikimedia.org/T323569)
[15:38:56] <papaul>	 !log ongoing maintenance on fasw-c-eqiad
[15:38:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:00] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) (owner: 10Btullis)
[15:39:10] <wikibugs>	 (03PS7) 10Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799)
[15:39:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080, an-worker1084, an-worker1086 - https://phabricator.wikimedia.org/T326127 (10Jclark-ctr) Raid batteries are swapped and powering on now.   Thank you for your patience  an-worker1080.eqiad.wmnet an-worker1084...
[15:41:08] <wikibugs>	 (03CR) 10Volans: "When merging this change a parallel change to the following cookbooks is also needed:" [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff)
[15:41:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi)
[15:43:20] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2031.codfw.wmnet with OS bullseye
[15:43:26] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2031.codfw.wmnet with OS bullseye executed with errors: - cp2031 (**FAIL**)   - Removed from Puppet and PuppetDB if p...
[15:44:03] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2031.codfw.wmnet']
[15:44:11] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2031.codfw.wmnet']
[15:45:25] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2031']
[15:45:32] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2031']
[15:45:52] <wikibugs>	 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Configure cloudsw-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) The plan outlined in the task description LGTM.
[15:46:00] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2031']
[15:46:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero)
[15:46:49] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ayounsi)
[15:47:01] <wikibugs>	 (03PS3) 10Stevemunene: Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783)
[15:47:36] <jinxer-wm>	 (Emergency syslog message) firing: Alert for device fasw-c-eqiad.mgmt.eqiad.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[15:47:39] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4038.ulsfo.wmnet with OS bullseye
[15:47:45] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors: - cp4038 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[15:48:06] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye
[15:48:13] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye
[15:48:35] <icinga-wm>	 PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[15:49:24] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39246/console" [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[15:49:25] <icinga-wm>	 PROBLEM - Router interfaces on pfw3-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 34, down: 10, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:49:57] <icinga-wm>	 PROBLEM - BGP status on pfw3-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Idle - PyBal, AS64600/IPv4: Idle - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:50:29] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good. One nit inline about whitespace. If you can resolve that, then feel free to merge." [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[15:52:36] <jinxer-wm>	 (Emergency syslog message) resolved: Device fasw-c-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[15:53:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney)
[15:53:28] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[15:55:55] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) I'll check our db-related hosts and I'll get back to you tomorrow
[15:56:11] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts ['cp2031']
[15:56:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2031']
[15:56:32] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2031']
[15:57:27] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad, 10Data-Persistence, 10cloud-services-team (Hardware): Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (10RobH) Arzhel linked to some docs and commented netbios is called wins in HP ilom, and I had noticed the wins enablement under IPv4 so disabled...
[15:58:28] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/883593 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[15:58:31] <icinga-wm>	 RECOVERY - Router interfaces on pfw3-eqiad is OK: OK: host 208.80.154.219, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:58:37] <icinga-wm>	 RECOVERY - Host fasw-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms
[15:59:03] <icinga-wm>	 RECOVERY - BGP status on pfw3-eqiad is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:00:16] <wikibugs>	 (03PS1) 10Jbond: motd: add parameter to pass through messages to motd [puppet] - 10https://gerrit.wikimedia.org/r/883595
[16:01:49] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39247/console" [puppet] - 10https://gerrit.wikimedia.org/r/883595 (owner: 10Jbond)
[16:02:48] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10fnegri) I think those dates are fine, cc @dcaro -- let's discuss the best way to reduce impact on Ceph (downtime,...
[16:02:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (10Papaul)
[16:03:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (10Papaul)
[16:03:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (10Papaul) 05Open→03Resolved This is complete.
[16:03:42] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[16:04:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080, an-worker1084, an-worker1086 - https://phabricator.wikimedia.org/T326127 (10BTullis) 05Open→03Resolved Many thanks, all good now.
[16:04:30] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering: Check BBU on an-worker1080, an-worker1084, and an-worker1086 - https://phabricator.wikimedia.org/T325984 (10BTullis) 05Open→03Resolved
[16:04:38] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster
[16:05:52] <wikibugs>	 (03PS1) 10Jbond: sre-sandbox: Add warning message about reaper [puppet] - 10https://gerrit.wikimedia.org/r/883596 (https://phabricator.wikimedia.org/T247517)
[16:05:56] <wikibugs>	 (03PS4) 10Stevemunene: Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783)
[16:06:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (10ayounsi) 05Open→03Resolved All done!
[16:06:59] <wikibugs>	 10SRE, 10Traffic, 10Traffic-Icebox, 10WMF-General-or-Unknown, and 2 others: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10BBlack) With the merge above, I think this issue is at least mitigated for now.  It's not...
[16:07:03] <wikibugs>	 (03CR) 10Btullis: Tuning Presto Config for scaling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[16:08:10] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage
[16:08:51] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1002.eqiad.wmnet
[16:09:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] motd: add parameter to pass through messages to motd [puppet] - 10https://gerrit.wikimedia.org/r/883595 (owner: 10Jbond)
[16:09:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre-sandbox: Add warning message about reaper [puppet] - 10https://gerrit.wikimedia.org/r/883596 (https://phabricator.wikimedia.org/T247517) (owner: 10Jbond)
[16:09:46] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2031.codfw.wmnet with OS bullseye
[16:09:52] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2031.codfw.wmnet with OS bullseye
[16:11:17] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage
[16:12:28] <wikibugs>	 (03CR) 10Stevemunene: Tuning Presto Config for scaling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[16:14:05] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1002.eqiad.wmnet
[16:14:21] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[16:15:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] prospector: Allow longer variable names [cookbooks] - 10https://gerrit.wikimedia.org/r/883566 (owner: 10Muehlenhoff)
[16:20:18] <wikibugs>	 (03PS1) 10Muehlenhoff: perccli: Print human-readable topology information on disk failure [puppet] - 10https://gerrit.wikimedia.org/r/883600 (https://phabricator.wikimedia.org/T325046)
[16:20:42] <wikibugs>	 (03PS2) 10Muehlenhoff: perccli: Print human-readable topology information on disk failure [puppet] - 10https://gerrit.wikimedia.org/r/883600 (https://phabricator.wikimedia.org/T325046)
[16:20:57] <wikibugs>	 (03PS4) 10Muehlenhoff: Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783)
[16:24:08] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6010.drmrs.wmnet with OS bullseye
[16:24:16] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6010.drmrs.wmnet with OS bullseye
[16:28:09] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2031.codfw.wmnet with reason: host reimage
[16:31:16] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] perccli: Print human-readable topology information on disk failure [puppet] - 10https://gerrit.wikimedia.org/r/883600 (https://phabricator.wikimedia.org/T325046) (owner: 10Muehlenhoff)
[16:32:22] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2031.codfw.wmnet with reason: host reimage
[16:32:33] <icinga-wm>	 PROBLEM - IPMI Sensor Status on an-worker1080 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:32:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (10cmooney) p:05Triage→03Low
[16:33:16] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4038.ulsfo.wmnet with OS bullseye
[16:33:22] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye completed: - cp4038 (**PASS**)   - Removed from Puppet and PuppetDB if present   -...
[16:34:46] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4038.ulsfo.wmnet,service=cdn
[16:34:46] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4038.ulsfo.wmnet,service=ats-be
[16:35:01] <wikibugs>	 (03PS1) 10Jbond: motd: allow coloured messages and use red for sre-sandbox [puppet] - 10https://gerrit.wikimedia.org/r/883602
[16:36:09] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] wdqs: add recording rule for req success ratio (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper)
[16:36:58] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "Revert "wdqs: add recording rule for req success ratio"" [puppet] - 10https://gerrit.wikimedia.org/r/883610
[16:37:15] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] dse-k8s: add rdf-streaming-updater namespace [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking)
[16:37:26] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064)
[16:37:43] <wikibugs>	 (03PS2) 10Jbond: motd: allow coloured messages and use red for sre-sandbox [puppet] - 10https://gerrit.wikimedia.org/r/883602
[16:38:39] <wikibugs>	 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10PatchDemoBot) Test wiki **created** on [[ https://patchdemo.wmflabs.org | Patch demo ]]...
[16:38:45] <wikibugs>	 (03PS3) 10Jbond: motd: allow coloured messages and use red for sre-sandbox [puppet] - 10https://gerrit.wikimedia.org/r/883602 (https://phabricator.wikimedia.org/T247517)
[16:38:53] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39249/console" [puppet] - 10https://gerrit.wikimedia.org/r/883602 (https://phabricator.wikimedia.org/T247517) (owner: 10Jbond)
[16:39:09] <wikibugs>	 (03PS3) 10Ryan Kemper: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064)
[16:39:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] motd: allow coloured messages and use red for sre-sandbox [puppet] - 10https://gerrit.wikimedia.org/r/883602 (https://phabricator.wikimedia.org/T247517) (owner: 10Jbond)
[16:40:30] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10Volans) This task was brought to my attention by @ssingh today because `cp4037` did the same. It was reimaged first around `12:15` and it failed, and...
[16:41:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper)
[16:41:34] <wikibugs>	 (03CR) 10Dzahn: "thanks! Noted that "mediawiki-testers" has been removed, but seems like that is already gone here too" [puppet] - 10https://gerrit.wikimedia.org/r/883500 (owner: 10Muehlenhoff)
[16:41:57] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad, 10Data-Persistence, 10cloud-services-team (Hardware): Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (10RobH) 05Open→03Resolved a:03RobH
[16:43:18] <wikibugs>	 (03PS1) 10Jbond: motd: use colored message [puppet] - 10https://gerrit.wikimedia.org/r/883604 (https://phabricator.wikimedia.org/T247517)
[16:43:18] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6010.drmrs.wmnet with reason: host reimage
[16:44:15] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Dzahn) Thanks @Clement_Goubert and @Muehlenhoff for follow-ups.
[16:44:19] <wikibugs>	 (03CR) 10RLazarus: wdqs: add recording rule for req success ratio (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper)
[16:44:37] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39250/console" [puppet] - 10https://gerrit.wikimedia.org/r/883604 (https://phabricator.wikimedia.org/T247517) (owner: 10Jbond)
[16:44:43] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] motd: use colored message [puppet] - 10https://gerrit.wikimedia.org/r/883604 (https://phabricator.wikimedia.org/T247517) (owner: 10Jbond)
[16:46:21] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6010.drmrs.wmnet with reason: host reimage
[16:46:21] <wikibugs>	 (03PS1) 10Hnowlan: fluent-bit: install wmf-certificates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883605
[16:46:47] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "thank you, Manuel" [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn)
[16:48:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "on our ticket we also had "archive the database" but is that a thing? I am not sure we actually drop the DB or what it entails. Probably i" [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn)
[16:49:40] <wikibugs>	 10SRE, 10Cloud-VPS (Project-requests), 10Patch-For-Review, 10cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (10jbond) >>! In T247517#8533090, @herron wrote: >>>! In T247517#8211187, @jbond wrote: >>  * did the emails informing  @herro...
[16:50:56] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka jumbo-eqiad cluster: Reboot kafka nodes
[16:51:40] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: remove grants and settings for racktables db (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn)
[16:51:50] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2031.codfw.wmnet with OS bullseye
[16:51:55] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2031.codfw.wmnet with OS bullseye completed: - cp2031 (**PASS**)   - Removed from Puppet and PuppetDB if present   -...
[16:54:07] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10ssingh) Thanks for the response @Volans!  >>! In T327812#8558090, @Volans wrote: > This task was brought to my attention by @ssingh today because `cp...
[16:54:28] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "sounds good to me. backed up = archived  to me :) let's do that" [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn)
[16:57:19] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::pki::root_ca: add new intermediates for liftwing [puppet] - 10https://gerrit.wikimedia.org/r/883167 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[16:57:30] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=cdn
[16:57:31] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=ats-be
[16:58:03] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10Volans) >>! In T327812#8558155, @ssingh wrote: > I am surprised, so the above output is for cp4037? Because we certainly didn't reboot it and in any...
[16:58:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[16:59:03] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:03:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:04:42] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "late vote -- thanks for adding this!" [puppet] - 10https://gerrit.wikimedia.org/r/883596 (https://phabricator.wikimedia.org/T247517) (owner: 10Jbond)
[17:05:15] <wikibugs>	 (03CR) 10Hashar: "I have missed John last notifications, we chatted a bit today and aim at deploying this patch on Thursday" [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar)
[17:06:52] <wikibugs>	 (03PS1) 10Btullis: Reduce the presto task concurrency from 48 to 32 [puppet] - 10https://gerrit.wikimedia.org/r/883628 (https://phabricator.wikimedia.org/T323783)
[17:07:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (10aborrero) I guess this was set up to mirror the eqiad setting. Since this VLAN has no room in the new network model (described [[ https://wikitech.wikimedia.org/wiki/Wiki...
[17:07:33] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Reduce the presto task concurrency from 48 to 32 [puppet] - 10https://gerrit.wikimedia.org/r/883628 (https://phabricator.wikimedia.org/T323783) (owner: 10Btullis)
[17:07:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (10aborrero)
[17:08:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (10cmooney) p:05Triage→03Medium
[17:09:27] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10aborrero) >>! In T316544#8557796, @cmooney wrote: > Folks I was considering doing these upgrades on the following...
[17:10:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero)
[17:13:33] <wikibugs>	 (03PS1) 10Elukey: pki: Add public certificates for mlserve clusters' intermediates [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767)
[17:14:14] <wikibugs>	 (03CR) 10Elukey: "Already committed the key pem files to the private repo :)" [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[17:15:11] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:16:30] <wikibugs>	 (03PS1) 10Elukey: Add new fake pems for the mlserve's pki intermediates [labs/private] - 10https://gerrit.wikimedia.org/r/883632 (https://phabricator.wikimedia.org/T327767)
[17:18:59] <wikibugs>	 (03PS2) 10Elukey: pki: Add public certs and config for mlserve clusters' intermediates [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767)
[17:20:17] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:20:29] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:20:29] <wikibugs>	 (03PS1) 10Dzahn: remove racktables.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/883634 (https://phabricator.wikimedia.org/T327405)
[17:23:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:24:17] <wikibugs>	 (03PS1) 10Jgreen: Switch payments.wikimedia.org back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/883635
[17:24:32] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: remove puppet_ca_crt references [deployment-charts] - 10https://gerrit.wikimedia.org/r/883636
[17:26:25] <wikibugs>	 (03CR) 10Herron: "following up from IRC: "<ryankemper> herron: taking a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/883223/2/modules/profil" [puppet] - 10https://gerrit.wikimedia.org/r/883223 (owner: 10Filippo Giunchedi)
[17:26:58] <wikibugs>	 (03PS4) 10Herron: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper)
[17:28:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:29:01] <wikibugs>	 (03CR) 10Herron: "can confirm /usr/bin/thanos tools rules-check --rules recording_rules.yaml passes with these updated recording rules (no longer throws "ma" [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper)
[17:29:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper)
[17:29:55] <wikibugs>	 (03PS2) 10Dzahn: remove racktables.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/883634 (https://phabricator.wikimedia.org/T327405)
[17:30:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] remove racktables.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/883634 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn)
[17:32:37] <mutante>	 !log removing racktables.wikimedia.org from DNS - that's it for this ancient service T327405
[17:32:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:42] <stashbot>	 T327405: Decommission Racktables - https://phabricator.wikimedia.org/T327405
[17:32:53] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] fluent-bit: install wmf-certificates (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883605 (owner: 10Hnowlan)
[17:34:45] <wikibugs>	 (03PS2) 10Hnowlan: fluent-bit: install wmf-certificates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883605
[17:35:41] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Switch payments.wikimedia.org back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/883635 (owner: 10Jgreen)
[17:35:51] <wikibugs>	 (03PS2) 10Jgreen: Switch payments.wikimedia.org back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/883635
[17:36:02] <wikibugs>	 (03CR) 10Jgreen: [V: 03+2] Switch payments.wikimedia.org back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/883635 (owner: 10Jgreen)
[17:41:04] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10ssingh) I just went through the logs now:  ` Timestamp       = 2023-01-25 14:07:50 Message         = The server power action is initiated because the...
[17:43:16] <wikibugs>	 (03PS1) 10Btullis: Increase the presto cluster size to 15 hosts again [puppet] - 10https://gerrit.wikimedia.org/r/883642 (https://phabricator.wikimedia.org/T323783)
[17:45:05] <wikibugs>	 (03PS2) 10Btullis: Increase the presto cluster size to 15 hosts again [puppet] - 10https://gerrit.wikimedia.org/r/883642 (https://phabricator.wikimedia.org/T323783)
[17:45:46] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Increase the presto cluster size to 15 hosts again [puppet] - 10https://gerrit.wikimedia.org/r/883642 (https://phabricator.wikimedia.org/T323783) (owner: 10Btullis)
[17:47:31] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:53] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:55] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:57] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:25] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:27] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:31] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:48:49] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:53:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[17:55:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) p:05Triage→03Medium
[17:56:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney)
[17:58:39] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney)
[17:58:42] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6010.drmrs.wmnet with OS bullseye
[17:58:48] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6010.drmrs.wmnet with OS bullseye completed: - cp6010 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T1800)
[18:00:26] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (10cmooney) Thanks for the feedback @aborrero.  I'll plan on getting it decommissioned.
[18:05:55] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6010.drmrs.wmnet
[18:07:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney)
[18:10:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney)
[18:10:59] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[18:11:01] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[18:11:09] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[18:11:19] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[18:12:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:14:33] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[18:14:54] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6002.drmrs.wmnet with OS bullseye
[18:15:01] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6002.drmrs.wmnet with OS bullseye
[18:17:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:25:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:32:37] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[18:33:45] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[18:33:59] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[18:34:35] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6002.drmrs.wmnet with reason: host reimage
[18:35:17] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[18:37:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:37:39] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6002.drmrs.wmnet with reason: host reimage
[18:42:11] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10RKemper)
[18:42:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[18:45:14] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10RKemper)
[18:48:01] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:50:37] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10LSobanski)
[18:59:07] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6002.drmrs.wmnet with OS bullseye
[19:00:05] <jouncebot>	 brennen and jnuche: How many deployers does it take to do Train log triage with CPT deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T1900).
[19:00:05] <jouncebot>	 brennen and jnuche: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T1900).
[19:00:17] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reimage for host centrallog1002.eqiad.wmnet with OS bullseye
[19:00:31] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host centrallog1002.eqiad.wmnet with OS bullseye
[19:00:34] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6002.drmrs.wmnet with OS bullseye completed: - cp6002 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[19:01:11] <brennen>	 o/
[19:01:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[19:01:48] <brennen>	 !log 1.40.0-wmf.20 train (T325583): no blockers, rolling to group1.
[19:01:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:52] <stashbot>	 T325583: 1.40.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T325583
[19:02:06] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883647 (https://phabricator.wikimedia.org/T325583)
[19:02:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883647 (https://phabricator.wikimedia.org/T325583) (owner: 10TrainBranchBot)
[19:02:37] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10cmooney) >>! In T316544#8558224, @aborrero wrote: >>>! In T316544#8557796, @cmooney wrote: >> Folks I was consider...
[19:02:46] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883647 (https://phabricator.wikimedia.org/T325583) (owner: 10TrainBranchBot)
[19:02:48] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10cmooney)
[19:04:39] <icinga-wm>	 RECOVERY - IPMI Sensor Status on an-worker1080 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:06:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[19:06:46] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6002.drmrs.wmnet
[19:07:17] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[19:09:33] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:10:00] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.20  refs T325583
[19:10:04] <stashbot>	 T325583: 1.40.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T325583
[19:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[19:10:23] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[19:12:38] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6011.drmrs.wmnet with OS bullseye
[19:12:45] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6011.drmrs.wmnet with OS bullseye
[19:13:33] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:14:53] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:16:01] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:17:04] <logmsgbot>	 !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.20  refs T325583 (duration: 07m 04s)
[19:17:08] <stashbot>	 T325583: 1.40.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T325583
[19:17:09] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:21:29] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reimage for host centrallog1002.eqiad.wmnet with OS bullseye
[19:22:27] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:25:39] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:26:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[19:29:47] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:31:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[19:33:08] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage
[19:33:58] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on centrallog1002.eqiad.wmnet with reason: host reimage
[19:34:50] <wikibugs>	 (03PS1) 10Jdrewniak: Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883616 (https://phabricator.wikimedia.org/T327714)
[19:35:28] <wikibugs>	 (03PS1) 10Jdrewniak: Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883617 (https://phabricator.wikimedia.org/T327714)
[19:36:10] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage
[19:38:36] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on centrallog1002.eqiad.wmnet with reason: host reimage
[19:39:42] <wikibugs>	 (03Abandoned) 10Dzahn: phabricator: set enable_vcs to false in main profile [puppet] - 10https://gerrit.wikimedia.org/r/864852 (owner: 10Dzahn)
[19:44:28] <wikibugs>	 (03CR) 10Jdrewniak: "This change is ready for review." [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883619 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak)
[19:44:59] <wikibugs>	 (03CR) 10Jdrewniak: "This change is ready for review." [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883618 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak)
[19:50:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[19:52:57] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host centrallog1002.eqiad.wmnet with OS bullseye
[19:55:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[19:58:14] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6011.drmrs.wmnet with OS bullseye
[19:58:18] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6011.drmrs.wmnet with OS bullseye completed: - cp6011 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[19:58:54] <wikibugs>	 (03PS1) 10Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494)
[19:59:37] <wikibugs>	 (03PS2) 10Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494)
[20:00:38] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:00:47] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6011.drmrs.wmnet
[20:03:01] <wikibugs>	 (03PS3) 10Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494)
[20:04:58] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:05:47] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[20:08:29] <wikibugs>	 (03PS9) 10Andrea Denisse: centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - 10https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778)
[20:10:21] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6003.drmrs.wmnet with OS bullseye
[20:10:27] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6003.drmrs.wmnet with OS bullseye
[20:10:34] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39253/console" [puppet] - 10https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[20:11:44] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - 10https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[20:13:22] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:15:35] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[20:16:30] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:20:35] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[20:20:52] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:22:44] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] centrallog1002: Add to eqiad anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/882724 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[20:23:00] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39254/console" [puppet] - 10https://gerrit.wikimedia.org/r/882747 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[20:23:10] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] centrallog1002: Add to eqiad anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/882724 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[20:23:14] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+2 C: 03+2] centrallog1002: Add to eqiad anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/882724 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[20:26:13] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/882747/39254/" [puppet] - 10https://gerrit.wikimedia.org/r/882747 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[20:29:44] <icinga-wm>	 PROBLEM - Check systemd state on centrallog1002 is CRITICAL: CRITICAL - degraded: The following units failed: kafkatee.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:29:59] <wikibugs>	 10SRE, 10Sustainability (Incident Followup): Alert on Kask error rate - https://phabricator.wikimedia.org/T320401 (10BCornwall) This alerting would have been helpful for another [[ https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues | a recent incident ]] of the same nature.
[20:30:29] <wikibugs>	 (03PS2) 10Andrea Denisse: centrallog: Sync centrallog1001 to centrallog1002 [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778)
[20:30:35] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6003.drmrs.wmnet with reason: host reimage
[20:32:01] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39255/console" [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[20:32:45] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka jumbo-eqiad cluster: Reboot kafka nodes
[20:33:21] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/881813 (owner: 10Cwhite)
[20:33:46] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6003.drmrs.wmnet with reason: host reimage
[20:34:08] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/882760/39255/" [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[20:35:35] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39256/console" [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[20:47:14] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "This is the last CR for the failover." [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[20:49:01] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2028.codfw.wmnet
[20:49:08] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp2028.codfw.wmnet
[20:49:24] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2028.codfw.wmnet
[20:49:29] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp2028.codfw.wmnet
[20:49:58] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2028.codfw.wmnet
[20:50:07] <TheresNoTime>	 I can deploy in 10 minutes, and will be starting the merge of some of the larger patches now in preparation
[20:50:08] <wikibugs>	 10SRE, 10Sustainability (Incident Followup): sessionstore: alert on rate of 500 status - https://phabricator.wikimedia.org/T327960 (10Eevans)
[20:50:24] <wikibugs>	 10SRE, 10Sustainability (Incident Followup): sessionstore: alert on rate of 500 status - https://phabricator.wikimedia.org/T327960 (10Eevans) p:05Triage→03Medium
[20:50:31] <wikibugs>	 10SRE, 10Sustainability (Incident Followup): Alert on Kask error rate - https://phabricator.wikimedia.org/T320401 (10Eevans) p:05Triage→03Medium
[20:50:48] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "start merge for deploy" [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883619 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak)
[20:50:56] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "start merge for deploy" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883618 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak)
[20:51:02] <wikibugs>	 10SRE, 10Sustainability (Incident Followup): sessionstore: alert on rate of status 500 responses - https://phabricator.wikimedia.org/T327960 (10Eevans)
[20:51:20] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:54:56] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/883552 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert)
[20:56:09] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp2028.codfw.wmnet
[20:56:45] <wikibugs>	 (03PS2) 10Jdrewniak: Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883616 (https://phabricator.wikimedia.org/T327714)
[20:57:07] <wikibugs>	 (03PS2) 10Jdrewniak: Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883617 (https://phabricator.wikimedia.org/T327714)
[20:58:13] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2028.codfw.wmnet
[20:58:48] <wikibugs>	 (03PS3) 10Bking: flink-kubernetes-operator: bump version to 1.3.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881907 (https://phabricator.wikimedia.org/T324576)
[20:59:22] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts cp2028.codfw.wmnet
[20:59:24] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator: bump version to 1.3.1 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881907 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking)
[20:59:27] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink-kubernetes-operator: bump version to 1.3.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881907 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking)
[20:59:31] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6003.drmrs.wmnet with OS bullseye
[20:59:32] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2028.codfw.wmnet
[20:59:42] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6003.drmrs.wmnet with OS bullseye completed: - cp6003 (**WARN**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T2100).
[21:00:05] <jouncebot>	 Jan Drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:09] <TheresNoTime>	 I can deploy! (and have already started the merge of 883619 and 883618)
[21:00:22] <kindrobot>	 Thanks TheresNoTime :)
[21:01:08] <jan_drewniak>	 TheresNoTime: thanks! I made a last-minute change to the backports, now I only need 883616 and 883617
[21:01:31] <jan_drewniak>	 TheresNoTime: ah! let's try to cancel that merge!
[21:01:44] <wikibugs>	 (03CR) 10Jdrewniak: [C: 04-2] Account for temporary row in grid template row [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883619 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak)
[21:01:47] <TheresNoTime>	 jan_drewniak: damn
[21:01:58] <TheresNoTime>	 my bad, sorry
[21:02:00] <wikibugs>	 (03CR) 10Jdrewniak: [C: 04-2] Account for temporary row in grid template row [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883618 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak)
[21:02:13] <jan_drewniak>	 TheresNoTime: I think a -2 should do it
[21:02:27] <wikibugs>	 (03CR) 10Ottomata: flink-operator: bump version to 1.3.1 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking)
[21:02:31] <wikibugs>	 (03PS10) 10Ottomata: flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking)
[21:03:49] <jan_drewniak>	 TheresNoTime: sorry about that! these are really small changes so I merged them into one patch.
[21:04:00] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "deploy" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883616 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak)
[21:04:08] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "deploy" [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883617 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak)
[21:05:05] <TheresNoTime>	 jan_drewniak: no worries :) I think that -2 will do it, but if not I'll revert :D
[21:05:46] <wikibugs>	 (03CR) 10Ottomata: flink-operator: bump version to 1.3.1 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking)
[21:06:04] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp2028.codfw.wmnet
[21:07:11] <wikibugs>	 (03PS1) 10Bking: flink-operator: remove unnecessary newline [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883657 (https://phabricator.wikimedia.org/T324576)
[21:07:40] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink-operator: remove unnecessary newline [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883657 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking)
[21:08:02] <wikibugs>	 10SRE, 10Sustainability (Incident Followup): Alert on Kask error rate - https://phabricator.wikimedia.org/T320401 (10Eevans) I opened T327960 as well, it covers alerting based on the rate of HTTP status 500 responses.  It is (currently) the case that every status 500 will //also// emit an error log, so it woul...
[21:08:52] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking)
[21:09:24] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking)
[21:09:38] <wikibugs>	 (03CR) 10Ottomata: flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking)
[21:13:44] <wikibugs>	 (03PS11) 10Bking: flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576)
[21:17:48] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] flink-operator: bump version to 1.3.1 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking)
[21:21:50] <wikibugs>	 (03PS15) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[21:23:06] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:24:43] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:883617|Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed (T327714)]], [[gerrit:883616|Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed (T327714)]]
[21:24:47] <stashbot>	 T327714: Unexpected whitespace at the top of stub (short) articles in Vector 2022 - https://phabricator.wikimedia.org/T327714
[21:24:48] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[21:25:09] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[21:25:51] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39257/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[21:26:27] <logmsgbot>	 !log samtar@deploy1002 jdrewniak and samtar: Backport for [[gerrit:883617|Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed (T327714)]], [[gerrit:883616|Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed (T327714)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[21:26:43] <TheresNoTime>	 jan_drewniak: those two are live on mwdebug, can you test?
[21:27:37] <jan_drewniak>	 TheresNoTime: perfect, looks good to sync!
[21:27:42] <TheresNoTime>	 ack
[21:34:10] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:883617|Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed (T327714)]], [[gerrit:883616|Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed (T327714)]] (duration: 09m 27s)
[21:34:15] <stashbot>	 T327714: Unexpected whitespace at the top of stub (short) articles in Vector 2022 - https://phabricator.wikimedia.org/T327714
[21:34:22] <TheresNoTime>	 that's now live :)
[21:34:57] <jan_drewniak>	 TheresNoTime: awesome! thanks!
[21:36:09] <wikibugs>	 (03PS1) 10Ottomata: flink-app-example - set  upgradeMode: stateless [deployment-charts] - 10https://gerrit.wikimedia.org/r/883660 (https://phabricator.wikimedia.org/T316519)
[21:37:33] <wikibugs>	 (03CR) 10Bking: [C: 03+1] flink-app-example - set  upgradeMode: stateless [deployment-charts] - 10https://gerrit.wikimedia.org/r/883660 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata)
[21:41:33] <wikibugs>	 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10Dzahn) pinged on Slack
[21:42:00] <wikibugs>	 (03PS16) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[21:42:14] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] flink-app-example - set  upgradeMode: stateless [deployment-charts] - 10https://gerrit.wikimedia.org/r/883660 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata)
[21:43:27] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Archive wikifr-l Mailing list - https://phabricator.wikimedia.org/T320312 (10Dzahn) 05Open→03Resolved a:03Dzahn
[21:43:44] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Archive wikifr-l Mailing list - https://phabricator.wikimedia.org/T320312 (10Dzahn) a:05Dzahn→03Ladsgroup
[21:44:18] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/services/flink-app-example: apply
[21:44:32] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/flink-app-example: apply
[21:45:39] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39258/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[21:49:04] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/services/flink-app-example: apply
[21:49:09] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/flink-app-example: apply
[21:53:24] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM although please see commit msg nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/882747 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[21:54:56] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:03:52] <wikibugs>	 (03PS3) 10Andrea Denisse: centrallog: Add centrallog1002 to the kafka-jumbo allow list [puppet] - 10https://gerrit.wikimedia.org/r/882747 (https://phabricator.wikimedia.org/T318778)
[22:04:50] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] centrallog: Add centrallog1002 to the kafka-jumbo allow list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882747 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[22:05:26] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] centrallog: Add centrallog1002 to the kafka-jumbo allow list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882747 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[22:07:31] <wikibugs>	 (03CR) 10Herron: centrallog: Sync centrallog1001 to centrallog1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[22:10:17] <wikibugs>	 (03CR) 10Herron: rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[22:11:54] <wikibugs>	 (03PS3) 10Andrea Denisse: centrallog: Sync centrallog1001 to centrallog1002 [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778)
[22:13:13] <wikibugs>	 (03CR) 10Andrea Denisse: centrallog: Sync centrallog1001 to centrallog1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[22:13:47] <wikibugs>	 (03CR) 10Andrea Denisse: centrallog: Sync centrallog1001 to centrallog1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[22:14:50] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6003.drmrs.wmnet
[22:14:50] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM 🪵" [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[22:17:25] <wikibugs>	 (03PS3) 10Andrea Denisse: rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778)
[22:18:43] <wikibugs>	 (03CR) 10Andrea Denisse: rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[22:19:07] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] centrallog: Sync centrallog1001 to centrallog1002 [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[22:20:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Peachey88)
[22:20:50] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[22:21:21] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6012.drmrs.wmnet with OS bullseye
[22:21:27] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6012.drmrs.wmnet with OS bullseye
[22:22:30] <wikibugs>	 (03PS4) 10Andrea Denisse: rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778)
[22:24:19] <wikibugs>	 (03CR) 10Andrea Denisse: "It makes more sense to me to add a destination first and perform the failover of centrallog1001 -> centrallog1002 in another patch aft" [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[22:25:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:26:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney)
[22:29:08] <wikibugs>	 (03PS1) 10Jdlrobson: Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883668
[22:30:18] <wikibugs>	 (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15)
[22:30:55] <wikibugs>	 (03CR) 10Urbanecm: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15)
[22:31:02] <wikibugs>	 (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15)
[22:31:30] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883668 (owner: 10Jdlrobson)
[22:32:13] <wikibugs>	 (03Merged) 10jenkins-bot: Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883668 (owner: 10Jdlrobson)
[22:32:42] <wikibugs>	 (03CR) 10Zabe: Enable ResourceLoader client preferences on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883668 (owner: 10Jdlrobson)
[22:33:28] <wikibugs>	 (03PS2) 10Superpes15: Create additional namespaces on shn.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850)
[22:34:02] <wikibugs>	 (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15)
[22:34:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Create additional namespaces on shn.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15)
[22:36:29] <wikibugs>	 10SRE, 10Sustainability (Incident Followup): Alert on Kask error rate - https://phabricator.wikimedia.org/T320401 (10BCornwall) If the result of any errors in Kask is guaranteed to manifest as a 500 but not the other way around, I agree with monitoring only the status code.
[22:40:13] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage
[22:43:16] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage
[22:43:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:48:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:49:25] <wikibugs>	 (03PS3) 10Superpes15: Create additional namespaces on shn.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850)
[22:51:26] <zabe>	 jan_drewniak, ^ that config patch also enables ResourceLoader client preferences on prod (unlike what the commit message says), was that intended?
[23:03:46] <icinga-wm>	 PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[23:04:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:04:52] <wikibugs>	 (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15)
[23:07:20] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6012.drmrs.wmnet with OS bullseye
[23:07:26] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6012.drmrs.wmnet with OS bullseye completed: - cp6012 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[23:07:30] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "selected namespace IDs are free (https://shn.wikibooks.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces), LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15)
[23:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[23:10:30] <icinga-wm>	 RECOVERY - Check systemd state on mw2293 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:13:31] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6012.drmrs.wmnet
[23:14:32] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[23:14:56] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6004.drmrs.wmnet with OS bullseye
[23:15:03] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6004.drmrs.wmnet with OS bullseye
[23:18:44] <wikibugs>	 (03PS1) 10Zabe: Revert "Enable ResourceLoader client preferences on beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883676
[23:18:59] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Revert "Enable ResourceLoader client preferences on beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883676 (owner: 10Zabe)
[23:19:47] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Enable ResourceLoader client preferences on beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883676 (owner: 10Zabe)
[23:20:08] <logmsgbot>	 !log zabe@deploy1002 Backport cancelled.
[23:21:33] <logmsgbot>	 !log zabe@deploy1002 Started scap: (no justification provided)
[23:29:07] <logmsgbot>	 !log zabe@deploy1002 Finished scap: (no justification provided) (duration: 07m 34s)
[23:31:40] <wikibugs>	 (03CR) 10Zabe: Enable ResourceLoader client preferences on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883668 (owner: 10Jdlrobson)
[23:33:18] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6004.drmrs.wmnet with reason: host reimage
[23:36:20] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6004.drmrs.wmnet with reason: host reimage
[23:37:08] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.47 ms
[23:37:49] <wikibugs>	 (03PS1) 10Ebernhardson: Create scap deployment source for search airflow v2 [puppet] - 10https://gerrit.wikimedia.org/r/883678 (https://phabricator.wikimedia.org/T327970)
[23:46:56] <wikibugs>	 (03PS1) 10Ebernhardson: Configure search platform airflow 2 instance [puppet] - 10https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970)
[23:47:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Configure search platform airflow 2 instance [puppet] - 10https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson)
[23:54:34] <jan_drewniak>	 zabe: apologies! you're right that patch affected prod when it shouldn't have :( 
[23:57:09] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6004.drmrs.wmnet with OS bullseye
[23:57:16] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6004.drmrs.wmnet with OS bullseye completed: - cp6004 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[23:57:39] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6004.drmrs.wmnet
[23:58:20] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)