[00:10:43] !log denisse@cumin1001 START - Cookbook sre.hosts.reimage for host centrallog1002.eqiad.wmnet with OS bullseye
[00:10:58] !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host centrallog1002.eqiad.wmnet with OS bullseye
[00:29:31] (CR) Cwhite: [C: +2] logstash: remove ecs 1.11.0-2 template [puppet] - https://gerrit.wikimedia.org/r/882781 (https://phabricator.wikimedia.org/T325806) (owner: Cwhite)
[00:31:37] (CR) MSantos: [C: +1] maps: Add missing index script on import [puppet] - https://gerrit.wikimedia.org/r/883197 (owner: Jgiannelos)
[00:33:49] (CR) Cwhite: [C: +2] logstash: Add centrallog1002 as logsource for logstash tests [puppet] - https://gerrit.wikimedia.org/r/882762 (https://phabricator.wikimedia.org/T318778) (owner: Andrea Denisse)
[00:34:06] (PS3) Acamicamacaraca: Add sandbox link to Serbo-Croatian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/883221 (https://phabricator.wikimedia.org/T327833)
[00:59:27] (CR) Cwhite: [V: -1 C: -1] "See inline." [puppet] - https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[01:17:07] !log adjusting Gerrit group "Campaigns Team" so it is not recursively a member of itself
[01:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:47] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:39] SRE, DNS, Traffic-Icebox, Wikimedia-Apache-configuration, Patch-For-Review: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (Sotiale) Hi. Langcom is discussing this and is wondering how we can respond to the existing interwiki content...
[01:55:15] SRE, DNS, Traffic-Icebox, Wikimedia-Apache-configuration, Patch-For-Review: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (Ladsgroup) You can use global search: https://global-search.toolforge.org/?q=%22%5B%5B%3Aminnan%3A%22&namespa...
[02:07:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:43:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[03:51:24] (PS3) Samtar: deployment-prep: update prometheus host to prometheus05 [puppet] - https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782)
[04:55:34] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:36] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:43:57] (PS11) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[05:44:18] (CR) CI reject: [V: -1] Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[05:53:00] (PS12) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[05:56:55] (CR) Stevemunene: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39240/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[06:00:47] (PS1) Gergő Tisza: Enable WelcomeSurvey at viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/883301 (https://phabricator.wikimedia.org/T325376)
[06:10:44] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:17:47] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:10] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:25:18] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T0700)
[07:00:20] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:19:15] SRE, ops-eqiad, DBA, DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (Marostegui) Thanks John. I am going to reclone this host.
[07:20:19] (PS1) Marostegui: db1166: Disable notifcations [puppet] - https://gerrit.wikimedia.org/r/883304
[07:20:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166 to clone db1198', diff saved to https://phabricator.wikimedia.org/P43320 and previous config saved to /var/cache/conftool/dbconfig/20230125-072033-marostegui.json
[07:20:55] (CR) Marostegui: [C: +2] db1166: Disable notifcations [puppet] - https://gerrit.wikimedia.org/r/883304 (owner: Marostegui)
[07:26:20] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:31:41] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 33
[07:31:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 33
[07:34:15] !log phedenskog@deploy1002 Started deploy [performance/navtiming@bfff15d]: (no justification provided)
[07:34:21] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@bfff15d]: (no justification provided) (duration: 00m 05s)
[07:45:36] (PS1) Marostegui: db1206,db1196: Switch sanitarium master [puppet] - https://gerrit.wikimedia.org/r/883496 (https://phabricator.wikimedia.org/T327859)
[07:46:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1206 to clone db1196 T327859', diff saved to https://phabricator.wikimedia.org/P43322 and previous config saved to /var/cache/conftool/dbconfig/20230125-074601-marostegui.json
[07:46:05] T327859: Switch s1 sanitarium master from db1206 to db1196 - https://phabricator.wikimedia.org/T327859
[07:46:18] (CR) Marostegui: [C: +2] db1206,db1196: Switch sanitarium master [puppet] - https://gerrit.wikimedia.org/r/883496 (https://phabricator.wikimedia.org/T327859) (owner: Marostegui)
[07:48:38] (PS1) Ayounsi: Disable Telemetry on eqsin switches [homer/public] - https://gerrit.wikimedia.org/r/883497 (https://phabricator.wikimedia.org/T316532)
[07:49:09] (CR) CI reject: [V: -1] Disable Telemetry on eqsin switches [homer/public] - https://gerrit.wikimedia.org/r/883497 (https://phabricator.wikimedia.org/T316532) (owner: Ayounsi)
[07:49:18] (PS1) Marostegui: db1196: Future sanitarium master [puppet] - https://gerrit.wikimedia.org/r/883498 (https://phabricator.wikimedia.org/T327859)
[07:49:35] !log Cloning db1196 from db1206 (lag will appear on s1 wiki replicas) T327859
[07:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:45] (CR) Marostegui: [C: +2] db1196: Future sanitarium master [puppet] - https://gerrit.wikimedia.org/r/883498 (https://phabricator.wikimedia.org/T327859) (owner: Marostegui)
[07:50:12] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:50:24] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:51:22] (PS2) Ayounsi: Disable Telemetry on eqsin switches [homer/public] - https://gerrit.wikimedia.org/r/883497 (https://phabricator.wikimedia.org/T316532)
[07:54:41] (PS1) Marostegui: db1176: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883499 (https://phabricator.wikimedia.org/T327800)
[07:55:03] (CR) Marostegui: [C: +2] db1176: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883499 (https://phabricator.wikimedia.org/T327800) (owner: Marostegui)
[07:59:57] SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[08:00:03] SRE, Infrastructure-Foundations, netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (ayounsi)
[08:00:05] Amir1 and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T0800).
[08:00:05] Aca: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:13] Hello! Confirming my presence here
[08:01:47] let me check
[08:02:22] (CR) Ladsgroup: [C: +2] Add sandbox link to Serbo-Croatian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/883221 (https://phabricator.wikimedia.org/T327833) (owner: Acamicamacaraca)
[08:02:56] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/883221 (https://phabricator.wikimedia.org/T327833) (owner: Acamicamacaraca)
[08:03:07] (Merged) jenkins-bot: Add sandbox link to Serbo-Croatian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/883221 (https://phabricator.wikimedia.org/T327833) (owner: Acamicamacaraca)
[08:03:41] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:883221|Add sandbox link to Serbo-Croatian Wikipedia (T327833)]]
[08:03:45] T327833: Add sandbox link to Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T327833
[08:03:52] (PS1) Gerrit maintenance bot: mariadb: Promote db1103 to x1 master [puppet] - https://gerrit.wikimedia.org/r/882785 (https://phabricator.wikimedia.org/T327861)
[08:03:57] (PS1) Gerrit maintenance bot: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/883506 (https://phabricator.wikimedia.org/T327861)
[08:03:57] Should I open the Debug tool now?
[08:04:47] SRE, Infrastructure-Foundations, netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (ayounsi)
[08:05:34] !log ladsgroup@deploy1002 ladsgroup and aleksandar: Backport for [[gerrit:883221|Add sandbox link to Serbo-Croatian Wikipedia (T327833)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[08:07:10] Working as expected. Link is shown in the user menu now.
[08:07:57] awesome moving forward
[08:10:36] (PS1) Muehlenhoff: Extend group difference list for new mwdebuggers group [puppet] - https://gerrit.wikimedia.org/r/883500
[08:13:10] (CR) Marostegui: [C: -2] "Wait for the switchover date" [puppet] - https://gerrit.wikimedia.org/r/882785 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[08:13:13] (CR) Marostegui: [C: -2] "Wait for the switchover date" [dns] - https://gerrit.wikimedia.org/r/883506 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[08:13:55] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:883221|Add sandbox link to Serbo-Croatian Wikipedia (T327833)]] (duration: 10m 13s)
[08:13:59] T327833: Add sandbox link to Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T327833
[08:14:17] (CR) Marostegui: [C: +1] "I am removing the grants from the DB now. So feel free to merge this as you wish" [puppet] - https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[08:17:16] Aca: It's done
[08:17:17] The change is live now. Thanks, Amir1!
[08:20:06] (CR) Elukey: [C: +1] changeprop: use wmf-certificates instead of puppet_ca_crt (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/883240 (owner: Hnowlan)
[08:26:01] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for rpcbind [puppet] - https://gerrit.wikimedia.org/r/881386 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:31:52] (PS2) Muehlenhoff: Enable profile::auto_restarts::service for nfs-idmapd [puppet] - https://gerrit.wikimedia.org/r/881393 (https://phabricator.wikimedia.org/T135991)
[08:34:47] SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[08:34:55] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for nfs-idmapd [puppet] - https://gerrit.wikimedia.org/r/881393 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:39:21] (PS2) Muehlenhoff: Enable profile::auto_restarts::service for blkmapd [puppet] - https://gerrit.wikimedia.org/r/881399 (https://phabricator.wikimedia.org/T135991)
[08:40:25] !log bump SGIX max prefix limit
[08:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:49] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for blkmapd [puppet] - https://gerrit.wikimedia.org/r/881399 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:44:58] (PS2) Muehlenhoff: Enable profile::auto_restarts::service for nfs.mountd [puppet] - https://gerrit.wikimedia.org/r/881413 (https://phabricator.wikimedia.org/T135991)
[08:51:22] SRE, Infrastructure-Foundations, netops: Use mgmt_junos on all network devices - https://phabricator.wikimedia.org/T327862 (Peachey88)
[08:53:25] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for nfs.mountd [puppet] - https://gerrit.wikimedia.org/r/881413 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:59:17] SRE, Infrastructure-Foundations: MIgrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (MoritzMuehlenhoff)
[08:59:19] (PS1) Giuseppe Lavagetto: sre-mediawiki: add mean latency alerts [alerts] - https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544)
[08:59:57] (PS13) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[08:59:58] SRE, Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (Peachey88)
[09:00:04] brennen and jnuche: gettimeofday() says it's time for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T0900)
[09:00:50] (CR) Ayounsi: [C: +2] "Awesome! Thanks" [puppet] - https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: Filippo Giunchedi)
[09:01:08] (CR) CI reject: [V: -1] sre-mediawiki: add mean latency alerts [alerts] - https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) (owner: Giuseppe Lavagetto)
[09:10:11] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:10:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:11:26] (PS14) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[09:14:51] (CR) Stevemunene: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39242/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[09:15:38] (CR) Deni: [C: +1] "Looks good to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/883222 (https://phabricator.wikimedia.org/T327864) (owner: Acamicamacaraca)
[09:30:29] !log rolling depool & update of thanos front-ends T327871
[09:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:20] (PS2) Muehlenhoff: puppetdb: No longer use the component on booworm [puppet] - https://gerrit.wikimedia.org/r/881598
[09:35:39] (PS1) Marostegui: mariadb: Promote db1196 to sanitarium master in s1 [puppet] - https://gerrit.wikimedia.org/r/883527 (https://phabricator.wikimedia.org/T327859)
[09:35:57] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:06] (CR) Marostegui: [C: +2] mariadb: Promote db1196 to sanitarium master in s1 [puppet] - https://gerrit.wikimedia.org/r/883527 (https://phabricator.wikimedia.org/T327859) (owner: Marostegui)
[09:38:48] (PS1) Marostegui: db1206: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883528
[09:39:17] (CR) Marostegui: [C: +2] db1206: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883528 (owner: Marostegui)
[09:39:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: After recloning', diff saved to https://phabricator.wikimedia.org/P43325 and previous config saved to /var/cache/conftool/dbconfig/20230125-093918-root.json
[09:40:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 1%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43326 and previous config saved to /var/cache/conftool/dbconfig/20230125-094029-root.json
[09:42:14] (PS1) Marostegui: db1166: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883529
[09:42:33] (CR) Marostegui: [C: +2] db1166: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883529 (owner: Marostegui)
[09:46:12] (CR) Muehlenhoff: [C: +2] puppetdb: No longer use the component on booworm [puppet] - https://gerrit.wikimedia.org/r/881598 (owner: Muehlenhoff)
[09:47:07] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:47:16] (PS1) Filippo Giunchedi: Revert "wdqs: add recording rule for req success ratio" [puppet] - https://gerrit.wikimedia.org/r/883223
[09:49:19] PROBLEM - DPKG on thanos-fe1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:49:29] (CR) CI reject: [V: -1] Revert "wdqs: add recording rule for req success ratio" [puppet] - https://gerrit.wikimedia.org/r/883223 (owner: Filippo Giunchedi)
[09:49:51] PROBLEM - Thanos swift https on thanos-fe1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Thanos
[09:49:59] (CR) Filippo Giunchedi: [V: +2 C: +2] Revert "wdqs: add recording rule for req success ratio" [puppet] - https://gerrit.wikimedia.org/r/883223 (owner: Filippo Giunchedi)
[09:51:09] PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:23] ryankemper herron FYI ^ reverted the wdqs slo rules, due to invalid syntax, sorry ATM I don't have time to look into it further
[09:52:45] RECOVERY - Check systemd state on thanos-fe1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:52:57] (PS1) Marostegui: db1196: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883535 (https://phabricator.wikimedia.org/T327859)
[09:53:07] RECOVERY - Thanos swift https on thanos-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Thanos
[09:53:24] (CR) Marostegui: [C: +2] db1196: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/883535 (https://phabricator.wikimedia.org/T327859) (owner: Marostegui)
[09:54:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 5%: After recloning', diff saved to https://phabricator.wikimedia.org/P43327 and previous config saved to /var/cache/conftool/dbconfig/20230125-095400-root.json
[09:54:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: After recloning', diff saved to https://phabricator.wikimedia.org/P43328 and previous config saved to /var/cache/conftool/dbconfig/20230125-095423-root.json
[09:55:03] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/883249 (owner: Slyngshede)
[09:55:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 5%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43329 and previous config saved to /var/cache/conftool/dbconfig/20230125-095534-root.json
[09:58:17] (CR) Jbond: [C: +1] "lgtm" [software/pywmflib] - https://gerrit.wikimedia.org/r/883247 (owner: Volans)
[09:58:51] (CR) Volans: [C: +2] setup.py: force a newer sphinx_rtd_theme [software/pywmflib] - https://gerrit.wikimedia.org/r/883247 (owner: Volans)
[09:59:36] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:59:38] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:00:42] PROBLEM - Thanos swift https on thanos-fe1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.046 second response time https://wikitech.wikimedia.org/wiki/Thanos
[10:03:05] (Merged) jenkins-bot: setup.py: force a newer sphinx_rtd_theme [software/pywmflib] - https://gerrit.wikimedia.org/r/883247 (owner: Volans)
[10:03:08] PROBLEM - Check systemd state on thanos-fe1003 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:10] RECOVERY - Check systemd state on thanos-fe1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:48] RECOVERY - Thanos swift https on thanos-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Thanos
[10:07:51] (PS1) Volans: setup.py: force a newer sphinx_rtd_theme [software/spicerack] - https://gerrit.wikimedia.org/r/883538
[10:08:14] (PS1) Jelto: KubernetesAPIErrorRate: make alert v1.23 compatible [alerts] - https://gerrit.wikimedia.org/r/883539 (https://phabricator.wikimedia.org/T322919)
[10:08:27] (CR) Hnowlan: [C: +2] changeprop: use wmf-certificates instead of puppet_ca_crt [deployment-charts] - https://gerrit.wikimedia.org/r/883240 (owner: Hnowlan)
[10:09:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: After recloning', diff saved to https://phabricator.wikimedia.org/P43330 and previous config saved to /var/cache/conftool/dbconfig/20230125-100904-root.json
[10:09:10] (PS1) Volans: setup.py: force a newer sphinx_rtd_theme [software/cumin] - https://gerrit.wikimedia.org/r/883540
[10:09:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: After recloning', diff saved to https://phabricator.wikimedia.org/P43331 and previous config saved to /var/cache/conftool/dbconfig/20230125-100928-root.json
[10:10:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43332 and previous config saved to /var/cache/conftool/dbconfig/20230125-101039-root.json
[10:14:43] (Merged) jenkins-bot: changeprop: use wmf-certificates instead of puppet_ca_crt [deployment-charts] - https://gerrit.wikimedia.org/r/883240 (owner: Hnowlan)
[10:18:27] hnowlan: \o/
[10:18:34] ops-codfw, ops-eqiad: Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (ayounsi)
[10:19:38] RECOVERY - DPKG on thanos-fe1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[10:21:17] ops-codfw, ops-eqiad: Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (ayounsi) p:Triage→Low
[10:22:01] (CR) Jbond: [C: +1] D:apereo_cas::service: Map memberOf to OIDC (1 comment) [puppet] - https://gerrit.wikimedia.org/r/883249 (owner: Slyngshede)
[10:24:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: After recloning', diff saved to https://phabricator.wikimedia.org/P43333 and previous config saved to /var/cache/conftool/dbconfig/20230125-102409-root.json
[10:24:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: After recloning', diff saved to https://phabricator.wikimedia.org/P43334 and previous config saved to /var/cache/conftool/dbconfig/20230125-102433-root.json
[10:25:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43335 and previous config saved to /var/cache/conftool/dbconfig/20230125-102544-root.json
[10:36:41] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:37:49] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:38:41] (CR) Elukey: "John I changed the name of the CAs to reflect more that liftwing == ml-serve k8s clusters (after chatting with Janis). Lemme know if the +" [puppet] - https://gerrit.wikimedia.org/r/883167 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[10:39:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: After recloning', diff saved to https://phabricator.wikimedia.org/P43336 and previous config saved to /var/cache/conftool/dbconfig/20230125-103914-root.json
[10:39:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: After recloning', diff saved to https://phabricator.wikimedia.org/P43337 and previous config saved to /var/cache/conftool/dbconfig/20230125-103938-root.json
[10:39:56] SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Marostegui) Talked to John about it, we'll try to get it done this week :)
[10:40:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43338 and previous config saved to /var/cache/conftool/dbconfig/20230125-104049-root.json
[10:41:36] SRE, Patch-For-Review, Performance-Team (Radar): unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet - https://phabricator.wikimedia.org/T239862 (Clement_Goubert) >>! In T239862#8504801, @LSobanski wrote: > @Joe's patch mentioned above has been merged in Feb 2021 and the hardcoded IP conf...
[10:43:43] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply
[10:47:21] SRE, DBA, Data-Persistence, Data-Persistence-Backup, and 2 others: Data check es2020 after replication broke - https://phabricator.wikimedia.org/T327770 (jcrespo) Open→In progress
[10:47:33] SRE, DBA, Data-Persistence, Data-Persistence-Backup, and 2 others: Data check es2020 after replication broke - https://phabricator.wikimedia.org/T327770 (jcrespo) I "documented" how I did it in case it is useful and for sanity check: {P43339}
[10:48:40] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[10:48:55] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply
[10:49:33] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[10:49:43] SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (hnowlan)
[10:51:45] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:54:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: After recloning', diff saved to https://phabricator.wikimedia.org/P43340 and previous config saved to /var/cache/conftool/dbconfig/20230125-105419-root.json
[10:54:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: After recloning', diff saved to https://phabricator.wikimedia.org/P43341 and previous config saved to /var/cache/conftool/dbconfig/20230125-105443-root.json
[10:54:46] !log restarting lvs on lvs2010 for thumbor healthcheck change
[10:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43342 and previous config saved to /var/cache/conftool/dbconfig/20230125-105554-root.json
[11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T1100)
[11:00:41] !log restarting lvs on lvs1010 for thumbor healthcheck change
[11:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:52] !log restarting lvs on lvs1020 for thumbor healthcheck change
[11:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:57] sigh
[11:02:50] (PS1) Sergio Gimeno: User impact: ammend incorrect parameter for the single day streak text [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/883547 (https://phabricator.wikimedia.org/T327824)
[11:08:24] !log restarting lvs on lvs2009 for thumbor healthcheck change
[11:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: After recloning', diff saved to https://phabricator.wikimedia.org/P43343 and previous config saved to /var/cache/conftool/dbconfig/20230125-110924-root.json
[11:10:05] (PS1) Sergio Gimeno: User impact: ammend incorrect parameter for the single day streak text [extensions/GrowthExperiments] (wmf/1.40.0-wmf.19) - https://gerrit.wikimedia.org/r/883548 (https://phabricator.wikimedia.org/T327824)
[11:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[11:11:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: After cloning db1198', diff saved to https://phabricator.wikimedia.org/P43344 and previous config saved to /var/cache/conftool/dbconfig/20230125-111059-root.json
[11:11:48] (Abandoned) Sergio Gimeno: User impact: ammend incorrect parameter for the single day streak text [extensions/GrowthExperiments] (wmf/1.40.0-wmf.19) - https://gerrit.wikimedia.org/r/883548 (https://phabricator.wikimedia.org/T327824) (owner: Sergio Gimeno)
[11:12:09] !log restarting lvs on lvs1019 for thumbor healthcheck change
[11:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:54] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[11:19:45] (PS2) Kosta Harlan: User impact: amend incorrect parameter for the single day streak text [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/883547 (https://phabricator.wikimedia.org/T327824) (owner: Sergio Gimeno)
[11:20:14] jouncebot: now
[11:20:14] For the next 0 hour(s) and 39 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T1100)
[11:20:36] I’d like to run a maintenance script for Wikidata – shout if I shouldn’t, otherwise I’ll go ahead in 5 minutes or so :)
[11:21:23] (PS1) Jakob: REST: Use error log level for unexpected errors [extensions/Wikibase] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490)
[11:22:06] (CR) Silvan Heintze: [C: +1] "thanks!"
[extensions/Wikibase] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: 10Jakob) [11:25:59] ok, I’m starting the maintenance script for T325942 [11:25:59] T325942: Christmas 2022 wbs_propertypairs table update on Wikidata - https://phabricator.wikimedia.org/T325942 [11:26:06] shouldn’t take too long, a few minutes according to the docs [11:26:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Jclark-ctr) [11:27:18] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [11:29:16] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Jclark-ctr) @Papaul is there a day that you can assist me with making network changes to complete this? [11:30:47] (03PS5) 10Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) [11:31:19] (03PS6) 10Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) [11:31:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host scandium.eqiad.wmnet [11:32:54] (03CR) 10Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) (owner: 10Btullis) [11:32:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Jclark-ctr) 05Open→03Resolved This is complete @BTullis [11:33:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 
10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10BTullis) Great. Many thanks. [11:34:45] !log Updated the Wikidata property suggester with data from 20230102's JSON dump (T325942) [11:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:49] T325942: Christmas 2022 wbs_propertypairs table update on Wikidata - https://phabricator.wikimedia.org/T325942 [11:34:49] * Lucas_WMDE done [11:37:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host scandium.eqiad.wmnet [11:38:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testreduce1001.eqiad.wmnet [11:41:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1001.eqiad.wmnet [11:53:08] (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/883552 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert) [11:56:07] (03CR) 10Volans: [C: 03+1] "LGTM, just make sure that the generated data is indeed what expected when deploying it." 
[dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) (owner: 10Btullis) [11:58:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install1004.wikimedia.org [11:58:10] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:05:30] (03CR) 10Volans: "couple of comments inline" [dns] - 10https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert) [12:10:28] (03CR) 10Lucas Werkmeister (WMDE): REST: Use error log level for unexpected errors (031 comment) [extensions/Wikibase] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: 10Jakob) [12:12:01] !log installing libtasn security updates on buster [12:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:24] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install1004.wikimedia.org - jmm@cumin2002" [12:13:19] (03CR) 10Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) (owner: 10Btullis) [12:13:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install1004.wikimedia.org - jmm@cumin2002" [12:13:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:13:22] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install1004.wikimedia.org on all recursors [12:13:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install1004.wikimedia.org on all recursors [12:14:38] (03Abandoned) 10Muehlenhoff: Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883227 (https://phabricator.wikimedia.org/T327783) (owner: 
10Muehlenhoff) [12:15:00] 10SRE, 10SRE-Access-Requests: Add new SSH key for Santhosh Thottingal for production access - https://phabricator.wikimedia.org/T327891 (10santhosh) [12:15:04] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye [12:15:10] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye [12:19:42] (03PS2) 10Clément Goubert: wmnet: Rename aux-k8s-ingress service to k8s-ingress-aux [dns] - 10https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756) [12:20:08] (03CR) 10Clément Goubert: wmnet: Rename aux-k8s-ingress service to k8s-ingress-aux (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert) [12:23:03] 10SRE, 10SRE-Access-Requests: Add new SSH key for Santhosh Thottingal for production access - https://phabricator.wikimedia.org/T327891 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium a:03Clement_Goubert [12:27:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install1004.wikimedia.org [12:32:30] (03PS2) 10Muehlenhoff: Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) [12:33:36] (03PS9) 10Btullis: Rename ceph profiles to cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) [12:33:52] (03PS1) 10Ilias Sarantopoulos: ml-services: update staging image with new nltk dependency [deployment-charts] - 10https://gerrit.wikimedia.org/r/883559 [12:34:20] (03CR) 10CI reject: [V: 04-1] Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) (owner: 10Muehlenhoff) [12:35:00] (03CR) 10Stevemunene: [V: 03+1] 
Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [12:37:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install2004.wikimedia.org [12:37:29] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:41:50] !log restarting slapd on r/w servers to pick up new libtasn [12:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:51] (03PS10) 10Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) [12:42:54] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Clement_Goubert) FYI this created warnings in `cross-validate-accounts`, CR incoming. [12:42:57] !log filippo@cumin1001 conftool action : set/pooled=no; selector: service=thanos-web,name=thanos-fe1002.eqiad.wmnet [12:43:04] !log filippo@cumin1001 conftool action : set/pooled=no; selector: service=thanos-web,name=thanos-fe1003.eqiad.wmnet [12:43:46] !log filippo@cumin1001 conftool action : set/pooled=no; selector: service=thanos-web,name=thanos-fe2002.codfw.wmnet [12:43:50] !log filippo@cumin1001 conftool action : set/pooled=no; selector: service=thanos-web,name=thanos-fe2003.codfw.wmnet [12:44:10] (03CR) 10Clément Goubert: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/883562 (https://phabricator.wikimedia.org/T305979) (owner: 10Clément Goubert) [12:44:14] FWIW I did the above because thanos-web is behind SSO and only one host can be pooled at the time (sso sessions are not shared) [12:44:17] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10MoritzMuehlenhoff) >>! In T305979#8557028, @Clement_Goubert wrote: > FYI this created warnings in `cross-valid... [12:44:20] bummer I know :( [12:45:14] !log restarting Exim on MXes to pick up new libtasn [12:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:21] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:47:59] (03CR) 10Clément Goubert: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/883561 (https://phabricator.wikimedia.org/T327891) (owner: 10Clément Goubert) [12:49:02] (03PS3) 10Muehlenhoff: Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) [12:50:01] (03CR) 10Clément Goubert: [C: 03+1] Extend group difference list for new mwdebuggers group [puppet] - 10https://gerrit.wikimedia.org/r/883500 (owner: 10Muehlenhoff) [12:50:03] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:50:32] (03Abandoned) 10Clément Goubert: openldap: Add mwdebuggers to cross-validate-accounts [puppet] - 10https://gerrit.wikimedia.org/r/883562 (https://phabricator.wikimedia.org/T305979) (owner: 10Clément Goubert) [12:50:54] (03CR) 10CI reject: [V: 04-1] Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) (owner: 10Muehlenhoff) [12:51:51] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Clement_Goubert) >>! In T305979#8557029, @MoritzMuehlenhoff wrote: >>>! In T305979#8557028, @Clement_Goubert w... 
[12:53:07] (03CR) 10Jbond: [C: 03+1] profile::pki::root_ca: add new intermediates for liftwing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883167 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [12:53:10] (03CR) 10Herron: "Thx for addressing this, will take a closer look" [puppet] - 10https://gerrit.wikimedia.org/r/883223 (owner: 10Filippo Giunchedi) [12:54:01] !log jnuche@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 [12:54:23] !log jnuche@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 00m 21s) [12:54:59] !log disable puppet fleet wide to deploy gerrit:883233 [12:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:02] (03CR) 10Muehlenhoff: [C: 03+2] Extend group difference list for new mwdebuggers group [puppet] - 10https://gerrit.wikimedia.org/r/883500 (owner: 10Muehlenhoff) [12:57:02] (03PS1) 10Hnowlan: imagemagick: use JSON output from exiftool [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/883564 (https://phabricator.wikimedia.org/T327887) [12:59:19] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install2004.wikimedia.org - jmm@cumin2002" [12:59:33] (03CR) 10Jbond: [C: 03+2] Puppetfile: order puppet file and add some addtional notes [puppet] - 10https://gerrit.wikimedia.org/r/883232 (owner: 10Jbond) [12:59:37] (03CR) 10Jbond: [C: 03+2] augeas_core: add augeas core module to the vendor modules [puppet] - 10https://gerrit.wikimedia.org/r/883233 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [13:00:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install2004.wikimedia.org - jmm@cumin2002" [13:00:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:00:20] 
!log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install2004.wikimedia.org on all recursors [13:00:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install2004.wikimedia.org on all recursors [13:04:35] !log enable puppet fleet wide to post deploy gerrit:883233 [13:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:55] !log puppet now using vendored version of augeas-core https://gerrit.wikimedia.org/r/c/operations/puppet/+/883233 [13:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:40] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4037.ulsfo.wmnet with OS bullseye [13:11:45] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye executed with errors: - cp4037 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[13:14:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install2004.wikimedia.org [13:17:11] (03CR) 10DCausse: [C: 03+1] dse-k8s: add rdf-streaming-updater namespace [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking) [13:17:16] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Jclark-ctr) Pulled drive will advise when it can be reinserted [13:18:21] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:18:29] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:19:19] (03PS1) 10Ayounsi: Stop using profile::contact [puppet] - 10https://gerrit.wikimedia.org/r/883565 [13:20:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack::haproxy::site: don't provision backend FW rules [puppet] - 10https://gerrit.wikimedia.org/r/868070 (owner: 10Majavah) [13:21:42] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/883565 (owner: 10Ayounsi) [13:22:26] (03PS1) 10Muehlenhoff: prospector: Allow longer variable names [cookbooks] - 10https://gerrit.wikimedia.org/r/883566 [13:26:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install3002.wikimedia.org [13:26:27] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:27:41] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [13:27:44] hashar: any maintenance ongoing on gerrit? 
seem down for multiple of us [13:27:52] hello icinga-wm, just in time [13:27:58] cc slyngs, _joe_ [13:28:49] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [13:29:08] volans: Looking [13:30:17] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 67102 bytes in 0.043 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [13:30:23] (03CR) 10Ayounsi: "The error doesn't seem to be related to this change, but to the ping1002 decom." [puppet] - 10https://gerrit.wikimedia.org/r/883565 (owner: 10Ayounsi) [13:30:47] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 01 Mar 2023 09:47:05 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [13:30:50] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install3002.wikimedia.org - jmm@cumin2002" [13:31:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install3002.wikimedia.org - jmm@cumin2002" [13:31:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:31:53] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install3002.wikimedia.org on all recursors [13:31:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install3002.wikimedia.org on all recursors [13:32:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883565 (owner: 10Ayounsi) [13:32:39] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883565 (owner: 10Ayounsi) [13:32:47] (JobUnavailable) resolved: Reduced availability for job gerrit in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:34:21] (03CR) 10Volans: [C: 04-1] "That's not the correct regex" [cookbooks] - 10https://gerrit.wikimedia.org/r/883566 (owner: 10Muehlenhoff) [13:34:36] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on db1206 is CRITICAL: communication 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T327902 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [13:34:40] 10SRE, 10ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T327902 (10ops-monitoring-bot) [13:35:02] (03CR) 10Volans: "thanks for the fix!" [dns] - 10https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert) [13:35:24] (03CR) 10Ayounsi: [C: 03+2] Stop using profile::contact [puppet] - 10https://gerrit.wikimedia.org/r/883565 (owner: 10Ayounsi) [13:36:09] 10SRE, 10ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T327902 (10Marostegui) [13:36:10] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) [13:36:13] (03PS2) 10Muehlenhoff: prospector: Allow longer variable names [cookbooks] - 10https://gerrit.wikimedia.org/r/883566 [13:36:16] (03CR) 10Muehlenhoff: prospector: Allow longer variable names (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/883566 (owner: 10Muehlenhoff) [13:38:15] (03CR) 10Muehlenhoff: "Shouldn't we also remove profile::contact itself?" 
[puppet] - 10https://gerrit.wikimedia.org/r/883565 (owner: 10Ayounsi) [13:38:43] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) @Volans @MoritzMuehlenhoff so the task about the degraded RAID gets created correctly (T327902). It would be nice to get the usual output... [13:38:47] 10SRE, 10ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T327902 (10Marostegui) 05Open→03Declined This is part of a test T325046 [13:38:49] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) [13:39:10] (03CR) 10Jakob: REST: Use error log level for unexpected errors (031 comment) [extensions/Wikibase] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: 10Jakob) [13:39:14] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10MoritzMuehlenhoff) >>! In T325046#8557233, @Marostegui wrote: > @Volans @MoritzMuehlenhoff so the task about the degraded RAID gets created correctly... [13:42:30] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: metadata: allow haproxy backend connections [puppet] - 10https://gerrit.wikimedia.org/r/883571 [13:46:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install3002.wikimedia.org [13:48:16] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Papaul) @Jclark-ctr does this Friday 9:30 am CT work for you? Also before that can you update the task with the exact rack location where the servers are moving? 
[13:49:36] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) Thanks Moritz, do you need the disk to be left out? [13:49:44] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/883571/39244/" [puppet] - 10https://gerrit.wikimedia.org/r/883571 (owner: 10Arturo Borrero Gonzalez) [13:50:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install4002.wikimedia.org [13:50:17] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host install4002.wikimedia.org [13:51:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install4002.wikimedia.org [13:51:10] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host install4002.wikimedia.org [13:51:43] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "(good to deploy as far as I’m concerned)" [extensions/Wikibase] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: 10Jakob) [13:52:57] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openstack: nova: metadata: allow haproxy backend connections [puppet] - 10https://gerrit.wikimedia.org/r/883571 (owner: 10Arturo Borrero Gonzalez) [13:56:00] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) We chatted on IRC and we are leaving the disk on a failed state for now until @MoritzMuehlenhoff is done with his tests. [13:56:13] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10MoritzMuehlenhoff) >>! In T325046#8557292, @Marostegui wrote: > Thanks Moritz, do you need the disk to be left out? Yeah, let's keep it for a few d... 
[14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T1400). [14:00:05] jakob_WMDE, MichaelG_WMDE, Aca, and Sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:08] Confirming my presence. Hello once again! :) [14:00:09] * MichaelG_WMDE waves 👋 [14:00:13] I can deploy today! [14:00:15] hi [14:00:16] hello everyone [14:00:21] hi! [14:00:22] hey sergi0_ [14:00:29] hi jakob_WMDE! [14:00:36] (03CR) 10Urbanecm: [C: 03+2] REST: Use error log level for unexpected errors [extensions/Wikibase] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: 10Jakob) [14:00:39] (03CR) 10Urbanecm: [C: 03+2] User impact: amend incorrect parameter for the single day streak text [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883547 (https://phabricator.wikimedia.org/T327824) (owner: 10Sergio Gimeno) [14:01:04] 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10TheDJ) The above patch should cause native lazy loading of images by the browser. This w... [14:01:23] MichaelG_WMDE: hi, are you around? :) or will jakob_WMDE handle the config patch too? [14:01:36] I'm around :) [14:01:45] And I think we both are able to do so [14:01:54] but we should do jakob's change first [14:02:09] I’m busy for now, can deploy later if needed [14:02:17] (03CR) 10Urbanecm: [C: 03+1] "id is free (https://sh.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces), otherwise LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/883222 (https://phabricator.wikimedia.org/T327864) (owner: 10Acamicamacaraca) [14:02:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883222 (https://phabricator.wikimedia.org/T327864) (owner: 10Acamicamacaraca) [14:02:49] MichaelG_WMDE: so, the config change depends on the backport? [14:03:08] (03Merged) 10jenkins-bot: Enable Draft namespace on Serbo-Croatian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883222 (https://phabricator.wikimedia.org/T327864) (owner: 10Acamicamacaraca) [14:03:08] technically no [14:03:31] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:883222|Enable Draft namespace on Serbo-Croatian Wikipedia (T327864)]] [14:03:34] okay [14:03:35] T327864: Enable Draft namespace on Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T327864 [14:03:36] the backport raises error logging-level for the API endpoint that is enabled by the config change [14:03:46] gotcha [14:04:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install5002.wikimedia.org [14:05:00] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:05:20] !log urbanecm@deploy1002 aleksandar and urbanecm: Backport for [[gerrit:883222|Enable Draft namespace on Serbo-Croatian Wikipedia (T327864)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [14:05:34] Aca: your change is available for testing at mwdebug1001. can you check? [14:05:40] Yep. 
On it [14:07:34] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye [14:07:40] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye [14:08:03] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5002.wikimedia.org - jmm@cumin2002" [14:08:54] Working as expected. Draft namespace is now identified as a separate namespace in the lists and on the page info. [14:09:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5002.wikimedia.org - jmm@cumin2002" [14:09:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:09:01] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install5002.wikimedia.org on all recursors [14:09:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install5002.wikimedia.org on all recursors [14:10:08] Aca: thanks, syncing [14:10:50] (03CR) 10Elukey: [C: 03+2] ml-services: update staging image with new nltk dependency [deployment-charts] - 10https://gerrit.wikimedia.org/r/883559 (owner: 10Ilias Sarantopoulos) [14:16:15] (03CR) 10Ottomata: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [14:16:21] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:30] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:883222|Enable 
Draft namespace on Serbo-Croatian Wikipedia (T327864)]] (duration: 12m 59s) [14:16:34] T327864: Enable Draft namespace on Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T327864 [14:16:38] Aca: your change should be live [14:16:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: 10Jakob) [14:16:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883547 (https://phabricator.wikimedia.org/T327824) (owner: 10Sergio Gimeno) [14:17:13] Yeah, it is. Thank you! Have a nice day y'all! [14:17:33] you too! [14:18:00] (03Merged) 10jenkins-bot: REST: Use error log level for unexpected errors [extensions/Wikibase] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883224 (https://phabricator.wikimedia.org/T327490) (owner: 10Jakob) [14:20:43] (03Merged) 10jenkins-bot: User impact: amend incorrect parameter for the single day streak text [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883547 (https://phabricator.wikimedia.org/T327824) (owner: 10Sergio Gimeno) [14:21:10] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:883224|REST: Use error log level for unexpected errors (T327490)]], [[gerrit:883547|User impact: amend incorrect parameter for the single day streak text (T327824)]] [14:21:15] T327490: Create an easy way to observe/monitor Wikibase REST API errors happening on Wikidata - https://phabricator.wikimedia.org/T327490 [14:21:16] T327824: [wmf-20] testwiki - ext-growthExperiments-ScoreCards__scorecard__info for one day streak shows $3 - https://phabricator.wikimedia.org/T327824 [14:21:17] finally :) [14:21:50] urbanecm: fyi as MichaelG_WMDE said the wikibase backport is a trivial patch to raise the log level 
for REST API errors. not possible to verify on its own, but we'll see whether everything works once we flip the config switch [14:21:58] gotcha [14:22:10] I'll just deploy it together with sergi0_'s backport [14:22:14] (03PS3) 10BBlack: Possibly mitigate ATS bug with semicolon in Path [puppet] - 10https://gerrit.wikimedia.org/r/882663 (https://phabricator.wikimedia.org/T238285) [14:23:15] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:23:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install5002.wikimedia.org [14:24:07] 10SRE, 10ops-codfw, 10ops-eqiad: Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (10RobH) Please note all 5 of these host are old HP ProLiants. I'm not sure where this setting is on these hosts, but I'm assuming in the bios and each of these will require downtime/reboot to disable.... [14:24:55] 10SRE, 10ops-codfw, 10ops-eqiad, 10Data-Persistence, 10cloud-services-team (Hardware): Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (10RobH) [14:25:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install6002.wikimedia.org [14:25:52] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:27:05] (03CR) 10BBlack: [C: 03+2] Possibly mitigate ATS bug with semicolon in Path [puppet] - 10https://gerrit.wikimedia.org/r/882663 (https://phabricator.wikimedia.org/T238285) (owner: 10BBlack) [14:28:39] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage [14:29:00] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[14:29:08] 10SRE, 10ops-codfw, 10ops-eqiad, 10Data-Persistence, 10cloud-services-team (Hardware): Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (10ayounsi) Note that it's on their IPMI/ILO interfaces, not sure they need to go down. [14:29:10] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [14:29:20] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:29:27] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:29:37] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [14:29:47] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:29:47] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install6002.wikimedia.org - jmm@cumin2002" [14:29:55] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[14:30:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS bullseye [14:30:07] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye [14:30:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install6002.wikimedia.org - jmm@cumin2002" [14:30:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:30:49] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install6002.wikimedia.org on all recursors [14:30:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install6002.wikimedia.org on all recursors [14:32:38] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage [14:32:44] (03PS1) 10Jbond: idp - standalone: Add local hacks [puppet] - 10https://gerrit.wikimedia.org/r/883580 [14:34:34] (03CR) 10CI reject: [V: 04-1] idp - standalone: Add local hacks [puppet] - 10https://gerrit.wikimedia.org/r/883580 (owner: 10Jbond) [14:35:16] scap takes an eternity... [14:36:29] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) I did a second rclone run on 24 Jan, hoping that entries in that list that weren't in the 23 Jan list would be enlightening. Extracting the copy list as before, then: ` join... 
[14:39:58] !log urbanecm@deploy1002 jakob and sgimeno and urbanecm: Backport for [[gerrit:883224|REST: Use error log level for unexpected errors (T327490)]], [[gerrit:883547|User impact: amend incorrect parameter for the single day streak text (T327824)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:40:03] finally [14:40:04] T327490: Create an easy way to observe/monitor Wikibase REST API errors happening on Wikidata - https://phabricator.wikimedia.org/T327490 [14:40:04] T327824: [wmf-20] testwiki - ext-growthExperiments-ScoreCards__scorecard__info for one day streak shows $3 - https://phabricator.wikimedia.org/T327824 [14:40:13] sergi0_: can you check at mwdebug1001? [14:40:47] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080, an-worker1084, an-worker1086 - https://phabricator.wikimedia.org/T326127 (10BTullis) @Jclark-ctr - sorry to trouble you, but do you know when you might be able to replace the batteries in these three hosts?... [14:41:00] sure [14:42:50] idk what did not go well but I can't see the text change :( [14:44:14] ah, it's an i18n change... [14:44:17] anything specific to check for an i18n change like this one? [14:44:19] ...that i'll need to do a full scap sync [14:44:22] which takes an hour [14:44:39] well, what can i do :) [14:44:48] i'll proceed now, and then do a full scap afterwards to make it take effect [14:44:50] Sorry, should have warned you before [14:44:58] no worries, i should've noticed [14:45:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install6002.wikimedia.org [14:45:28] (03CR) 10Michael Große: [C: 03+1] Enable the Wikibase REST API on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882615 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große) [14:47:40] I only just now noticed that the change still had my -1 from Monday. 
We had the required meeting yesterday where it was green-lit and so it is good to go :) [14:50:40] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4045.ulsfo.wmnet with reason: host reimage [14:52:04] (03PS1) 10Muehlenhoff: Add new install servers [puppet] - 10https://gerrit.wikimedia.org/r/883581 (https://phabricator.wikimedia.org/T327867) [14:53:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4045.ulsfo.wmnet with reason: host reimage [14:53:31] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:883224|REST: Use error log level for unexpected errors (T327490)]], [[gerrit:883547|User impact: amend incorrect parameter for the single day streak text (T327824)]] (duration: 32m 21s) [14:53:36] T327490: Create an easy way to observe/monitor Wikibase REST API errors happening on Wikidata - https://phabricator.wikimedia.org/T327490 [14:53:36] T327824: [wmf-20] testwiki - ext-growthExperiments-ScoreCards__scorecard__info for one day streak shows $3 - https://phabricator.wikimedia.org/T327824 [14:53:36] finally! 
[14:53:37] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/883582 [14:53:39] !log btullis@cumin1001 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [14:53:44] my longest scap backport ever so far [14:54:01] (03PS2) 10Urbanecm: Enable the Wikibase REST API on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882615 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große) [14:54:11] doing the config patch now [14:54:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882615 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große) [14:54:22] * MichaelG_WMDE is here for it :) [14:55:02] (03Merged) 10jenkins-bot: Enable the Wikibase REST API on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882615 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große) [14:55:27] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:882615|Enable the Wikibase REST API on Wikidata (T324999)]] [14:55:32] T324999: configure Wikibase REST API on Wikidata - https://phabricator.wikimedia.org/T324999 [14:57:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4037.ulsfo.wmnet with OS bullseye [14:57:20] !log urbanecm@deploy1002 urbanecm and migr: Backport for [[gerrit:882615|Enable the Wikibase REST API on Wikidata (T324999)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:57:25] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye completed: - cp4037 (**PASS**) - Removed from Puppet and PuppetDB if present -... [14:57:30] MichaelG_WMDE: your patch is at mwdebug1001, can you test? 
[14:57:46] I'll have a look! [14:58:08] working for me! :) [14:58:20] works for me, too! [14:58:29] so, let's sync? [14:58:35] yes \o/ [14:58:41] doing! [14:58:41] yep, let's go [14:59:31] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Configure cloudsw-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) 05Open→03In progress p:05Triage→03Medium [14:59:58] (03PS1) 10Stevemunene: Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) [15:00:19] (03CR) 10CI reject: [V: 04-1] Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [15:01:01] !log Overrunning B&C window [15:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:20] (03PS2) 10Stevemunene: Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) [15:02:17] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet,service=cdn [15:02:17] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet,service=ats-be [15:04:11] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:882615|Enable the Wikibase REST API on Wikidata (T324999)]] (duration: 08m 43s) [15:04:14] (03CR) 10Btullis: "I think that you also need to make similar changes in hieradata/role/common/analytics_cluster/coordinator.yaml for an-coord1001" [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [15:04:15] T324999: configure Wikibase REST API on Wikidata - https://phabricator.wikimedia.org/T324999 [15:04:19] MichaelG_WMDE: and it's live :) [15:04:24] !log urbanecm@deploy1002 Started scap: triggering i18n refresh for T327824 [15:04:24] (03PS2) 10Jbond: idp - 
standalone: Add local hacks [puppet] - 10https://gerrit.wikimedia.org/r/883580 [15:04:27] T327824: [wmf-20] testwiki - ext-growthExperiments-ScoreCards__scorecard__info for one day streak shows $3 - https://phabricator.wikimedia.org/T327824 [15:04:31] doing the i18n rebuild now [15:04:56] yay! [15:05:22] fyi sergi0_ ^^ [15:05:24] (03CR) 10Jbond: [C: 03+2] idp - standalone: Add local hacks [puppet] - 10https://gerrit.wikimedia.org/r/883580 (owner: 10Jbond) [15:05:31] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp - standalone: Add local hacks [puppet] - 10https://gerrit.wikimedia.org/r/883580 (owner: 10Jbond) [15:05:51] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [15:06:09] (03CR) 10Btullis: Tuning Presto Config for scaling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [15:07:12] (03PS2) 10Muehlenhoff: Add new install servers [puppet] - 10https://gerrit.wikimedia.org/r/883581 (https://phabricator.wikimedia.org/T327867) [15:07:25] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2031.codfw.wmnet with OS bullseye [15:07:32] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2031.codfw.wmnet with OS bullseye [15:07:45] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39245/console" [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [15:08:04] thank you urbanecm! 
[15:08:11] no problem [15:08:21] it's quicker than i'd expect it to be so far [15:09:02] 10SRE, 10ops-codfw, 10ops-eqiad, 10Data-Persistence, 10cloud-services-team (Hardware): Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (10RobH) I've disabled the multicast discovery on the ilom interface for db1140 as a test to see if it stops the netbios port broadcasts from the... [15:09:29] (03PS1) 10Muehlenhoff: Rename installserver role [puppet] - 10https://gerrit.wikimedia.org/r/883587 [15:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:10:37] (03CR) 10Stevemunene: Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [15:10:44] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [15:11:10] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [15:11:34] (03CR) 10Stevemunene: Tuning Presto Config for scaling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [15:12:22] !log urbanecm@deploy1002 Finished scap: triggering i18n refresh for T327824 (duration: 07m 57s) [15:12:25] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:26] T327824: [wmf-20] testwiki - ext-growthExperiments-ScoreCards__scorecard__info for one day streak shows $3 - 
https://phabricator.wikimedia.org/T327824 [15:12:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff) [15:12:57] sergi0_: and it's live now. can you test please (in prod)? [15:13:06] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-cluster [15:13:06] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99) [15:13:50] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1003.eqiad.wmnet [15:14:06] doing [15:14:22] all good [15:14:57] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4045.ulsfo.wmnet with OS bullseye [15:14:57] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:15:08] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS bullseye completed: - cp4045 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[15:16:20] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [15:17:28] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4045.ulsfo.wmnet,service=cdn [15:17:28] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4045.ulsfo.wmnet,service=ats-be [15:18:27] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [15:18:33] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:19:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1003.eqiad.wmnet [15:20:07] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:21:25] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye [15:21:34] (03PS1) 10Jgreen: Switch payments.wikimedia.org to codfw [dns] - 10https://gerrit.wikimedia.org/r/883590 [15:21:39] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye [15:23:11] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [15:25:23] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) p:05Triage→03Medium [15:25:39] (03PS2) 10Muehlenhoff: Rename installserver role [puppet] - 10https://gerrit.wikimedia.org/r/883587 [15:25:51] (03CR) 10Jgreen: [C: 03+2] Switch 
payments.wikimedia.org to codfw [dns] - 10https://gerrit.wikimedia.org/r/883590 (owner: 10Jgreen) [15:27:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff) [15:28:10] (03CR) 10Ssingh: [C: 03+1] admin: New SSH key for santhosh [puppet] - 10https://gerrit.wikimedia.org/r/883561 (https://phabricator.wikimedia.org/T327891) (owner: 10Clément Goubert) [15:28:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff) [15:28:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add new SSH key for Santhosh Thottingal for production access - https://phabricator.wikimedia.org/T327891 (10Clement_Goubert) Got out of band confirmation through slack, proceeding. [15:28:56] (03CR) 10Clément Goubert: [C: 03+2] admin: New SSH key for santhosh [puppet] - 10https://gerrit.wikimedia.org/r/883561 (https://phabricator.wikimedia.org/T327891) (owner: 10Clément Goubert) [15:29:06] (03CR) 10Btullis: Tuning Presto Config for scaling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [15:29:50] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1001.eqiad.wmnet [15:30:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add new SSH key for Santhosh Thottingal for production access - https://phabricator.wikimedia.org/T327891 (10Clement_Goubert) The new key has been pushed, please allow for 30 minutes from this post for it to be deployed. Feel free to reopen the task if you ex... 
[15:30:51] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add new SSH key for Santhosh Thottingal for production access - https://phabricator.wikimedia.org/T327891 (10Clement_Goubert) 05In progress→03Resolved [15:33:03] (03PS3) 10Clément Goubert: wmnet: Rename aux-k8s-ingress service to k8s-ingress-aux [dns] - 10https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756) [15:33:25] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2031.codfw.wmnet with OS bullseye [15:33:32] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2031.codfw.wmnet with OS bullseye executed with errors: - cp2031 (**FAIL**) - Downtimed on Icinga/Alertmanager -... [15:33:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2031.codfw.wmnet with OS bullseye [15:33:55] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2031.codfw.wmnet with OS bullseye [15:33:59] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1001.eqiad.wmnet [15:36:17] (03CR) 10Stevemunene: Tuning Presto Config for scaling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [15:36:24] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/883566 (owner: 10Muehlenhoff) [15:37:41] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) (owner: 10Btullis) [15:38:27] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Upgrade cloudsw1-c8-eqiad 
and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10cmooney) Folks I was considering doing these upgrades on the following dates: cloudsw1-c8-eqiad - Monday February... [15:38:52] (03PS1) 10Jelto: sre.gitlab.upgrade: use all=True parameter to disable pagination [cookbooks] - 10https://gerrit.wikimedia.org/r/883593 (https://phabricator.wikimedia.org/T323569) [15:38:56] !log on going maintenance on fasw-c-eqiad [15:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:00] (03CR) 10Btullis: [C: 03+2] Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) (owner: 10Btullis) [15:39:10] (03PS7) 10Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) [15:39:50] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080, an-worker1084, an-worker1086 - https://phabricator.wikimedia.org/T326127 (10Jclark-ctr) Raid batteries are swapped and powering on now. Thank you for your patience an-worker1080.eqiad.wmnet an-worker1084... 
[15:41:08] (03CR) 10Volans: "When merging this change a parallel change to the following cookbooks is also needed:" [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff) [15:41:31] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [15:43:20] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2031.codfw.wmnet with OS bullseye [15:43:26] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2031.codfw.wmnet with OS bullseye executed with errors: - cp2031 (**FAIL**) - Removed from Puppet and PuppetDB if p... [15:44:03] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2031.codfw.wmnet'] [15:44:11] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2031.codfw.wmnet'] [15:45:25] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2031'] [15:45:32] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2031'] [15:45:52] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Configure cloudsw-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) The plan outlined in the task description LGTM. 
[15:46:00] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2031'] [15:46:11] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) [15:46:49] 10SRE, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ayounsi) [15:47:01] (03PS3) 10Stevemunene: Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) [15:47:36] (Emergency syslog message) firing: Alert for device fasw-c-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:47:39] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4038.ulsfo.wmnet with OS bullseye [15:47:45] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors: - cp4038 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[15:48:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye [15:48:13] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye [15:48:35] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:49:24] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39246/console" [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [15:49:25] PROBLEM - Router interfaces on pfw3-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 34, down: 10, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:49:57] PROBLEM - BGP status on pfw3-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Idle - PyBal, AS64600/IPv4: Idle - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:50:29] (03CR) 10Btullis: [C: 03+1] "Looks good. One nit inline about whitespace. If you can resolve that, then feel free to merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [15:52:36] (Emergency syslog message) resolved: Device fasw-c-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:53:05] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) [15:53:28] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:55:55] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) I'll check our db-related hosts and I'll get back to you tomorrow [15:56:11] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts ['cp2031'] [15:56:20] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2031'] [15:56:32] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2031'] [15:57:27] 10SRE, 10ops-codfw, 10ops-eqiad, 10Data-Persistence, 10cloud-services-team (Hardware): Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (10RobH) Arzhel linked to some docs and commented netbios is called wins in HP ilom, and I had noticed the wins enablement under IPv4 so disabled... 
[15:58:28] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/883593 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [15:58:31] RECOVERY - Router interfaces on pfw3-eqiad is OK: OK: host 208.80.154.219, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:58:37] RECOVERY - Host fasw-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [15:59:03] RECOVERY - BGP status on pfw3-eqiad is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:00:16] (03PS1) 10Jbond: motd: add parameter to pass through messages to motd [puppet] - 10https://gerrit.wikimedia.org/r/883595 [16:01:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39247/console" [puppet] - 10https://gerrit.wikimedia.org/r/883595 (owner: 10Jbond) [16:02:48] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10fnegri) I think those dates are fine, cc @dcaro -- let's discuss the best way to reduce impact on Ceph (downtime,... [16:02:58] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (10Papaul) [16:03:08] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (10Papaul) [16:03:40] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (10Papaul) 05Open→03Resolved This is complete. 
[16:03:42] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [16:04:18] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080, an-worker1084, an-worker1086 - https://phabricator.wikimedia.org/T326127 (10BTullis) 05Open→03Resolved Many thanks, all good now. [16:04:30] 10SRE, 10ops-eqiad, 10Data-Engineering: Check BBU on an-worker1080, an-worker1084, and an-worker1086 - https://phabricator.wikimedia.org/T325984 (10BTullis) 05Open→03Resolved [16:04:38] !log btullis@cumin1001 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [16:05:52] (03PS1) 10Jbond: sre-sandbox: Add warning message about reaper [puppet] - 10https://gerrit.wikimedia.org/r/883596 (https://phabricator.wikimedia.org/T247517) [16:05:56] (03PS4) 10Stevemunene: Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) [16:06:48] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (10ayounsi) 05Open→03Resolved All done! [16:06:59] 10SRE, 10Traffic, 10Traffic-Icebox, 10WMF-General-or-Unknown, and 2 others: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10BBlack) With the merge above, I think this issue is at least mitigated for now. It's not... 
[16:07:03] (03CR) 10Btullis: Tuning Presto Config for scaling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [16:08:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage [16:08:51] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1002.eqiad.wmnet [16:09:22] (03CR) 10Jbond: [V: 03+1 C: 03+2] motd: add parameter to pass through messages to motd [puppet] - 10https://gerrit.wikimedia.org/r/883595 (owner: 10Jbond) [16:09:26] (03CR) 10Jbond: [C: 03+2] sre-sandbox: Add warning message about reaper [puppet] - 10https://gerrit.wikimedia.org/r/883596 (https://phabricator.wikimedia.org/T247517) (owner: 10Jbond) [16:09:46] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2031.codfw.wmnet with OS bullseye [16:09:52] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2031.codfw.wmnet with OS bullseye [16:11:17] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage [16:12:28] (03CR) 10Stevemunene: Tuning Presto Config for scaling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [16:14:05] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1002.eqiad.wmnet [16:14:21] (03CR) 10Stevemunene: [C: 03+2] Tuning Presto Config for scaling [puppet] - 10https://gerrit.wikimedia.org/r/883583 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [16:15:30] (03CR) 10Muehlenhoff: [C: 03+2] prospector: Allow longer variable names [cookbooks] - 10https://gerrit.wikimedia.org/r/883566 (owner: 10Muehlenhoff) [16:20:18] (03PS1) 
10Muehlenhoff: perccli: Print human-readable topology information on disk failure [puppet] - 10https://gerrit.wikimedia.org/r/883600 (https://phabricator.wikimedia.org/T325046) [16:20:42] (03PS2) 10Muehlenhoff: perccli: Print human-readable topology information on disk failure [puppet] - 10https://gerrit.wikimedia.org/r/883600 (https://phabricator.wikimedia.org/T325046) [16:20:57] (03PS4) 10Muehlenhoff: Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) [16:24:08] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6010.drmrs.wmnet with OS bullseye [16:24:16] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6010.drmrs.wmnet with OS bullseye [16:28:09] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2031.codfw.wmnet with reason: host reimage [16:31:16] (03CR) 10Marostegui: [C: 03+1] perccli: Print human-readable topology information on disk failure [puppet] - 10https://gerrit.wikimedia.org/r/883600 (https://phabricator.wikimedia.org/T325046) (owner: 10Muehlenhoff) [16:32:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2031.codfw.wmnet with reason: host reimage [16:32:33] PROBLEM - IPMI Sensor Status on an-worker1080 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:32:52] 10SRE, 10Infrastructure-Foundations, 10netops: Is Vlan 2122 cloud-support1-b-codfw required? 
- https://phabricator.wikimedia.org/T327930 (10cmooney) p:05Triage→03Low [16:33:16] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4038.ulsfo.wmnet with OS bullseye [16:33:22] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye completed: - cp4038 (**PASS**) - Removed from Puppet and PuppetDB if present -... [16:34:46] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4038.ulsfo.wmnet,service=cdn [16:34:46] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4038.ulsfo.wmnet,service=ats-be [16:35:01] (03PS1) 10Jbond: motd: allow coloured messages and use red for sre-sandbox [puppet] - 10https://gerrit.wikimedia.org/r/883602 [16:36:09] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: add recording rule for req success ratio (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [16:36:58] (03PS1) 10Ryan Kemper: Revert "Revert "wdqs: add recording rule for req success ratio"" [puppet] - 10https://gerrit.wikimedia.org/r/883610 [16:37:15] (03CR) 10Ottomata: [C: 03+1] dse-k8s: add rdf-streaming-updater namespace [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking) [16:37:26] (03PS2) 10Ryan Kemper: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) [16:37:43] (03PS2) 10Jbond: motd: allow coloured messages and use red for sre-sandbox [puppet] - 10https://gerrit.wikimedia.org/r/883602 [16:38:39] 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - 
https://phabricator.wikimedia.org/T266155 (10PatchDemoBot) Test wiki **created** on [[ https://patchdemo.wmflabs.org | Patch demo ]]... [16:38:45] (03PS3) 10Jbond: motd: allow coloured messages and use red for sre-sandbox [puppet] - 10https://gerrit.wikimedia.org/r/883602 (https://phabricator.wikimedia.org/T247517) [16:38:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39249/console" [puppet] - 10https://gerrit.wikimedia.org/r/883602 (https://phabricator.wikimedia.org/T247517) (owner: 10Jbond) [16:39:09] (03PS3) 10Ryan Kemper: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) [16:39:12] (03CR) 10Jbond: [C: 03+2] motd: allow coloured messages and use red for sre-sandbox [puppet] - 10https://gerrit.wikimedia.org/r/883602 (https://phabricator.wikimedia.org/T247517) (owner: 10Jbond) [16:40:30] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10Volans) This task was brought to my attention by @ssingh today because `cp4037` did the same. It was reimaged first around `12:15` and it failed, and... [16:41:23] (03CR) 10CI reject: [V: 04-1] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [16:41:34] (03CR) 10Dzahn: "thanks! 
Noted that "mediawiki-testers" has been removed, but seems like that is already gone here too" [puppet] - 10https://gerrit.wikimedia.org/r/883500 (owner: 10Muehlenhoff) [16:41:57] 10SRE, 10ops-codfw, 10ops-eqiad, 10Data-Persistence, 10cloud-services-team (Hardware): Disable NETBIOS on some IPMI - https://phabricator.wikimedia.org/T327877 (10RobH) 05Open→03Resolved a:03RobH [16:43:18] (03PS1) 10Jbond: motd: use colored message [puppet] - 10https://gerrit.wikimedia.org/r/883604 (https://phabricator.wikimedia.org/T247517) [16:43:18] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6010.drmrs.wmnet with reason: host reimage [16:44:15] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Dzahn) Thanks @Clement_Goubert and @Muehlenhoff for follow-ups. [16:44:19] (03CR) 10RLazarus: wdqs: add recording rule for req success ratio (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [16:44:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39250/console" [puppet] - 10https://gerrit.wikimedia.org/r/883604 (https://phabricator.wikimedia.org/T247517) (owner: 10Jbond) [16:44:43] (03CR) 10Jbond: [V: 03+1 C: 03+2] motd: use colored message [puppet] - 10https://gerrit.wikimedia.org/r/883604 (https://phabricator.wikimedia.org/T247517) (owner: 10Jbond) [16:46:21] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6010.drmrs.wmnet with reason: host reimage [16:46:21] (03PS1) 10Hnowlan: fluent-bit: install wmf-certificates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883605 [16:46:47] (03CR) 10Dzahn: [C: 03+2] "thank you, Manuel" [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) 
(owner: 10Dzahn) [16:48:51] (03CR) 10Dzahn: [C: 03+2] "on our ticket we also had "archive the database" but is that a thing? I am not sure we actually drop the DB or what it entails. Probably i" [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [16:49:40] 10SRE, 10Cloud-VPS (Project-requests), 10Patch-For-Review, 10cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (10jbond) >>! In T247517#8533090, @herron wrote: >>>! In T247517#8211187, @jbond wrote: >> * did the emails informing @herro... [16:50:56] !log btullis@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka jumbo-eqiad cluster: Reboot kafka nodes [16:51:40] (03CR) 10Marostegui: [C: 03+1] mariadb: remove grants and settings for racktables db (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [16:51:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2031.codfw.wmnet with OS bullseye [16:51:55] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2031.codfw.wmnet with OS bullseye completed: - cp2031 (**PASS**) - Removed from Puppet and PuppetDB if present -... [16:54:07] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10ssingh) Thanks for the response @Volans! >>! In T327812#8558090, @Volans wrote: > This task was brought to my attention by @ssingh today because `cp... [16:54:28] (03CR) 10Dzahn: [C: 03+2] "sounds good to me. 
backed up = archived to me :) let's do that" [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [16:57:19] (03CR) 10Elukey: [C: 03+2] profile::pki::root_ca: add new intermediates for liftwing [puppet] - 10https://gerrit.wikimedia.org/r/883167 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [16:57:30] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=cdn [16:57:31] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=ats-be [16:58:03] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10Volans) >>! In T327812#8558155, @ssingh wrote: > I am surprised, so the above output is for cp4037? Because we certainly didn't reboot it and in any... [16:58:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [16:59:03] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [17:04:42] (03CR) 10Andrew Bogott: [C: 03+1] "late vote -- thanks for adding this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/883596 (https://phabricator.wikimedia.org/T247517) (owner: 10Jbond) [17:05:15] (03CR) 10Hashar: "I have missed John last notifications, we chatted a bit today and aim at deploying this patch on Thursday" [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [17:06:52] (03PS1) 10Btullis: Reduce the presto task concurrency from 48 to 32 [puppet] - 10https://gerrit.wikimedia.org/r/883628 (https://phabricator.wikimedia.org/T323783) [17:07:18] 10SRE, 10Infrastructure-Foundations, 10netops: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (10aborrero) I guess this was set up to mirror the eqiad setting. Since this VLAN as no room in the new network model (described [[ https://wikitech.wikimedia.org/wiki/Wiki... [17:07:33] (03CR) 10Btullis: [C: 03+2] Reduce the presto task concurrency from 48 to 32 [puppet] - 10https://gerrit.wikimedia.org/r/883628 (https://phabricator.wikimedia.org/T323783) (owner: 10Btullis) [17:07:58] 10SRE, 10Infrastructure-Foundations, 10netops: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (10aborrero) [17:08:21] 10SRE, 10Infrastructure-Foundations, 10netops: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (10cmooney) p:05Triage→03Medium [17:09:27] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10aborrero) >>! In T316544#8557796, @cmooney wrote: > Folks I was considering doing these upgrades on the following... 
[17:10:35] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) [17:13:33] (03PS1) 10Elukey: pki: Add public certificates for mlserve clusters' intermediates [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) [17:14:14] (03CR) 10Elukey: "Already committed the key pem files to the private repo :)" [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [17:15:11] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:30] (03PS1) 10Elukey: Add new fake pems for the mlserve's pki intermediates [labs/private] - 10https://gerrit.wikimedia.org/r/883632 (https://phabricator.wikimedia.org/T327767) [17:18:59] (03PS2) 10Elukey: pki: Add public certs and config for mlserve clusters' intermediates [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) [17:20:17] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:20:29] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:20:29] (03PS1) 10Dzahn: remove racktables.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/883634 (https://phabricator.wikimedia.org/T327405) [17:23:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [17:24:17] (03PS1) 10Jgreen: Switch payments.wikimedia.org back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/883635 [17:24:32] (03PS1) 10Hnowlan: api-gateway: remove puppet_ca_crt references [deployment-charts] - 10https://gerrit.wikimedia.org/r/883636 [17:26:25] (03CR) 10Herron: "following up from IRC: " herron: taking a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/883223/2/modules/profil" [puppet] - 10https://gerrit.wikimedia.org/r/883223 (owner: 10Filippo Giunchedi) [17:26:58] (03PS4) 10Herron: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [17:28:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [17:29:01] (03CR) 10Herron: "can confirm /usr/bin/thanos tools rules-check --rules recording_rules.yaml passes with these updated recording rules (no longer throws "ma" [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [17:29:07] (03CR) 10CI reject: [V: 04-1] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [17:29:55] (03PS2) 10Dzahn: remove racktables.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/883634 (https://phabricator.wikimedia.org/T327405) [17:30:18] (03CR) 10Dzahn: [C: 03+2] remove racktables.wikimedia.org [dns] - 
10https://gerrit.wikimedia.org/r/883634 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [17:32:37] !log removing racktables.wikimedia.org from DNS - that's it for this ancient service T327405 [17:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:42] T327405: Decommission Racktables - https://phabricator.wikimedia.org/T327405 [17:32:53] (03CR) 10Alexandros Kosiaris: [C: 04-1] fluent-bit: install wmf-certificates (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883605 (owner: 10Hnowlan) [17:34:45] (03PS2) 10Hnowlan: fluent-bit: install wmf-certificates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883605 [17:35:41] (03CR) 10Jgreen: [C: 03+2] Switch payments.wikimedia.org back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/883635 (owner: 10Jgreen) [17:35:51] (03PS2) 10Jgreen: Switch payments.wikimedia.org back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/883635 [17:36:02] (03CR) 10Jgreen: [V: 03+2] Switch payments.wikimedia.org back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/883635 (owner: 10Jgreen) [17:41:04] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10ssingh) I just went through the logs now: ` Timestamp = 2023-01-25 14:07:50 Message = The server power action is initiated because the... 
[17:43:16] (03PS1) 10Btullis: Increase the presto cluster size to 15 hosts again [puppet] - 10https://gerrit.wikimedia.org/r/883642 (https://phabricator.wikimedia.org/T323783) [17:45:05] (03PS2) 10Btullis: Increase the presto cluster size to 15 hosts again [puppet] - 10https://gerrit.wikimedia.org/r/883642 (https://phabricator.wikimedia.org/T323783) [17:45:46] (03CR) 10Btullis: [C: 03+2] Increase the presto cluster size to 15 hosts again [puppet] - 10https://gerrit.wikimedia.org/r/883642 (https://phabricator.wikimedia.org/T323783) (owner: 10Btullis) [17:47:31] RECOVERY - Check systemd state on an-presto1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:53] RECOVERY - Check systemd state on an-presto1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:55] RECOVERY - Check systemd state on an-presto1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:57] RECOVERY - Check systemd state on an-presto1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:25] RECOVERY - Check systemd state on an-presto1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:27] RECOVERY - Check systemd state on an-presto1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:31] RECOVERY - Check systemd state on an-presto1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [17:48:49] RECOVERY - Check systemd state on an-presto1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:53:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [17:55:07] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) p:05Triage→03Medium [17:56:05] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [17:58:39] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [17:58:42] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6010.drmrs.wmnet with OS bullseye [17:58:48] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6010.drmrs.wmnet with OS bullseye completed: - cp6010 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T1800) [18:00:26] 10SRE, 10Infrastructure-Foundations, 10netops: Is Vlan 2122 cloud-support1-b-codfw required? 
- https://phabricator.wikimedia.org/T327930 (10cmooney) Thanks for the feedback @aborrero. I'll plan on getting it decommissioned. [18:05:55] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6010.drmrs.wmnet [18:07:05] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [18:10:00] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [18:10:59] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [18:11:01] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [18:11:09] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [18:11:19] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [18:12:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [18:14:33] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [18:14:54] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6002.drmrs.wmnet with OS bullseye [18:15:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6002.drmrs.wmnet with OS bullseye [18:17:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [18:25:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:32:37] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [18:33:45] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [18:33:59] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [18:34:35] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6002.drmrs.wmnet with reason: host reimage [18:35:17] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [18:37:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [18:37:39] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6002.drmrs.wmnet with reason: host reimage [18:42:11] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10RKemper) [18:42:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [18:45:14] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10RKemper) [18:48:01] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:50:37] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10LSobanski) [18:59:07] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6002.drmrs.wmnet with OS bullseye [19:00:05] brennen and jnuche: How many deployers does it take to do Train log triage with CPT deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T1900). [19:00:05] brennen and jnuche: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T1900). [19:00:17] !log denisse@cumin1001 START - Cookbook sre.hosts.reimage for host centrallog1002.eqiad.wmnet with OS bullseye [19:00:31] !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host centrallog1002.eqiad.wmnet with OS bullseye [19:00:34] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6002.drmrs.wmnet with OS bullseye completed: - cp6002 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[19:01:11] o/ [19:01:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [19:01:48] !log 1.40.0-wmf.20 train (T325583): no blockers, rolling to group1. [19:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:52] T325583: 1.40.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T325583 [19:02:06] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883647 (https://phabricator.wikimedia.org/T325583) [19:02:08] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883647 (https://phabricator.wikimedia.org/T325583) (owner: 10TrainBranchBot) [19:02:37] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10cmooney) >>! In T316544#8558224, @aborrero wrote: >>>! In T316544#8557796, @cmooney wrote: >> Folks I was consider... 
[19:02:46] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883647 (https://phabricator.wikimedia.org/T325583) (owner: 10TrainBranchBot) [19:02:48] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10cmooney) [19:04:39] RECOVERY - IPMI Sensor Status on an-worker1080 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:06:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [19:06:46] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6002.drmrs.wmnet [19:07:17] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:09:33] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:10:00] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.20 refs T325583 [19:10:04] T325583: 1.40.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T325583 [19:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [19:10:23] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:12:38] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6011.drmrs.wmnet with OS bullseye [19:12:45] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6011.drmrs.wmnet with OS bullseye [19:13:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:14:53] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:16:01] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:17:04] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.20 refs T325583 (duration: 07m 04s) [19:17:08] T325583: 1.40.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T325583 [19:17:09] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:29] !log denisse@cumin1001 START - Cookbook sre.hosts.reimage for host centrallog1002.eqiad.wmnet with OS bullseye [19:22:27] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 
message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:25:39] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:26:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [19:29:47] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:31:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [19:33:08] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage [19:33:58] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on centrallog1002.eqiad.wmnet with reason: host reimage [19:34:50] (03PS1) 10Jdrewniak: Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883616 (https://phabricator.wikimedia.org/T327714) [19:35:28] (03PS1) 10Jdrewniak: Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed [skins/Vector] (wmf/1.40.0-wmf.20) - 
10https://gerrit.wikimedia.org/r/883617 (https://phabricator.wikimedia.org/T327714) [19:36:10] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage [19:38:36] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on centrallog1002.eqiad.wmnet with reason: host reimage [19:39:42] (03Abandoned) 10Dzahn: phabricator: set enable_vcs to false in main profile [puppet] - 10https://gerrit.wikimedia.org/r/864852 (owner: 10Dzahn) [19:44:28] (03CR) 10Jdrewniak: "This change is ready for review." [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883619 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak) [19:44:59] (03CR) 10Jdrewniak: "This change is ready for review." [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883618 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak) [19:50:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [19:52:57] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host centrallog1002.eqiad.wmnet with OS bullseye [19:55:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [19:58:14] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6011.drmrs.wmnet with OS bullseye [19:58:18] 10SRE, 10Traffic: Upgrade Traffic hosts to 
bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6011.drmrs.wmnet with OS bullseye completed: - cp6011 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [19:58:54] (03PS1) 10Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) [19:59:37] (03PS2) 10Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) [20:00:38] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:00:47] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6011.drmrs.wmnet [20:03:01] (03PS3) 10Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) [20:04:58] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:05:47] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:08:29] (03PS9) 10Andrea Denisse: centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - 10https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) [20:10:21] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6003.drmrs.wmnet with OS bullseye [20:10:27] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6003.drmrs.wmnet with OS bullseye [20:10:34] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39253/console" [puppet] - 10https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [20:11:44] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - 10https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [20:13:22] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:35] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [20:16:30] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:20:35] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [20:20:52] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:22:44] (03CR) 10Andrea Denisse: [C: 03+2] centrallog1002: Add to eqiad anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/882724 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [20:23:00] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39254/console" [puppet] - 10https://gerrit.wikimedia.org/r/882747 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [20:23:10] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] centrallog1002: Add to eqiad anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/882724 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [20:23:14] (03CR) 10Andrea Denisse: [V: 03+2 C: 03+2] centrallog1002: Add to eqiad anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/882724 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [20:26:13] (03CR) 10Andrea 
Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/882747/39254/" [puppet] - 10https://gerrit.wikimedia.org/r/882747 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [20:29:44] PROBLEM - Check systemd state on centrallog1002 is CRITICAL: CRITICAL - degraded: The following units failed: kafkatee.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:29:59] 10SRE, 10Sustainability (Incident Followup): Alert on Kask error rate - https://phabricator.wikimedia.org/T320401 (10BCornwall) This alerting would have been helpful for another [[ https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues | a recent incident ]] of the same nature. [20:30:29] (03PS2) 10Andrea Denisse: centrallog: Sync centrallog1001 to centrallog1002 [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) [20:30:35] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6003.drmrs.wmnet with reason: host reimage [20:32:01] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39255/console" [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [20:32:45] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka jumbo-eqiad cluster: Reboot kafka nodes [20:33:21] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/881813 (owner: 10Cwhite) [20:33:46] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6003.drmrs.wmnet with reason: host reimage [20:34:08] (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/882760/39255/" [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [20:35:35] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39256/console" [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [20:47:14] (03CR) 10Andrea Denisse: [V: 03+1] "This is the last CR for the failover." [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [20:49:01] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2028.codfw.wmnet [20:49:08] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp2028.codfw.wmnet [20:49:24] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2028.codfw.wmnet [20:49:29] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp2028.codfw.wmnet [20:49:58] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2028.codfw.wmnet [20:50:07] I can deploy in 10 minutes, and will be starting the merge of some of the larger patches now in preparation [20:50:08] 10SRE, 10Sustainability (Incident Followup): sessionstore: alert on rate of 500 status - https://phabricator.wikimedia.org/T327960 (10Eevans) [20:50:24] 10SRE, 10Sustainability (Incident Followup): sessionstore: alert on rate of 500 status - 
https://phabricator.wikimedia.org/T327960 (10Eevans) p:05Triage→03Medium [20:50:31] 10SRE, 10Sustainability (Incident Followup): Alert on Kask error rate - https://phabricator.wikimedia.org/T320401 (10Eevans) p:05Triage→03Medium [20:50:48] (03CR) 10Samtar: [C: 03+2] "start merge for deploy" [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883619 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak) [20:50:56] (03CR) 10Samtar: [C: 03+2] "start merge for deploy" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883618 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak) [20:51:02] 10SRE, 10Sustainability (Incident Followup): sessionstore: alert on rate of status 500 responses - https://phabricator.wikimedia.org/T327960 (10Eevans) [20:51:20] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:54:56] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/883552 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert) [20:56:09] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp2028.codfw.wmnet [20:56:45] (03PS2) 10Jdrewniak: Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883616 (https://phabricator.wikimedia.org/T327714) [20:57:07] (03PS2) 10Jdrewniak: Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883617 (https://phabricator.wikimedia.org/T327714) [20:58:13] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2028.codfw.wmnet [20:58:48] 
(03PS3) 10Bking: flink-kubernetes-operator: bump version to 1.3.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881907 (https://phabricator.wikimedia.org/T324576) [20:59:22] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts cp2028.codfw.wmnet [20:59:24] (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator: bump version to 1.3.1 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881907 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [20:59:27] (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink-kubernetes-operator: bump version to 1.3.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881907 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [20:59:31] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6003.drmrs.wmnet with OS bullseye [20:59:32] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2028.codfw.wmnet [20:59:42] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6003.drmrs.wmnet with OS bullseye completed: - cp6003 (**WARN**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230125T2100). [21:00:05] Jan Drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:09] I can deploy! (and have already started the merge of 883619 and 883618) [21:00:22] Thanks TheresNoTime :) [21:01:08] TheresNoTime: thanks! 
I made a last-minute change to the backports, now I only need 883616 and 883617 [21:01:31] TheresNoTime: ah! let's try to cancel that merge! [21:01:44] (03CR) 10Jdrewniak: [C: 04-2] Account for temporary row in grid template row [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883619 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak) [21:01:47] jan_drewniak: damn [21:01:58] my bad, sorry [21:02:00] (03CR) 10Jdrewniak: [C: 04-2] Account for temporary row in grid template row [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883618 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak) [21:02:13] TheresNoTime: I think a -2 should do it [21:02:27] (03CR) 10Ottomata: flink-operator: bump version to 1.3.1 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [21:02:31] (03PS10) 10Ottomata: flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [21:03:49] TheresNoTime: sorry about that! these are really small changes so I merged them into one patch. 
[21:04:00] (03CR) 10Samtar: [C: 03+2] "deploy" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883616 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak) [21:04:08] (03CR) 10Samtar: [C: 03+2] "deploy" [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883617 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak) [21:05:05] jan_drewniak: no worries :) I think that -2 will do it, but if not I'll revert :D [21:05:46] (03CR) 10Ottomata: flink-operator: bump version to 1.3.1 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [21:06:04] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp2028.codfw.wmnet [21:07:11] (03PS1) 10Bking: flink-operator: remove unnecessary newline [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883657 (https://phabricator.wikimedia.org/T324576) [21:07:40] (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink-operator: remove unnecessary newline [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883657 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [21:08:02] 10SRE, 10Sustainability (Incident Followup): Alert on Kask error rate - https://phabricator.wikimedia.org/T320401 (10Eevans) I opened T327960 as well, it covers alerting based on the rate of HTTP status 500 responses. It is (currently) the case that every status 500 will //also// emit an error log, so it woul... 
[21:08:52] (03CR) 10Ottomata: [C: 03+1] flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [21:09:24] (03CR) 10Ottomata: [C: 03+2] flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [21:09:38] (03CR) 10Ottomata: flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [21:13:44] (03PS11) 10Bking: flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) [21:17:48] (03CR) 10Ottomata: [C: 03+2] flink-operator: bump version to 1.3.1 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [21:21:50] (03PS15) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [21:23:06] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:24:43] !log samtar@deploy1002 Started scap: Backport for [[gerrit:883617|Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed (T327714)]], [[gerrit:883616|Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed (T327714)]] [21:24:47] T327714: Unexpected whitespace at the 
top of stub (short) articles in Vector 2022 - https://phabricator.wikimedia.org/T327714 [21:24:48] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [21:25:09] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [21:25:51] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39257/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [21:26:27] !log samtar@deploy1002 jdrewniak and samtar: Backport for [[gerrit:883617|Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed (T327714)]], [[gerrit:883616|Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed (T327714)]] synced to the testservers: mwdebug2002.cod [21:26:27] fw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:26:43] jan_drewniak: those two are live on mwdebug, can you test? [21:27:37] TheresNoTime: perfect, looks good to sync! [21:27:42] ack [21:34:10] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:883617|Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed (T327714)]], [[gerrit:883616|Define grid template row for .mw-body grid container to ensure the grid cell containing the content will expand in height when needed (T327714)]] (duration: 09m 27s) [21:34:15] T327714: Unexpected whitespace at the top of stub (short) articles in Vector 2022 - https://phabricator.wikimedia.org/T327714 [21:34:22] that's now live :) [21:34:57] TheresNoTime: awesome! thanks! 
[21:36:09] (03PS1) 10Ottomata: flink-app-example - set upgradeMode: stateless [deployment-charts] - 10https://gerrit.wikimedia.org/r/883660 (https://phabricator.wikimedia.org/T316519) [21:37:33] (03CR) 10Bking: [C: 03+1] flink-app-example - set upgradeMode: stateless [deployment-charts] - 10https://gerrit.wikimedia.org/r/883660 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [21:41:33] 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10Dzahn) pinged on Slack [21:42:00] (03PS16) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [21:42:14] (03CR) 10Ottomata: [C: 03+2] flink-app-example - set upgradeMode: stateless [deployment-charts] - 10https://gerrit.wikimedia.org/r/883660 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [21:43:27] 10SRE, 10Wikimedia-Mailing-lists: Archive wikifr-l Mailing list - https://phabricator.wikimedia.org/T320312 (10Dzahn) 05Open→03Resolved a:03Dzahn [21:43:44] 10SRE, 10Wikimedia-Mailing-lists: Archive wikifr-l Mailing list - https://phabricator.wikimedia.org/T320312 (10Dzahn) a:05Dzahn→03Ladsgroup [21:44:18] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/services/flink-app-example: apply [21:44:32] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/flink-app-example: apply [21:45:39] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39258/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [21:49:04] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/services/flink-app-example: apply [21:49:09] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/flink-app-example: apply 
[21:53:24] (03CR) 10Herron: [C: 03+1] "LGTM although please see commit msg nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/882747 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [21:54:56] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:03:52] (03PS3) 10Andrea Denisse: centrallog: Add centrallog1002 to the kafka-jumbo allow list [puppet] - 10https://gerrit.wikimedia.org/r/882747 (https://phabricator.wikimedia.org/T318778) [22:04:50] (03CR) 10Andrea Denisse: [C: 03+2] centrallog: Add centrallog1002 to the kafka-jumbo allow list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882747 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [22:05:26] (03CR) 10Andrea Denisse: [C: 03+2] centrallog: Add centrallog1002 to the kafka-jumbo allow list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882747 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [22:07:31] (03CR) 10Herron: centrallog: Sync centrallog1001 to centrallog1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [22:10:17] (03CR) 10Herron: rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [22:11:54] (03PS3) 10Andrea Denisse: centrallog: Sync centrallog1001 to centrallog1002 [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) [22:13:13] (03CR) 10Andrea Denisse: centrallog: Sync centrallog1001 to centrallog1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [22:13:47] (03CR) 10Andrea Denisse: centrallog: Sync centrallog1001 to centrallog1002 (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [22:14:50] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6003.drmrs.wmnet [22:14:50] (03CR) 10Herron: [C: 03+1] "LGTM 🪵" [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [22:17:25] (03PS3) 10Andrea Denisse: rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) [22:18:43] (03CR) 10Andrea Denisse: rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [22:19:07] (03CR) 10Andrea Denisse: [C: 03+2] centrallog: Sync centrallog1001 to centrallog1002 [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [22:20:30] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Peachey88) [22:20:50] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [22:21:21] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6012.drmrs.wmnet with OS bullseye [22:21:27] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6012.drmrs.wmnet with OS bullseye [22:22:30] (03PS4) 10Andrea Denisse: rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) [22:24:19] (03CR) 10Andrea Denisse: "It makes more sense to me to add a destination first and perform the failover of centrallog1001 
-> centrallog1002 in another patch aft" [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [22:25:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:26:33] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [22:29:08] (03PS1) 10Jdlrobson: Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883668 [22:30:18] (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15) [22:30:55] (03CR) 10Urbanecm: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15) [22:31:02] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15) [22:31:30] (03CR) 10Jdrewniak: [C: 03+2] Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883668 (owner: 10Jdlrobson) [22:32:13] (03Merged) 10jenkins-bot: Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883668 (owner: 10Jdlrobson) [22:32:42] (03CR) 10Zabe: Enable ResourceLoader client preferences on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883668 (owner: 10Jdlrobson) [22:33:28] (03PS2) 10Superpes15: Create additional namespaces on shn.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) [22:34:02] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15) [22:34:37] (03CR) 10CI reject: [V: 04-1] Create additional namespaces on shn.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15) [22:36:29] 10SRE, 10Sustainability (Incident Followup): Alert on Kask error rate - https://phabricator.wikimedia.org/T320401 (10BCornwall) If the result of any errors in Kask is guaranteed to manifest as a 500 but not the other way around, I agree with monitoring only the status code. 
[22:40:13] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage [22:43:16] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage [22:43:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:48:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:49:25] (03PS3) 10Superpes15: Create additional namespaces on shn.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) [22:51:26] jan_drewniak, ^ that config patch also enables resourceloader client preferences on prod (unlike what the commit message says), was that intended? 
[23:03:46] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:04:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:04:52] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15) [23:07:20] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6012.drmrs.wmnet with OS bullseye [23:07:26] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6012.drmrs.wmnet with OS bullseye completed: - cp6012 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [23:07:30] (03CR) 10Urbanecm: [C: 03+1] "selected namespace IDs are free (https://shn.wikibooks.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces), LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) (owner: 10Superpes15) [23:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:10:30] RECOVERY - Check systemd state on mw2293 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:31] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6012.drmrs.wmnet [23:14:32] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [23:14:56] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6004.drmrs.wmnet with OS bullseye [23:15:03] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6004.drmrs.wmnet with OS bullseye [23:18:44] (03PS1) 10Zabe: Revert "Enable ResourceLoader client preferences on beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883676 [23:18:59] (03CR) 10Zabe: [C: 03+2] Revert "Enable ResourceLoader client preferences on beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883676 (owner: 10Zabe) [23:19:47] (03Merged) 10jenkins-bot: Revert "Enable ResourceLoader client preferences on beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883676 (owner: 10Zabe) [23:20:08] !log zabe@deploy1002 Backport cancelled. 
[23:21:33] !log zabe@deploy1002 Started scap: (no justification provided) [23:29:07] !log zabe@deploy1002 Finished scap: (no justification provided) (duration: 07m 34s) [23:31:40] (03CR) 10Zabe: Enable ResourceLoader client preferences on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883668 (owner: 10Jdlrobson) [23:33:18] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6004.drmrs.wmnet with reason: host reimage [23:36:20] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6004.drmrs.wmnet with reason: host reimage [23:37:08] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.47 ms [23:37:49] (03PS1) 10Ebernhardson: Create scap deployment source for search airflow v2 [puppet] - 10https://gerrit.wikimedia.org/r/883678 (https://phabricator.wikimedia.org/T327970) [23:46:56] (03PS1) 10Ebernhardson: Configure search platform airflow 2 instance [puppet] - 10https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) [23:47:16] (03CR) 10CI reject: [V: 04-1] Configure search platform airflow 2 instance [puppet] - 10https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson) [23:54:34] zabe: apologies! you're right that patch affected prod when it shouldn't have :( [23:57:09] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6004.drmrs.wmnet with OS bullseye [23:57:16] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6004.drmrs.wmnet with OS bullseye completed: - cp6004 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[23:57:39] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6004.drmrs.wmnet [23:58:20] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)