[00:22:01] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:22:10] (PS9) Eevans: Configure new cassandra-dev cluster [puppet] - https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113)
[00:22:53] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:24:21] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) (owner: Eevans)
[00:26:31] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:29:45] (CR) Btullis: [V: +1 C: +2] Add two new cache configuration parameters for superset 1.5.2 [puppet] - https://gerrit.wikimedia.org/r/867291 (https://phabricator.wikimedia.org/T323458) (owner: Btullis)
[00:33:21] RECOVERY - ElasticSearch unassigned shard check - 9200 on relforge1003 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:33:49] RECOVERY - ElasticSearch unassigned shard check - 9200 on relforge1004 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:34:45] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[00:36:56] (CR) Eevans: "PCC output: https://puppet-compiler.wmflabs.org/output/866640/1500/" [puppet] - https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) (owner: Eevans)
[00:38:13] RECOVERY - ElasticSearch unassigned shard check - 9400 on relforge1003 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:38:13] RECOVERY - ElasticSearch unassigned shard check - 9400 on relforge1004 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:39:16] (PS1) Btullis: Update the superset config to identify the cache types correctly [puppet] - https://gerrit.wikimedia.org/r/867297 (https://phabricator.wikimedia.org/T323458)
[00:40:46] (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38727/console" [puppet] - https://gerrit.wikimedia.org/r/867297 (https://phabricator.wikimedia.org/T323458) (owner: Btullis)
[00:41:42] (CR) Btullis: [V: +1 C: +2] Update the superset config to identify the cache types correctly [puppet] - https://gerrit.wikimedia.org/r/867297 (https://phabricator.wikimedia.org/T323458) (owner: Btullis)
[00:45:35] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:02:51] (PS1) Jdlrobson: Child elements also trigger previews [extensions/Popups] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/867233 (https://phabricator.wikimedia.org/T325007)
[01:41:46] (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:56:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:13:00] (CR) RLazarus: "The new layout looks great!" 
[grafana-grizzly] - https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: Herron)
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T0300)
[03:07:47] (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.14 [core] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/866820 (https://phabricator.wikimedia.org/T320519)
[03:07:49] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.14 [core] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/866820 (https://phabricator.wikimedia.org/T320519) (owner: TrainBranchBot)
[03:08:22] (PS1) PleaseStand: Remove obsolete setting $wgAutoloadAttemptLowercase [mediawiki-config] - https://gerrit.wikimedia.org/r/867307 (https://phabricator.wikimedia.org/T231412)
[03:12:11] (PS2) PleaseStand: Remove obsolete setting $wgAutoloadAttemptLowercase [mediawiki-config] - https://gerrit.wikimedia.org/r/867307 (https://phabricator.wikimedia.org/T231412)
[03:24:52] (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.14 [core] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/866820 (https://phabricator.wikimedia.org/T320519) (owner: TrainBranchBot)
[03:41:30] (PS2) KartikMistry: Enable Section Translation in Chuvash Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/867002 (https://phabricator.wikimedia.org/T319176)
[04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T0400)
[04:01:23] (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/867308 (https://phabricator.wikimedia.org/T320519)
[04:01:25] (CR) TrainBranchBot: [C: +2] testwikis wikis to 
1.40.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/867308 (https://phabricator.wikimedia.org/T320519) (owner: TrainBranchBot)
[04:02:01] (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/867308 (https://phabricator.wikimedia.org/T320519) (owner: TrainBranchBot)
[04:02:30] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.14 refs T320519
[04:02:34] T320519: 1.40.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T320519
[04:15:43] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[04:39:59] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 201 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:41:47] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:54:41] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.14 refs T320519 (duration: 52m 11s)
[04:54:45] T320519: 1.40.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T320519
[04:56:58] !log mwpresync@deploy1002 Pruned MediaWiki: 1.40.0-wmf.12 (duration: 02m 15s)
[05:11:45] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:15:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:16:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:16:19] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled ht
[05:16:19] kitech.wikimedia.org/wiki/PyBal
[05:16:47] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2001.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:17:31] morning
[05:17:34] I'm here
[05:18:09] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:18:37] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:20:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:21:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:25:01] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:26:51] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:47:57] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:57:35] (PS1) DLynch: Complete deployment of DiscussionTools reply visual enhancements [mediawiki-config] - https://gerrit.wikimedia.org/r/867311 (https://phabricator.wikimedia.org/T321955)
[06:00:37] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:15:09] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:22:01] (JobUnavailable) firing: Reduced availability for 
job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:28:46] (PS3) KartikMistry: Update cxserver to 2022-12-06-121330-production [deployment-charts] - https://gerrit.wikimedia.org/r/865063 (https://phabricator.wikimedia.org/T321781)
[06:43:20] (CR) KartikMistry: [C: +2] Update cxserver to 2022-12-06-121330-production [deployment-charts] - https://gerrit.wikimedia.org/r/865063 (https://phabricator.wikimedia.org/T321781) (owner: KartikMistry)
[06:43:22] SRE, SRE-Access-Requests, Data-Engineering-Planning, LDAP-Access-Requests, WMF-Communications: Grant Access to staff LDAP group for Sbenchagra - https://phabricator.wikimedia.org/T324696 (RhinosF1) >>! In T324696#8462191, @Varnent wrote: > @jhathaway - Apologies - have added links to that tem...
[06:47:45] (Merged) jenkins-bot: Update cxserver to 2022-12-06-121330-production [deployment-charts] - https://gerrit.wikimedia.org/r/865063 (https://phabricator.wikimedia.org/T321781) (owner: KartikMistry)
[06:50:45] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:51:16] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:53:42] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:54:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1206', diff saved to https://phabricator.wikimedia.org/P42676 and previous config saved to /var/cache/conftool/dbconfig/20221213-065402-marostegui.json
[06:54:23] !log Reboot db1206 to test RAID controller
[06:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:35] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:56:35] !log kartik@deploy1002 helmfile [eqiad] START 
helmfile.d/services/cxserver: apply
[06:57:29] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:59:12] !log Updated cxserver to 2022-12-06-121330-production (T321781, T324534)
[06:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:17] T321781: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781
[06:59:17] T324534: cxserver: Update Flores/NLLB-200 MT secret in Production - https://phabricator.wikimedia.org/T324534
[06:59:28] Thanks a lot Amir1
[06:59:39] happy to be useful sometimes
[06:59:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 1%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42677 and previous config saved to /var/cache/conftool/dbconfig/20221213-065949-root.json
[07:00:04] kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T0700). 
[07:04:37] (PS1) Marostegui: admin: Update my .bashrc [puppet] - https://gerrit.wikimedia.org/r/867451
[07:05:43] (CR) Ladsgroup: [C: +1] admin: Update my .bashrc [puppet] - https://gerrit.wikimedia.org/r/867451 (owner: Marostegui)
[07:14:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42678 and previous config saved to /var/cache/conftool/dbconfig/20221213-071454-root.json
[07:26:07] (CR) Marostegui: [C: +2] admin: Update my .bashrc [puppet] - https://gerrit.wikimedia.org/r/867451 (owner: Marostegui)
[07:29:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42679 and previous config saved to /var/cache/conftool/dbconfig/20221213-072959-root.json
[07:41:41] (CR) Ladsgroup: [C: +2] Reduce PC writes from parsoid API to 1% [mediawiki-config] - https://gerrit.wikimedia.org/r/867262 (owner: Daniel Kinzler)
[07:42:05] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/867262 (owner: Daniel Kinzler)
[07:42:25] (Merged) jenkins-bot: Reduce PC writes from parsoid API to 1% [mediawiki-config] - https://gerrit.wikimedia.org/r/867262 (owner: Daniel Kinzler)
[07:43:09] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:867262|Reduce PC writes from parsoid API to 1%]]
[07:45:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42680 and previous config saved to /var/cache/conftool/dbconfig/20221213-074504-root.json
[07:45:07] !log ladsgroup@deploy1002 ladsgroup and daniel: Backport for [[gerrit:867262|Reduce PC writes from parsoid API to 1%]] synced to the testservers: mwdebug1002.eqiad.wmnet, 
mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[07:52:45] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:867262|Reduce PC writes from parsoid API to 1%]] (duration: 09m 35s)
[08:00:04] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T0800).
[08:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42681 and previous config saved to /var/cache/conftool/dbconfig/20221213-080009-root.json
[08:00:17] (PS3) KartikMistry: Enable Section Translation in Chuvash Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/867002 (https://phabricator.wikimedia.org/T319176)
[08:00:26] * kart_ is here
[08:02:29] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/867002 (https://phabricator.wikimedia.org/T319176) (owner: KartikMistry)
[08:03:11] (Merged) jenkins-bot: Enable Section Translation in Chuvash Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/867002 (https://phabricator.wikimedia.org/T319176) (owner: KartikMistry)
[08:03:40] !log kartik@deploy1002 Started scap: Backport for [[gerrit:867002|Enable Section Translation in Chuvash Wikipedia (T319176)]]
[08:03:44] T319176: Enable Section Translation on 9 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319176
[08:05:26] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:867002|Enable Section Translation in Chuvash Wikipedia (T319176)]] synced to 
the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[08:13:41] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:867002|Enable Section Translation in Chuvash Wikipedia (T319176)]] (duration: 10m 01s)
[08:13:45] T319176: Enable Section Translation on 9 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319176
[08:15:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42682 and previous config saved to /var/cache/conftool/dbconfig/20221213-081514-root.json
[08:19:43] (PS1) Slyngshede: P:aptrepo::wikimedia add ldap3 component [puppet] - https://gerrit.wikimedia.org/r/867521
[08:19:57] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:52] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38728/console" [puppet] - https://gerrit.wikimedia.org/r/867521 (owner: Slyngshede)
[08:22:02] (PS6) JMeybohm: kubernetes: Use netbox data to populate topology labels [puppet] - https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: Alexandros Kosiaris)
[08:23:21] (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38729/console" [puppet] - https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: Alexandros Kosiaris)
[08:23:32] (CR) JMeybohm: kubernetes: Use netbox data to populate topology labels [puppet] - https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: Alexandros Kosiaris)
[08:25:03] (PS10) Aqu: Backing up HDFS FSImage to HDFS on Monday morning 
[puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850)
[08:25:22] (CR) CI reject: [V: -1] Backing up HDFS FSImage to HDFS on Monday morning [puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[08:26:35] (PS7) JMeybohm: kubernetes: Use netbox data to populate topology labels [puppet] - https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: Alexandros Kosiaris)
[08:27:40] (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38731/console" [puppet] - https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: Alexandros Kosiaris)
[08:30:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42683 and previous config saved to /var/cache/conftool/dbconfig/20221213-083019-root.json
[08:31:41] (CR) JMeybohm: [V: +1] kubernetes: Use netbox data to populate topology labels (1 comment) [puppet] - https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: Alexandros Kosiaris)
[08:34:12] (PS4) Muehlenhoff: Make puppetdb[12]003 puppetdb nodes [puppet] - https://gerrit.wikimedia.org/r/863255
[08:36:06] (CR) JMeybohm: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[08:36:48] (Traffic bill over quota) firing: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[08:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[08:42:59] (CR) Joal: "Open question: Since now we gzip the fsimages, would we only keep the gzipped version instead of both the raw and the compressed ones?" [puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[08:43:26] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/867521 (owner: Slyngshede)
[08:43:45] (CR) Slyngshede: [V: +1 C: +2] P:aptrepo::wikimedia add ldap3 component [puppet] - https://gerrit.wikimedia.org/r/867521 (owner: Slyngshede)
[08:45:53] (PS11) Aqu: Backing up HDFS FSImage to HDFS on Monday morning [puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850)
[08:49:45] (CR) Aqu: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38732/console" [puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[08:53:00] (CR) Ayounsi: "Some comments inline. Overall this is going in the good direction!" 
[puppet] - https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: Cathal Mooney)
[08:55:15] !log installing xen security updates
[08:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:03] (PS12) Aqu: Backing up HDFS FSImage to HDFS on Monday morning [puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850)
[08:57:10] (CR) Aqu: "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38733/console" [puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[08:58:23] !log installing libpgjava security updates
[08:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:05] hashar and ^demon: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T0900).
[09:01:58] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: name=parse1002.eqiad.wmnet
[09:02:49] (CR) Hashar: "The reason for the extra entry was to save the Jenkins config file more often." 
[puppet] - https://gerrit.wikimedia.org/r/867246 (owner: Dzahn)
[09:04:06] I am going to run the train ;)
[09:04:33] hashar gimme a sec I have a server to repool
[09:05:27] sure thing
[09:05:32] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1002.eqiad.wmnet
[09:05:32] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1002.eqiad.wmnet
[09:06:45] RECOVERY - mediawiki-installation DSH group on parse1002 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[09:07:51] !log Repooled parse1002.eqiad.wmnet in parsoid service - T324949
[09:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:55] T324949: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949
[09:07:59] hashar: All god
[09:08:01] good*
[09:08:18] (PS13) Aqu: Backing up HDFS FSImage to HDFS on Monday morning [puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850)
[09:09:06] in good we trust
[09:09:14] I am rolling wmf.14 to group0 now
[09:09:19] (PS1) TrainBranchBot: group0 wikis to 1.40.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/867522 (https://phabricator.wikimedia.org/T320519)
[09:09:23] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.40.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/867522 (https://phabricator.wikimedia.org/T320519) (owner: TrainBranchBot)
[09:09:31] (PS1) Slyngshede: C:ldap::management Install and configure bitu-ldap [puppet] - https://gerrit.wikimedia.org/r/867523
[09:09:31] hashar: E pluribus trainum
[09:09:49] (CR) CI reject: [V: -1] C:ldap::management Install and configure bitu-ldap [puppet] - https://gerrit.wikimedia.org/r/867523 (owner: Slyngshede)
[09:09:57] (Merged) jenkins-bot: group0 wikis to 1.40.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/867522 
(https://phabricator.wikimedia.org/T320519) (owner: TrainBranchBot)
[09:10:03] (CR) Aqu: Backing up HDFS FSImage to HDFS on Monday morning (4 comments) [puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[09:13:11] SRE, SRE-OnFire, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi)
[09:13:41] SRE, SRE-OnFire, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi)
[09:13:47] SRE, ops-eqsin, Infrastructure-Foundations, netops, Wikimedia-Incident: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (ayounsi)
[09:14:58] (PS2) Slyngshede: C:ldap::management Install and configure bitu-ldap [puppet] - https://gerrit.wikimedia.org/r/867523
[09:15:31] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:48] (Traffic bill over quota) resolved: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[09:17:36] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.14 refs T320519
[09:17:40] T320519: 1.40.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T320519
[09:18:35] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38735/console" [puppet] - https://gerrit.wikimedia.org/r/867523 (owner: Slyngshede)
[09:19:12] (CR) Muehlenhoff: C:ldap::management Install and configure bitu-ldap (1 comment) [puppet] - https://gerrit.wikimedia.org/r/867523 (owner: 
Slyngshede)
[09:20:29] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38736/console" [puppet] - https://gerrit.wikimedia.org/r/867523 (owner: Slyngshede)
[09:21:27] SRE, SRE-OnFire, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi)
[09:23:11] (CR) David Caro: [C: -1] wmcs: changes to api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[09:23:26] (PS14) Aqu: Backing up HDFS FSImage to HDFS on Monday morning [puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850)
[09:25:16] (PS3) Slyngshede: C:ldap::management Install and configure bitu-ldap [puppet] - https://gerrit.wikimedia.org/r/867523
[09:25:35] (CR) CI reject: [V: -1] C:ldap::management Install and configure bitu-ldap [puppet] - https://gerrit.wikimedia.org/r/867523 (owner: Slyngshede)
[09:25:48] (CR) Aqu: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38737/console" [puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[09:25:58] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38738/console" [puppet] - https://gerrit.wikimedia.org/r/867523 (owner: Slyngshede)
[09:26:15] (PS1) Alexandros Kosiaris: grafana: Explicitly set default theme to light [puppet] - https://gerrit.wikimedia.org/r/867527
[09:26:32] (PS4) Slyngshede: C:ldap::management Install and configure bitu-ldap [puppet] - https://gerrit.wikimedia.org/r/867523
[09:27:40] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38739/console" [puppet] - 10https://gerrit.wikimedia.org/r/867523 (owner: 10Slyngshede) [09:29:39] (03PS1) 10Func: RangeChronologicalPager: Restore the compatibility with derived classes [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867237 (https://phabricator.wikimedia.org/T228431) [09:32:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (per https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/#default_theme)" [puppet] - 10https://gerrit.wikimedia.org/r/867527 (owner: 10Alexandros Kosiaris) [09:32:26] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38740/console" [puppet] - 10https://gerrit.wikimedia.org/r/867523 (owner: 10Slyngshede) [09:33:02] (03CR) 10Aqu: [V: 03+1] "OK, the last patch is gunzipping the backup at creation (not only on Monday)." [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [09:37:49] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [09:39:10] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10ayounsi) JTAC wants us to try to re-seat the linecard before doing any RMA. Work scheduled for Jan 12th. Opened procurement {T325048} for the remote hands work. [09:39:18] (03CR) 10Jaime Nuche: [C: 04-1] "Thanks for this patch Daniel." 
[puppet] - 10https://gerrit.wikimedia.org/r/867294 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [09:39:20] (03CR) 10Hashar: [C: 03+1] grafana: Explicitly set default theme to light [puppet] - 10https://gerrit.wikimedia.org/r/867527 (owner: 10Alexandros Kosiaris) [09:39:39] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [09:40:49] maybe that one complained cause the train deployment resulted in a small latency bump [09:40:54] (fixed url https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=now-3h&orgId=1&to=now&var-cluster=api_appserver&var-datasource=codfw%20prometheus%2Fops&var-method=POST&viewPanel=9&var-site=All&var-code=200&var-php_version=All ) [09:41:51] the latency is now lower by 10ms (80ms > 70ms) [09:45:31] (03CR) 10Muehlenhoff: C:ldap::management Install and configure bitu-ldap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867523 (owner: 10Slyngshede) [09:53:25] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10jnuche) Thanks a lot for adding the new identity @Dzahn I don't know if there's another way to grant a... 
[09:53:35] jouncebot: nowandnext [09:53:35] For the next 1 hour(s) and 6 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T0900) [09:53:35] In 1 hour(s) and 6 minute(s): MediaWiki on kubernetes (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T1100) [09:54:43] !log installing libhttp-daemon-perl security updates [09:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:57] <_joe_> hashar: are you done with the train? [09:57:01] (03PS2) 10Muehlenhoff: Add Cumin alias for orchestrator [puppet] - 10https://gerrit.wikimedia.org/r/857017 [09:58:20] _joe_: yeah [09:58:35] (03PS1) 10JMeybohm: calico, cfssl-issuer: Remove chart defined dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/867529 (https://phabricator.wikimedia.org/T303279) [09:59:23] (03CR) 10CI reject: [V: 04-1] calico, cfssl-issuer: Remove chart defined dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/867529 (https://phabricator.wikimedia.org/T303279) (owner: 10JMeybohm) [10:00:49] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:01:24] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:01:32] (03CR) 10Dreamy Jazz: "Thanks for the work on this. This patch fixes the issue. 
Would appreciate a merge as there will be no train deploys for at least 2 weeks, " [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867237 (https://phabricator.wikimedia.org/T228431) (owner: 10Func) [10:06:12] (03CR) 10Func: RangeChronologicalPager: Restore the compatibility with derived classes (031 comment) [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867237 (https://phabricator.wikimedia.org/T228431) (owner: 10Func) [10:06:41] <_joe_> vgutierrez, claime so we can start early if you're both available [10:06:59] <_joe_> my idea for deploying was: disable puppet on all cp::text nodes [10:07:12] <_joe_> merge the patch, run puppet on one ULSFO node [10:07:14] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for orchestrator [puppet] - 10https://gerrit.wikimedia.org/r/857017 (owner: 10Muehlenhoff) [10:07:23] sounds good [10:07:27] <_joe_> do a smoke test that at least we can reach the API and the sites [10:07:30] <_joe_> deploy everywhere [10:08:12] ulsfo is running the single backend experiment so makes tests even easier/safer [10:09:15] <_joe_> I intended to cache-bust but yeah :P [10:09:40] no problem [10:09:52] just saying that varnish -> ats-be won't jump boxes [10:10:47] (03PS3) 10Giuseppe Lavagetto: trafficserver: move test2wiki to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/862845 (https://phabricator.wikimedia.org/T290536) [10:12:17] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: move test2wiki to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/862845 (https://phabricator.wikimedia.org/T290536) (owner: 10Giuseppe Lavagetto) [10:13:54] (03CR) 10Jbond: base::cloud_production: introduce new profile (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:14:05] (03PS3) 10Jbond: base::cloud_production: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) 
[10:14:27] <_joe_> ok, running puppet on cp4037 [10:15:10] ack [10:15:11] ack [10:15:30] <_joe_> and btw, I'm gonna query ats directly there [10:15:39] ack [10:15:51] :3128 and -X 'X-Forwarded-Proto: https' but you already know that [10:15:56] -H even :) [10:16:09] (03CR) 10Elukey: [C: 03+1] k8s: Add support for PKI with k8s >= 1.23 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:16:13] vgutierrez: I don't :D [10:16:32] <_joe_> https://phabricator.wikimedia.org/P42684 [10:16:59] <_joe_> claime: can you check that mediawiki-main-6bb754cd75-js25x is in mw-web in codfw please? [10:17:04] (03CR) 10Elukey: [C: 03+1] k8s: Remove authz_mode hiera key [puppet] - 10https://gerrit.wikimedia.org/r/866444 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:17:07] yes [10:17:49] _joe_: how do you map containers to clusters? [10:17:49] _joe_: confirmed [10:17:50] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for Apache on VRTS [puppet] - 10https://gerrit.wikimedia.org/r/865674 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:18:04] <_joe_> amazing [10:18:08] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10MatthewVernon) [10:18:13] <_joe_> vgutierrez: with kubectl get pods [10:18:15] (03CR) 10FNegri: "> Patch Set 1: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:18:28] <_joe_> but yeah we might need to add the cluster to the release name [10:18:42] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon All thanos nodes now bullseye. 
[10:18:43] vgutierrez: basically `kube-env mw-web codfw; kubectl get pods | grep mediawiki-main-6bb754cd75-js25x` [10:18:46] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10MatthewVernon) [10:18:46] <_joe_> the name should be something like [10:18:57] <_joe_> mediawiki-main-6bb754cd75-js25x.mw-web.codfw [10:19:16] <_joe_> although logstash already has all that info [10:19:24] So add .Release.Namespace and .Release.Environment iirc [10:19:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 31): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38741/console" [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:19:31] <_joe_> claime: exactly [10:19:32] claime: hmmm that implies that I should know that it could be on mw-web [10:19:50] vgutierrez: Hence the above proposal [10:19:55] yep [10:19:56] <_joe_> vgutierrez: logstash has all the info but yeah, this is an improvement [10:20:15] This is a kinda quick fix, but may require a chart bump [10:20:23] <_joe_> yes, and some caution [10:20:30] I'll check it out once we're done [10:20:54] <_joe_> I'm not 100% sure it's possible to do as i just suggested btw [10:21:35] !lod puppet disabled on cp hosts for T290536 [10:21:35] T290536: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 [10:21:41] (03CR) 10Elukey: kubernetes: Use netbox data to populate topology labels (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: 10Alexandros Kosiaris) [10:21:44] !log puppet disabled on cp hosts for T290536 [10:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:00] _joe_: It's not for now, I'll check it out later [10:22:01] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:22:38] !log puppet run on cp4037 - T290536 [10:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:50] There :D [10:22:53] <_joe_> uh [10:23:03] <_joe_> why that alert? [10:23:09] hmmm that's weird [10:23:15] <_joe_> I didn't run puppet there [10:23:21] I think it's a coincidence [10:23:23] exporters are healthy [10:23:27] It's been flapping all the time [10:23:36] but prometheus isn't able to reach them for some reason? [10:23:51] I think I pinged s.ukhe about it last week [10:24:04] moritzm: have you been working on ganeti5xxx instances lately? [10:24:37] anyways.. I'll follow later, I've been out last week so maybe I'm missing some context [10:25:29] <_joe_> claime: is mediawiki-main-796c8cff88-4gl5q in mw-web in eqiad? [10:26:05] _joe_: yup [10:26:10] <_joe_> great [10:26:15] <_joe_> ok now the api endpoints [10:26:52] <_joe_> mediawiki-main-5c985d9499-nncjx should be in mw-api-ext eqiad [10:27:09] _joe_: yup [10:27:13] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38746/console" [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:27:19] <_joe_> uhm, problem [10:27:28] <_joe_> api.php GETs don't work [10:28:41] <_joe_> SIGH [10:28:46] <_joe_> fix incoming [10:29:09] (03Abandoned) 10Volans: cumin::cloud_target: add a new profile [puppet] - 10https://gerrit.wikimedia.org/r/867170 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:29:47] (03CR) 10Jaime Nuche: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/867217 (owner: 10Jaime Nuche) [10:30:01] (03PS1) 10Giuseppe Lavagetto: trafficserver: fix API ro hostname for mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/867532 [10:30:09] <_joe_> claime: ^^ [10:30:16] * _joe_ puts brown paper bag on [10:30:32] (03CR) 10Clément Goubert: [C: 03+1] trafficserver: fix API ro hostname for mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/867532 (owner: 10Giuseppe Lavagetto) [10:30:36] lol [10:30:44] so close [10:30:49] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: fix API ro hostname for mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/867532 (owner: 10Giuseppe Lavagetto) [10:30:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: fix API ro hostname for mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/867532 (owner: 10Giuseppe Lavagetto) [10:31:01] _joe_: missing task on commit message [nitpick] [10:31:30] <_joe_> vgutierrez: meh it's a one-line fix and it's affecting users of test2wiki :P [10:31:43] Same problem L370 BTW [10:32:03] <_joe_> sigh yes [10:32:11] 413 [10:32:19] 428 [10:32:31] <_joe_> no it's not [10:32:33] idk who did that code review (me) but they sure done goofed [10:32:36] <_joe_> those are on-prem [10:32:44] Oh right [10:32:45] <_joe_> those are correct [10:32:47] (03PS21) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [10:32:51] yea ya [10:32:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] maps: remove redis [puppet] - 10https://gerrit.wikimedia.org/r/865056 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [10:33:02] I figured it out right as you were saying it [10:33:32] (03PS1) 10Giuseppe Lavagetto: trafficserver: further fix of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/867534 [10:34:04] (03CR) 10Clément Goubert: [C: 03+1] trafficserver: further fix of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/867534 
(owner: 10Giuseppe Lavagetto) [10:34:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: further fix of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/867534 (owner: 10Giuseppe Lavagetto) [10:34:38] (03CR) 10Jbond: Example strategy for marking DSCP with ferm and puppet integration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [10:34:46] (03CR) 10CI reject: [V: 04-1] cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [10:36:12] !log clean up stale prometheus target files in prometheus5001 [10:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:46] (JobUnavailable) resolved: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:37:12] decommissioned hosts still referenced in stale prometheus target files [10:37:13] <_joe_> claime: mediawiki-main-64579c8868-vvq74 is in mw-api-ext in codfw, correct? 
[10:37:22] (03PS4) 10Volans: base::cloud_production: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) [10:37:51] _joe_: yes [10:38:27] (03CR) 10Ladsgroup: [C: 03+1] grafana: Explicitly set default theme to light [puppet] - 10https://gerrit.wikimedia.org/r/867527 (owner: 10Alexandros Kosiaris) [10:38:45] <_joe_> ok now testing rest.php [10:39:41] (03CR) 10Clément Goubert: [C: 03+1] mwdebug_deploy: remove resources from deployment server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867217 (owner: 10Jaime Nuche) [10:40:01] <_joe_> ok rest.php is ok as well [10:40:07] <_joe_> routing seems to be working correctly [10:40:19] nice [10:40:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] base::cloud_production: introduce new profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:40:54] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:41:04] cool :d [10:41:06] (03PS22) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [10:41:42] (03CR) 10CI reject: [V: 04-1] cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [10:43:17] <_joe_> now testing X-W-D [10:43:30] <_joe_> I seem to be doing something wrong though [10:44:22] VERBOSE=1 what do you mean _joe_? [10:44:42] <_joe_> vgutierrez: meh I was writing wmdebug [10:44:53] * vgutierrez sends some coffee to _joe_ [10:45:00] Fingers am I right? 
[10:45:24] (03PS5) 10Volans: base::cloud_production: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) [10:46:02] <_joe_> interestingly if I select mwdebug1001 from cp4037 the request hangs [10:46:15] <_joe_> and I get a 502 [10:46:17] huh [10:46:25] <_joe_> uhm is mwdebug1001 down or something? [10:46:42] no it's up [10:47:06] _joe_: :) [10:47:08] it's a port issue [10:47:31] <_joe_> ahhh fuck you're right [10:47:36] <_joe_> enwiki works [10:47:45] <_joe_> ok we can live with this I think [10:47:54] (03PS6) 10Volans: base::cloud_production: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) [10:47:54] <_joe_> and fix it after the release [10:48:05] vgutierrez: can you expand a bit please? [10:48:08] <_joe_> vgutierrez: do you prefer me to send a fix before? it's a tad complex [10:48:42] <_joe_> claime: we're first switching to mw-web:4450 [10:48:42] (03PS23) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [10:48:50] <_joe_> then switching mw-web for mwdebug1001 [10:48:54] claime: x-wikimedia-debug-routing.lua just replaces the hostname [10:48:59] <_joe_> so we call mwdebug1001:4450 [10:49:00] claime: not the port part [10:49:02] Ah, right [10:49:04] <_joe_> vgutierrez: yeah unless you use k8s [10:49:06] Gotcha [10:49:11] (03CR) 10CI reject: [V: 04-1] cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [10:49:12] claime: mwdebug1001 expects traffic on 443 [10:49:12] <_joe_> it's not hard to fix I think [10:49:21] and your mw k8s cluster uses another port [10:49:24] hence the issue [10:49:28] Thanks [10:49:36] _joe_: last famous words(TM) [10:49:52] (03CR) 10Volans: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:50:15] _joe_: happy to proceed like this assuming that you're gonna fix it today [10:51:43] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@e988b5e]: Relax sla for the weekly es transfer and subgraph_and_query_metrics [10:51:48] _joe_: so.. in x-wm-debug-routing.lua... if ts.client_request.get_url() == testwiki --> set_url_port() [10:52:12] that would be the naive approach [10:52:24] (03PS1) 10Giuseppe Lavagetto: trafficserver: always set the port explicitly in x-w-d [puppet] - 10https://gerrit.wikimedia.org/r/867537 (https://phabricator.wikimedia.org/T290536) [10:52:29] <_joe_> vgutierrez: ^^ :P [10:53:00] <_joe_> it's more general yet still naive enough it doesn't require a lot of work [10:53:13] makes sense [10:53:28] hmm yep [10:53:37] <_joe_> ok last test I want to run: httpbb [10:54:08] (03CR) 10CI reject: [V: 04-1] trafficserver: always set the port explicitly in x-w-d [puppet] - 10https://gerrit.wikimedia.org/r/867537 (https://phabricator.wikimedia.org/T290536) (owner: 10Giuseppe Lavagetto) [10:54:09] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@e988b5e]: Relax sla for the weekly es transfer and subgraph_and_query_metrics (duration: 02m 25s) [10:54:19] (03CR) 10Arturo Borrero Gonzalez: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [10:54:24] So point httpbb at cp4037 ? 
[10:54:26] (03CR) 10FNegri: base::cloud_production: introduce new profile (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:54:28] <_joe_> httpbb --hosts cp4037.ulsfo.wmnet /srv/deployment/httpbb-tests/appserver/*.yaml [10:54:29] (03CR) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [10:54:35] <_joe_> not sure all tests will work [10:54:45] Guess we'll see [10:54:53] That's what tests are for :p [10:55:11] _joe_: you need to amend x-wikimedia-debug-routing_test.lua as well [10:55:35] <_joe_> vgutierrez: why? [10:55:59] <_joe_> claime: three tests failed for perfectly valid reasons, everything is good AFAICT [10:56:15] It appears nowhere in puppet except the file being there in modules/profile/files/trafficserver/ [10:56:29] <_joe_> vgutierrez: I mean, you want me to write another test? [10:56:34] Ah! 
[10:56:48] It's an actual test file [10:56:53] (03PS1) 10Jbond: sre.hardware: support 4 diget version numbers for network drivers [cookbooks] - 10https://gerrit.wikimedia.org/r/867538 (https://phabricator.wikimedia.org/T324606) [10:57:17] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@27ac6d3] (codfw): Increase codfw mirrored traffic to 100% [10:57:52] I'd say you can just add checks to see if it sets the ports correctly in each scenario [10:57:58] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) [10:58:13] <_joe_> claime: yeah [10:58:24] (03CR) 10Arturo Borrero Gonzalez: "A PCC run: https://puppet-compiler.wmflabs.org/output/861902/38748/" [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [10:58:58] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@27ac6d3] (codfw): Increase codfw mirrored traffic to 100% (duration: 01m 40s) [11:00:04] _joe_ and claime: May I have your attention please! MediaWiki on kubernetes. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T1100) [11:00:24] yes yes jounce we're on it already. 
gosh these AIs [11:00:40] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) [11:01:59] (03PS2) 10Giuseppe Lavagetto: trafficserver: always set the port explicitly in x-w-d [puppet] - 10https://gerrit.wikimedia.org/r/867537 (https://phabricator.wikimedia.org/T290536) [11:02:01] (03PS1) 10Giuseppe Lavagetto: trafficserver: remove support for "php7.4" option in x-w-d [puppet] - 10https://gerrit.wikimedia.org/r/867541 [11:02:08] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/867538 (https://phabricator.wikimedia.org/T324606) (owner: 10Jbond) [11:02:13] (03PS5) 10Slyngshede: P:openldap::management Install and configure bitu-ldap [puppet] - 10https://gerrit.wikimedia.org/r/867523 [11:02:23] <_joe_> vgutierrez: amended the tests, can you take a look? [11:02:54] <_joe_> if you like if now, I'd merge it, verify x-w-d isn't broken anymore for test2wiki, open puppet again everywhere [11:03:07] (03PS2) 10JMeybohm: calico, cfssl-issuer: Remove chart defined dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/867529 (https://phabricator.wikimedia.org/T303279) [11:03:56] (03CR) 10Hnowlan: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans) [11:04:24] (03CR) 10CI reject: [V: 04-1] trafficserver: remove support for "php7.4" option in x-w-d [puppet] - 10https://gerrit.wikimedia.org/r/867541 (owner: 10Giuseppe Lavagetto) [11:04:26] (03CR) 10CI reject: [V: 04-1] trafficserver: always set the port explicitly in x-w-d [puppet] - 10https://gerrit.wikimedia.org/r/867537 (https://phabricator.wikimedia.org/T290536) (owner: 10Giuseppe Lavagetto) [11:04:46] (03PS6) 10Slyngshede: P:openldap::management Install and configure bitu-ldap [puppet] - 10https://gerrit.wikimedia.org/r/867523 [11:08:26] (03PS3) 10Giuseppe Lavagetto: trafficserver: always set the port 
explicitly in x-w-d [puppet] - 10https://gerrit.wikimedia.org/r/867537 (https://phabricator.wikimedia.org/T290536) [11:11:09] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: always set the port explicitly in x-w-d [puppet] - 10https://gerrit.wikimedia.org/r/867537 (https://phabricator.wikimedia.org/T290536) (owner: 10Giuseppe Lavagetto) [11:12:37] (03CR) 10Clément Goubert: mwdebug_deploy: remove resources from deployment server [puppet] - 10https://gerrit.wikimedia.org/r/867217 (owner: 10Jaime Nuche) [11:14:06] <_joe_> ok now everything works [11:14:25] <_joe_> I'll reenable puppet everywhere [11:14:32] <_joe_> claime: any reason not to? [11:15:08] not that I can think of [11:15:12] <_joe_> ok, done [11:16:28] (03PS8) 10JMeybohm: kubernetes: Use netbox data to populate topology labels [puppet] - 10https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: 10Alexandros Kosiaris) [11:16:33] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_mlserve:prod.service,swift-account-stats_search:platform.service,swift-account-stats_swift:dispersion.service,swift-account-stats_tegola:prod.service,swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:26] (03CR) 10JMeybohm: kubernetes: Use netbox data to populate topology labels (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: 10Alexandros Kosiaris) [11:17:47] !log Puppet re-enabled on cp::text nodes - T290536 [11:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:51] T290536: Serve production traffic via Kubernetes - 
https://phabricator.wikimedia.org/T290536 [11:18:24] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: return status of the cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/867544 (https://phabricator.wikimedia.org/T324606) [11:18:34] (03PS20) 10Hashar: Replace CI results table by Gerrit Check API [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/859083 (https://phabricator.wikimedia.org/T214068) [11:18:36] (03PS10) 10Hashar: Add unit testing with QUnit [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/861486 [11:18:42] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38750/console" [puppet] - 10https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: 10Alexandros Kosiaris) [11:20:06] (03CR) 10Volans: "Ok the patch should be in good shape now, PCC is happy. Waiting for consensus before merging." [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:20:08] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: return status of the cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/867544 (https://phabricator.wikimedia.org/T324606) (owner: 10Jbond) [11:20:28] (03CR) 10Hashar: Replace CI results table by Gerrit Check API (032 comments) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/859083 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [11:20:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/867523 (owner: 10Slyngshede) [11:21:07] (03CR) 10Hashar: "Patchset 10 covers overriding the pipeline message by the job message if there is any send by Zuul." 
[software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/861486 (owner: 10Hashar) [11:21:14] (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2045 & mc2046 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867545 (https://phabricator.wikimedia.org/T293012) [11:22:47] !log installing paramiko security updates# [11:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:29] (03PS7) 10Volans: base::cloud_production: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) [11:25:48] (03CR) 10Elukey: kubernetes: Use netbox data to populate topology labels (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: 10Alexandros Kosiaris) [11:27:25] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:16] (03PS37) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [11:29:18] (03CR) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [11:30:27] (03CR) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [11:31:20] (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [11:31:47] (03PS38) 10David Caro: wmcs: changes to api service to manage 
toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [11:33:27] (03CR) 10Elukey: [C: 03+1] calico, cfssl-issuer: Remove chart defined dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/867529 (https://phabricator.wikimedia.org/T303279) (owner: 10JMeybohm) [11:33:56] (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [11:34:30] (03PS39) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [11:34:45] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38751/console" [puppet] - 10https://gerrit.wikimedia.org/r/867523 (owner: 10Slyngshede) [11:35:48] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38752/console" [puppet] - 10https://gerrit.wikimedia.org/r/867523 (owner: 10Slyngshede) [11:35:52] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: Add mc2045 & mc2046 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867545 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli) [11:36:38] (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [11:36:44] (03CR) 10Slyngshede: [V: 03+1] P:openldap::management Install and configure bitu-ldap (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/867523 (owner: 10Slyngshede) [11:36:58] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:openldap::management Install and 
configure bitu-ldap [puppet] - 10https://gerrit.wikimedia.org/r/867523 (owner: 10Slyngshede) [11:37:59] slyngs: I was faster, shall I merge your patch ? [11:38:10] Aah, yes please :-) [11:38:34] Wrote to you in wikimedia-sre [11:40:21] (03CR) 10Lucas Werkmeister (WMDE): query_service: support downloads in query builder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867142 (https://phabricator.wikimedia.org/T323451) (owner: 10Lucas Werkmeister (WMDE)) [11:41:21] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: enable pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/867547 [11:41:38] (03PS40) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [11:42:33] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on idp-test1002.wikimedia.org with reason: Various tests which may cause temporary breakage on idp-test.w.o [11:42:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on idp-test1002.wikimedia.org with reason: Various tests which may cause temporary breakage on idp-test.w.o [11:42:52] (03CR) 10David Caro: [V: 03+1] "This current patch passes all the tests:" [puppet] - 10https://gerrit.wikimedia.org/r/813898 (https://phabricator.wikimedia.org/T313444) (owner: 10David Caro) [11:43:16] (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: enable pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/867547 (owner: 10Effie Mouzeli) [11:43:43] (03PS1) 10Jcrespo: mariadb: Reduce memory committment of db2100 to reserve it for backups [puppet] - 10https://gerrit.wikimedia.org/r/867548 [11:43:46] (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [11:46:36] (03CR) 
10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: enable pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/867547 (owner: 10Effie Mouzeli) [11:46:54] (03CR) 10Marostegui: [C: 03+1] mariadb: Reduce memory committment of db2100 to reserve it for backups [puppet] - 10https://gerrit.wikimedia.org/r/867548 (owner: 10Jcrespo) [11:47:21] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:23] (03CR) 10Hashar: [C: 03+2] "I had various feedback and patched them up and I elect this as good enough. There must be a few fix to conduct but it will be easier as f" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/859083 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [11:47:53] (03Merged) 10jenkins-bot: Replace CI results table by Gerrit Check API [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/859083 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [11:47:56] (03CR) 10Hashar: [C: 03+2] "Thanks for all the feedback and Timo and your guidance with javascript, ES6 and QUnit!" 
[software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/861486 (owner: 10Hashar) [11:48:07] (03PS2) 10Jcrespo: mariadb: Reduce memory commitment of db2100 to reserve it for backups [puppet] - 10https://gerrit.wikimedia.org/r/867548 [11:48:31] (03Merged) 10jenkins-bot: Add unit testing with QUnit [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/861486 (owner: 10Hashar) [11:48:34] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reduce memory commitment of db2100 to reserve it for backups [puppet] - 10https://gerrit.wikimedia.org/r/867548 (owner: 10Jcrespo) [11:51:03] I am doing a Gerrit plugin deployment and will restart it in a few minutes [11:51:34] (03Merged) 10jenkins-bot: tegola-vector-tiles: enable pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/867547 (owner: 10Effie Mouzeli) [11:52:04] !log hashar@deploy1002 Started deploy [gerrit/gerrit@9ef1a16]: Replace CI result table by Checks API plugin - T214068 [11:52:04] (03PS2) 10Jbond: sre.hardware: support 4 digit version numbers for network drivers [cookbooks] - 10https://gerrit.wikimedia.org/r/867538 (https://phabricator.wikimedia.org/T324606) [11:52:07] T214068: Display Zuul status of jobs for a change on Gerrit UI - https://phabricator.wikimedia.org/T214068 [11:52:14] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@9ef1a16]: Replace CI result table by Checks API plugin - T214068 (duration: 00m 11s) [11:52:28] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: prevent upgrading drivers if idrac to low [cookbooks] - 10https://gerrit.wikimedia.org/r/867550 [11:52:35] (03PS1) 10Volans: cumin::cloud_master: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) [11:54:06] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: prevent upgrading drivers if idrac to low [cookbooks] - 10https://gerrit.wikimedia.org/r/867550 (owner: 10Jbond) [11:54:10] !log Restarted Gerrit on gerrit2002 
(replica) [11:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:04] !log hashar@deploy1002 Started deploy [gerrit/gerrit@9ef1a16]: Replace CI result table by Checks API plugin - T214068 [11:55:13] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@9ef1a16]: Replace CI result table by Checks API plugin - T214068 (duration: 00m 09s) [11:55:21] stopping gerrit [11:55:27] will be back in a couple minutes [11:57:41] !log Restarted Gerrit on gerrit1001 [11:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:12] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: Volans) [11:58:40] (PS1) David Caro: replica_cnf_web: add functional tests [puppet] - https://gerrit.wikimedia.org/r/867566 [11:58:59] (CR) CI reject: [V: -1] replica_cnf_web: add functional tests [puppet] - https://gerrit.wikimedia.org/r/867566 (owner: David Caro) [12:00:03] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:11] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: restart [12:00:25] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: restart [12:00:56] (CR) Jbond: [C: +2] sre.hardware: support 4 digit version numbers for network drivers [cookbooks] - https://gerrit.wikimedia.org/r/867538 (https://phabricator.wikimedia.org/T324606) (owner: Jbond) [12:02:39] (Merged) jenkins-bot: sre.hardware: support 4 digit version numbers for network drivers [cookbooks] - https://gerrit.wikimedia.org/r/867538 (https://phabricator.wikimedia.org/T324606) (owner: Jbond) [12:04:33] (PS2) Jbond: sre.hardware.upgrade-firmware: return status of the cookbook 
[cookbooks] - 10https://gerrit.wikimedia.org/r/867544 (https://phabricator.wikimedia.org/T324606) [12:04:35] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: prevent upgrading drivers if idrac to low [cookbooks] - 10https://gerrit.wikimedia.org/r/867550 [12:04:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10BTullis) [12:05:31] (03CR) 10JMeybohm: [C: 03+2] k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [12:05:34] (03CR) 10JMeybohm: [C: 03+2] k8s: Remove authz_mode hiera key [puppet] - 10https://gerrit.wikimedia.org/r/866444 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [12:05:37] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes: Use netbox data to populate topology labels [puppet] - 10https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: 10Alexandros Kosiaris) [12:07:31] !log Gerrit now has CI job results represented in the Checks tab which should be a little nicer. The old HTML result table is gone and replaced by little bubbles representing the state of the builds for the latest patchset. 
Ref: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/3ULF5NPVC4MSVABZBSXAMDODLZUKFXHS/ [12:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:36] (03PS2) 10David Caro: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 [12:07:56] (03CR) 10CI reject: [V: 04-1] replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 (owner: 10David Caro) [12:09:21] (03PS3) 10David Caro: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 [12:09:23] (03PS1) 10Slyngshede: C:ldap::management deploy updated modify-mfa tool. [puppet] - 10https://gerrit.wikimedia.org/r/867568 [12:09:38] (03CR) 10David Caro: [V: 03+1] "works! \o/" [puppet] - 10https://gerrit.wikimedia.org/r/867566 (owner: 10David Caro) [12:12:20] (03CR) 10Slyngshede: "Same version as attempted deployed yesterday." [puppet] - 10https://gerrit.wikimedia.org/r/867568 (owner: 10Slyngshede) [12:15:18] (ProbeDown) firing: Service sessionstore:8081 has failed probes (http_sessionstore_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#sessionstore:8081 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:15:18] (ProbeDown) firing: Service sessionstore:8081 has failed probes (http_sessionstore_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#sessionstore:8081 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:15:32] oops [12:15:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:16:12] akosiaris: here if needed [12:16:13] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=204 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:16:21] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_mlserve:prod.service,swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:27] I see an increase in sessionstore rps as well as potential crash? [12:16:33] https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=now-15m&to=now [12:16:47] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1008.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1017.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:16:57] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [12:17:03] it does appear to subside already 
though in the graphs [12:17:21] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1012.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:18:07] Yeah everything else comes down to a potential sessionstore crash [12:18:08] akosiaris: claime that could be me [12:18:22] mean latency in eqiad increase too [12:18:29] well...I just merged a couple of k8s patches [12:18:38] (03PS2) 10Jaime Nuche: mwdebug_deploy: remove resources from deployment server [puppet] - 10https://gerrit.wikimedia.org/r/867217 [12:18:45] (JobUnavailable) firing: Reduced availability for job swagger_check_sessionstore_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:18:56] akosiaris: weren't you rebooting redises? [12:19:05] no [12:19:11] I waited for an all clear first [12:19:20] (03CR) 10Jaime Nuche: mwdebug_deploy: remove resources from deployment server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867217 (owner: 10Jaime Nuche) [12:20:22] Seems, I can't save any edits. [12:20:35] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [12:20:42] cgoubert@deploy1002:~$ kube-env sessionstore eqiad [12:20:43] jayme: kubectl get pods is pretty bad [12:20:44] cgoubert@deploy1002:~$ kubectl get po [12:20:46] NAME READY STATUS RESTARTS AGE [12:20:48] kask-production-6476b88b99-2ft42 0/1 Pending 0 10m [12:20:50] kask-production-6476b88b99-5swm7 0/1 Pending 0 8m27s [12:20:52] kask-production-6476b88b99-6vj2j 0/1 MatchNodeSelector 0 11m [12:20:54] kask-production-6476b88b99-9928x 0/1 MatchNodeSelector 0 20d [12:20:56] kask-production-6476b88b99-cjg54 0/1 MatchNodeSelector 0 11m [12:20:58] everything is either pending or not matching nodeselector [12:20:58] kask-production-6476b88b99-dcnxs 0/1 Pending 0 10m [12:21:00] kask-production-6476b88b99-dk9dq 0/1 MatchNodeSelector 0 20d [12:21:02] kask-production-6476b88b99-dlzrn 0/1 Pending 0 8m28s [12:21:04] kask-production-6476b88b99-dqrq9 0/1 MatchNodeSelector 0 20d [12:21:06] "We could not save your edit because the session was no longer valid. You are no longer logged in. Please log back in from a different tab and try again." 
[12:21:06] kask-production-6476b88b99-h24lb 0/1 MatchNodeSelector 0 20d [12:21:08] kask-production-6476b88b99-lrpqj 0/1 MatchNodeSelector 0 20d [12:21:10] kask-production-6476b88b99-q7pd6 0/1 MatchNodeSelector 0 20d [12:21:12] I say we immediately switch to codfw [12:21:12] kask-production-6476b88b99-r67qn 0/1 MatchNodeSelector 0 20d [12:21:12] akosiaris: everything as in kask= [12:21:14] kask-production-6476b88b99-rhbwj 0/1 Pending 0 10m [12:21:16] kask-production-6476b88b99-s94hk 0/1 MatchNodeSelector 0 11m [12:21:18] kask-production-6476b88b99-snddw 0/1 Pending 0 8m28s [12:21:20] kask-production-6476b88b99-tl6hz 0/1 Pending 0 8m26s [12:21:22] kask-production-6476b88b99-x64fs 0/1 Pending 0 10m [12:21:24] kask-production-6476b88b99-zzrsj 0/1 MatchNodeSelector 0 20d [12:21:26] Huh hoh [12:21:28] kart_: Yeah, sessionstore outage, on it [12:21:28] * akosiaris making sure codfw is ok [12:21:40] akosiaris: it's probably not [12:21:43] akosiaris: same state more or less [12:21:44] I'll revert [12:21:48] yes, neither codfw is ok [12:21:55] somewhat better, but not ok [12:22:22] akosiaris: status page update ? [12:22:29] <_joe_> jayme: please revert by hand and apply, it's a full outage [12:22:32] <_joe_> claime: yes [12:22:43] Don't think I can do it _joe_ but go ahead [12:22:47] yeah, I am on bureaucracy, please handle the tech side of things [12:22:50] <_joe_> everyone is logged out and it means it's very risky [12:23:01] PROBLEM - MediaWiki edit session loss on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 [12:23:03] <_joe_> jayme: please ack you're reverting? 
[12:23:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10BTullis) OK, I'm starting to look into this now. First of all, checking the two servers in codfw we can see that... [12:23:23] https://phabricator.wikimedia.org/T325056 claime - same? [12:23:26] <_joe_> this is a particularly bad outage [12:23:43] !log sessionstore outage, login functions severely impacted [12:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:46] kart_: yes [12:24:00] _joe_: changed the sessionstore [12:24:18] <_joe_> jayme: are you reverting the change? [12:24:19] node selector [12:24:22] eqiad coming back [12:24:31] sorry. I changed the session store node selector [12:24:40] seemed quicker than puppet revert [12:24:48] should I cleanup the pods in MatchNodeSelector status ? [12:25:06] We only have 8 running pods atm [12:25:17] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:25:18] (ProbeDown) resolved: Service sessionstore:8081 has failed probes (http_sessionstore_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#sessionstore:8081 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:19] (ProbeDown) resolved: Service sessionstore:8081 has failed probes (http_sessionstore_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#sessionstore:8081 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:44] did the same for codfw [12:25:54] claime: I'm on it [12:25:58] ack [12:26:03] <_joe_> jayme: ok well done :) [12:26:07] <_joe_> so sessionstore is back? [12:26:43] edits have shot back up to near-normal levels anyway [12:26:45] At least partially [12:26:46] <_joe_> can someone login please? 
[12:27:00] LVS has many connections to the sessionstore backend where it didn't a few mins ago [12:27:05] I can log in [12:27:08] and log out [12:27:17] claime: 8 is normal [12:27:18] same [12:27:25] login confirmed working [12:27:27] (same as kostajh that is, I can login) [12:27:39] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:28:08] jayme: ack [12:28:13] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:28:36] Confirm 8 pods in running state for sessionstore eqiad and codfw [12:28:42] We should be ok [12:28:45] (JobUnavailable) resolved: Reduced availability for job swagger_check_sessionstore_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:29:15] https://www.wikimediastatus.net/ is updated, not sure how though [12:29:41] I think one of our lovely on-calls did it [12:30:17] <_joe_> akosiaris: I did :P [12:30:21] <_joe_> can we declare the incident closed? [12:30:23] sorry y'all ... 
I completely missed that we pin sessionstore to rows - and I don't even understand why [12:30:24] ah, I was about to [12:30:45] _joe_: let it be at monitoring for like 10m [12:30:52] and then we close it [12:30:53] Yeah [12:31:03] <_joe_> ok [12:31:03] we probably have to update some phab tasks anyway [12:31:36] !log sessionstore outage being monitored [12:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:03] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64610/IPv4: Active - aux-k8s-eqiad, AS64610/IPv6: Active - aux-k8s-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:32:09] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64610/IPv6: Active - aux-k8s-eqiad, AS64610/IPv4: Active - aux-k8s-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:32:19] ^not prod [12:32:43] probably me as well...will check in a bit [12:32:51] aux is unused AFAIC [12:32:55] it is [12:33:17] I'll ack the icinga alerts [12:34:12] It's aux-k8s-worker1001 and aux-k8s-worker1002 fwiw [12:34:23] The ctrl nodes BGP sessions are still working [12:34:42] topranks: thanks [12:34:55] icinga checks acknowledged [12:35:03] <_joe_> topranks: do you see anything wrong in the metrics re: login/edits? [12:35:45] _joe_: general pattern seems to be back where it was on all graphs I've checked [12:36:21] edits look back to normal yeah [12:36:37] <_joe_> https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13&from=now-1h&to=now isn't great still, but I think it's due to people finishing editing after logging in at the wrong time [12:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:36:53] <_joe_> err s/login/obtaining a session/, those are different things [12:37:36] <_joe_> jayme: did you fix codfw as well? 
[12:37:46] _joe_: yes he did [12:37:55] _joe_: yes. 12:25 [12:38:22] <_joe_> ok sorry just checking before declaring the incident resolved on the status page [12:38:37] <_joe_> jayme: now you rightfully earned your t-shirt [12:38:59] <_joe_> \o/ [12:39:02] finally... [12:39:10] <_joe_> bd808: we really need to print another batch [12:39:22] ...I hope there is no cristmas edition [12:39:30] Hahahaha [12:39:32] *winter holiday [12:39:36] Christmas jumper special edition [12:39:46] oh oh oh I broke wikipedia [12:40:15] (JobUnavailable) firing: (2) Reduced availability for job k8s-pods in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:40:33] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (10ayounsi) a:03Papaul [12:41:32] This is related to the earlier BGP alerts on aux-k8s-eqiad [12:41:45] I'll go silence it for an hour, it's not prod [12:42:59] RECOVERY - MediaWiki edit session loss on graphite1004 is OK: OK: Less than 30.00% above the threshold [10.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 [12:44:44] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Upgrade IDPs to CAS 6.6/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518 (10MoritzMuehlenhoff) [12:44:54] (03PS1) 10JMeybohm: sessionstore: Don't pin sessionstore to specific rows [deployment-charts] - 10https://gerrit.wikimedia.org/r/867572 (https://phabricator.wikimedia.org/T325056) [12:45:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:46:22] (03PS1) 10Muehlenhoff: Add initial support for webauthn [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) [12:47:25] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:48:12] (03CR) 10CI reject: [V: 04-1] Add initial support for webauthn [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [12:51:10] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:57:43] (03PS2) 10Muehlenhoff: Add initial support for webauthn [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) [12:57:47] (03PS1) 10Hashar: pcc: tag gerrit review with 'autogenerated:pcc-py' [puppet] - 10https://gerrit.wikimedia.org/r/867579 [12:59:28] (03CR) 10CI reject: [V: 04-1] Add initial support for webauthn [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [13:02:55] (03PS3) 10Muehlenhoff: Add initial support for webauthn [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) [13:05:51] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867576 
(https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [13:10:01] (03PS4) 10Muehlenhoff: Add initial support for webauthn [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) [13:11:34] (03CR) 10MSantos: [C: 03+1] maps: remove redis [puppet] - 10https://gerrit.wikimedia.org/r/865056 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [13:13:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [13:17:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] sessionstore: Don't pin sessionstore to specific rows [deployment-charts] - 10https://gerrit.wikimedia.org/r/867572 (https://phabricator.wikimedia.org/T325056) (owner: 10JMeybohm) [13:20:00] (03CR) 10JMeybohm: [C: 03+2] sessionstore: Don't pin sessionstore to specific rows [deployment-charts] - 10https://gerrit.wikimedia.org/r/867572 (https://phabricator.wikimedia.org/T325056) (owner: 10JMeybohm) [13:22:21] (03PS5) 10Muehlenhoff: Add initial support for webauthn [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) [13:24:13] (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2044 & mc2043 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867588 (https://phabricator.wikimedia.org/T293012) [13:24:55] (03Merged) 10jenkins-bot: sessionstore: Don't pin sessionstore to specific rows [deployment-charts] - 10https://gerrit.wikimedia.org/r/867572 (https://phabricator.wikimedia.org/T325056) (owner: 10JMeybohm) [13:25:54] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: Add mc2044 & mc2043 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867588 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli) [13:26:16] sessionstore outage document at: https://wikitech.wikimedia.org/wiki/Incidents/2022-12-13_sessionstore [13:27:04] (03PS1) 10JMeybohm: k8s: Keep 
deprecated failure-domain.beta.* labels around in 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/867589 (https://phabricator.wikimedia.org/T270191) [13:27:19] (03PS2) 10JMeybohm: k8s: Keep deprecated failure-domain.beta.* labels around in 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/867589 (https://phabricator.wikimedia.org/T270191) [13:28:34] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on kafka-stretch2002.codfw.wmnet with reason: Accessing BIOS on kafka-stretch2002 [13:28:48] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kafka-stretch2002.codfw.wmnet with reason: Accessing BIOS on kafka-stretch2002 [13:28:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0e9efa15-8be5-4b76-ad0c-4cdddb24836e) set by btul... 
[13:29:34] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38754/console" [puppet] - 10https://gerrit.wikimedia.org/r/867589 (https://phabricator.wikimedia.org/T270191) (owner: 10JMeybohm) [13:30:55] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:47] (03CR) 10Krinkle: [C: 03+1] grafana: Explicitly set default theme to light [puppet] - 10https://gerrit.wikimedia.org/r/867527 (owner: 10Alexandros Kosiaris) [13:40:27] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [13:42:20] (03PS1) 10Btullis: Correct the description for insetup::data_engineering [puppet] - 10https://gerrit.wikimedia.org/r/867592 [13:45:23] (03PS1) 10Slyngshede: C:ldap::management Update add-ldap-groups [puppet] - 10https://gerrit.wikimedia.org/r/867594 [13:45:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/867592 (owner: 10Btullis) [13:45:45] (03CR) 10CI reject: [V: 04-1] C:ldap::management Update add-ldap-groups [puppet] - 10https://gerrit.wikimedia.org/r/867594 (owner: 10Slyngshede) [13:46:41] (03PS2) 10Slyngshede: C:ldap::management Update add-ldap-groups [puppet] - 10https://gerrit.wikimedia.org/r/867594 [13:48:47] (03PS2) 10Volans: cumin::cloud_master: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) [13:48:48] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [13:49:13] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [13:49:21] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [13:50:15] (03PS3) 10Slyngshede: C:ldap::management Update add-ldap-groups [puppet] - 
10https://gerrit.wikimedia.org/r/867594 [13:51:30] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [13:51:55] (03CR) 10CI reject: [V: 04-1] C:ldap::management Update add-ldap-groups [puppet] - 10https://gerrit.wikimedia.org/r/867594 (owner: 10Slyngshede) [13:57:05] (03CR) 10Hashar: [C: 03+1] rm hieradata/hosts/contint1001.yaml [puppet] - 10https://gerrit.wikimedia.org/r/867286 (https://phabricator.wikimedia.org/T324698) (owner: 10Dzahn) [13:57:08] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [13:57:23] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [13:57:35] (03CR) 10Hashar: [C: 03+1] Revert "bacula: Ignore the backup check of contint1001 jobs" [puppet] - 10https://gerrit.wikimedia.org/r/867231 (owner: 10Jcrespo) [13:57:40] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [13:59:00] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [13:59:12] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [13:59:54] (03CR) 10JMeybohm: [C: 03+2] calico, cfssl-issuer: Remove chart defined dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/867529 (https://phabricator.wikimedia.org/T303279) (owner: 10JMeybohm) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T1400). [14:00:05] duesen and Func: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T1400) [14:03:17] (03PS1) 10FNegri: Add "snakeoil" private key cloud_cumin_master [labs/private] - 10https://gerrit.wikimedia.org/r/867595 (https://phabricator.wikimedia.org/T323483) [14:03:58] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on kafka-stretch2001.codfw.wmnet with reason: Accessing BIOS on kafka-stretch2001 [14:04:19] (03CR) 10Volans: [C: 03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/867595 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [14:04:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kafka-stretch2001.codfw.wmnet with reason: Accessing BIOS on kafka-stretch2001 [14:04:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2a6ae5e9-eb12-4391-a29f-7bd50197f53a) set by btul... [14:04:38] (03CR) 10FNegri: [V: 03+2 C: 03+2] Add "snakeoil" private key cloud_cumin_master [labs/private] - 10https://gerrit.wikimedia.org/r/867595 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [14:04:51] Deployers, I can cover for duesen relating to the backport patch. [14:05:08] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [14:05:10] But I think he'll join us soon, so I'll be here for the mean time. 
[14:05:23] (03Merged) 10jenkins-bot: calico, cfssl-issuer: Remove chart defined dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/867529 (https://phabricator.wikimedia.org/T303279) (owner: 10JMeybohm) [14:06:28] I can’t deploy unfortunately, I’m in a meeting [14:09:11] * Reedy looks what is to deploy [14:09:50] (03CR) 10Volans: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [14:11:09] o/ [14:11:19] xSavitar: hey, I'm here now [14:11:20] sorry [14:11:56] duesen, okay! I was wondering if I can deploy your 2 patches and then you verify? I'm on the call [14:12:15] Couldn't do it since I was the only one around. [14:12:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Investigate disk errors on wcqs1003.eqiad.wmnet - https://phabricator.wikimedia.org/T323380 (10Jclark-ctr) @bking i need server depooled and shutdown for hardware testing. do you have time to assist today with that? [14:12:28] xSavitar: yes, ok. which call? [14:12:31] But seems Reedy is looking at the deployment calendar too. Maybe he wants to help [14:12:48] duesen the one on our calendar. I received an invite. [14:15:25] * xSavitar will go ahead and deploy now [14:17:05] !log derick@deploy1002 Backport cancelled. 
[14:19:20] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10fnegri) [14:21:28] (03CR) 10D3r1ck01: [C: 03+2] "scap backport" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867274 (https://phabricator.wikimedia.org/T246403) (owner: 10Arlolra) [14:21:39] (03CR) 10D3r1ck01: [C: 03+2] "scap backport" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867292 (owner: 10Arlolra) [14:22:13] (03CR) 10Herron: [C: 03+2] grafana: Explicitly set default theme to light [puppet] - 10https://gerrit.wikimedia.org/r/867527 (owner: 10Alexandros Kosiaris) [14:22:50] !log added smunene to pwstore [14:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by derick@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [14:25:12] (03PS3) 10D3r1ck01: hewiki: set VisualEditor to direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [14:25:21] (03CR) 10TrainBranchBot: "Approved by derick@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [14:25:36] (03PS4) 10Slyngshede: C:ldap::management Update add-ldap-groups [puppet] - 10https://gerrit.wikimedia.org/r/867594 [14:26:05] (03Merged) 10jenkins-bot: hewiki: set VisualEditor to direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [14:26:31] !log derick@deploy1002 Started scap: Backport for [[gerrit:866627|hewiki: set VisualEditor to direct mode (T320529)]] [14:26:35] T320529: Configure VE backend to use Parsoid directly, 
instead of calling RESTbase - https://phabricator.wikimedia.org/T320529 [14:26:56] (03PS1) 10Clément Goubert: Revert "eventgate-analytics: bump replicas from 20 to 30" [deployment-charts] - 10https://gerrit.wikimedia.org/r/867597 (https://phabricator.wikimedia.org/T324994) [14:27:18] (03CR) 10jenkins-bot: C:ldap::management Update add-ldap-groups [puppet] - 10https://gerrit.wikimedia.org/r/867594 (owner: 10Slyngshede) [14:28:21] !log derick@deploy1002 derick and daniel: Backport for [[gerrit:866627|hewiki: set VisualEditor to direct mode (T320529)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:29:05] (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add "mc2041 mc2042" to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867598 (https://phabricator.wikimedia.org/T293012) [14:29:07] duesen changes on the debug hosts now. [14:29:51] sorry, I hope that I did not arrive here too late [14:29:58] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/867237/ [14:30:33] (03PS1) 10Elukey: kserve-inference: fix dependencies in Chart.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/867600 (https://phabricator.wikimedia.org/T303279) [14:31:10] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on wcqs1003.eqiad.wmnet with reason: hardware diagnostics [14:31:21] (03PS2) 10Clément Goubert: Revert "eventgate-analytics: bump replicas from 20 to 30" [deployment-charts] - 10https://gerrit.wikimedia.org/r/867597 (https://phabricator.wikimedia.org/T324994) [14:31:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on wcqs1003.eqiad.wmnet with reason: hardware diagnostics [14:31:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Investigate disk errors on wcqs1003.eqiad.wmnet - https://phabricator.wikimedia.org/T323380 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence 
(ID=6ab1ce1c-f2e1-46e4-b6d5-7551d7cbe870) set by bking@cumin2002 for 2 days, 0:00:00 o... [14:32:30] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch2002.codfw.wmnet with OS bullseye [14:32:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch20... [14:32:41] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) [14:33:02] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: Add "mc2041 mc2042" to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867598 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli) [14:33:06] Func: xSavitar and duesen are deploying at the moment, maybe they can do your backport afterwards [14:33:24] (03PS2) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add "mc2041 mc2042" to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867598 (https://phabricator.wikimedia.org/T293012) [14:34:55] confirmed with duesen that the config patch works [14:35:01] syncing now... [14:36:26] (03Merged) 10jenkins-bot: Log linter data while parsing full pages [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867274 (https://phabricator.wikimedia.org/T246403) (owner: 10Arlolra) [14:36:32] (03Merged) 10jenkins-bot: Parsoid: Enable lint data and parser cache together [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867292 (owner: 10Arlolra) [14:37:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Investigate disk errors on wcqs1003.eqiad.wmnet - https://phabricator.wikimedia.org/T323380 (10bking) @Jclark-ctr No problem . 
wcqs1003 is now depooled and shut down. Reach out here or to inflatador (me) in IRC if you need anything else. Tha... [14:38:18] 10SRE-OnFire, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10Clement_Goubert) [14:39:21] (03PS5) 10Slyngshede: C:ldap::management Update add-ldap-groups [puppet] - 10https://gerrit.wikimedia.org/r/867594 [14:40:02] (03PS1) 10AikoChou: ml-services: remove translatewiki and frwikisource isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/867601 (https://phabricator.wikimedia.org/T324567) [14:40:46] 10SRE-OnFire, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10Clement_Goubert) Since this incident was caused by a temporary raise in logging volume, and our response was to scale up... [14:41:05] !log derick@deploy1002 Finished scap: Backport for [[gerrit:866627|hewiki: set VisualEditor to direct mode (T320529)]] (duration: 14m 34s) [14:41:09] T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529 [14:41:40] syncing done. will wait for the core backport to merge [14:43:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by derick@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867292 (owner: 10Arlolra) [14:44:50] (03Abandoned) 10Slyngshede: C:ldap::client::utils add missing directory. [puppet] - 10https://gerrit.wikimedia.org/r/867125 (owner: 10Slyngshede) [14:47:11] jeena: are you around? 
[14:47:16] 10SRE-OnFire, 10serviceops, 10Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10Clement_Goubert) [14:47:36] (03CR) 10David Caro: labstore: Send prom stats for getent_check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813898 (https://phabricator.wikimedia.org/T313444) (owner: 10David Caro) [14:47:43] jeena: ...we are trying to understand why scap backport is stuck on "collecting commits" [14:47:50] (03CR) 10Majavah: cumin::cloud_master: introduce new profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [14:49:41] 10SRE-OnFire, 10serviceops, 10Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10Clement_Goubert) p:05Triage→03Medium [14:50:48] !log derick@deploy1002 backport aborted: (duration: 07m 09s) [14:51:03] (03CR) 10BPirkle: [C: 03+1] "Looks good per synchronous discussion. Approved for self-merge." [deployment-charts] - 10https://gerrit.wikimedia.org/r/865683 (https://phabricator.wikimedia.org/T322152) (owner: 10Hnowlan) [14:51:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by derick@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867274 (https://phabricator.wikimedia.org/T246403) (owner: 10Arlolra) [14:52:11] !log derick@deploy1002 Started scap: Backport for [[gerrit:867274|Log linter data while parsing full pages (T246403)]] [14:52:14] Is it confused by already merged patches? [14:52:15] T246403: Lint error counts on "Page information" page do not update, even with null edit - https://phabricator.wikimedia.org/T246403 [14:52:23] Or just some odd edge case? [14:53:00] Reedy, we used the child patch (with both patches on the chain merged) and it works. 
[14:53:02] (03PS1) 10Jbond: cumin::master: WIP/PoC make profile a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/867602 [14:53:12] Still sounds like a bug to me :) [14:53:16] Yes [14:53:16] Reedy: well, we merged two patches, and tried to deploy the second (top) one. And it got stuck. [14:53:21] Worth filing in phab, as I doubt this is an uncommon use case [14:53:40] Reedy: then we tried to deploy the first one, and it correctly told us that there's another patch already merged, asked us to confirm, and proceeded. [14:53:57] !log derick@deploy1002 derick and arlolra: Backport for [[gerrit:867274|Log linter data while parsing full pages (T246403)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:53:57] (03PS2) 10Ssingh: hiera: unify eqsin LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/865799 (https://phabricator.wikimedia.org/T322048) [14:54:28] (03CR) 10Jcrespo: [C: 04-1] "Antoine, does the +1 mean that this is ready to be deployed, or just that it looks ok for when it is ready? 
It is unclear just with the vo" [puppet] - 10https://gerrit.wikimedia.org/r/867231 (owner: 10Jcrespo) [14:54:39] 10SRE-OnFire, 10Data-Engineering, 10serviceops, 10Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10Clement_Goubert) [14:54:47] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38756/console" [puppet] - 10https://gerrit.wikimedia.org/r/865799 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [14:55:03] 10SRE-OnFire, 10Data-Engineering, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10Clement_Goubert) [14:55:50] (03PS2) 10Ssingh: site.pp: update LVS hosts in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/866441 (https://phabricator.wikimedia.org/T317247) [14:55:52] 10SRE, 10Wikimedia-Etherpad, 10serviceops-collab: Upgrade etherpad.wikimedia.org to (more) recent Etherpad version with more rich end-user features - https://phabricator.wikimedia.org/T316421 (10jijiki) [14:56:36] syncing the backports now [14:56:52] (03CR) 10Roman Stolar: [C: 03+1] Add ability to specify a DPI value for PDF [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T256959) (owner: 10Vlad.shapik) [14:58:48] Reedy, will do [14:59:01] 10SRE-OnFire, 10Data-Engineering, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10Ottomata) > reverting to the state before the incident Hm, do we need to revert? I don't mind eith... [14:59:40] (03CR) 10Jbond: [C: 04-1] "-1 for the template vs source otherwise lgtm. 
however it my be nice to make profile::cumin::master a bit more DRY i did a quick PoC[1] but" [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:00:32] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10serviceops-collab: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (10jijiki) [15:00:54] Func, we're already at time but do you want me to deploy your patch? [15:01:04] ok [15:01:08] The backports of Daniel's is syncing now [15:01:11] So in a few mins [15:02:11] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:02:39] (03CR) 10Jbond: [C: 03+1] "might be usefull to have this one explicitly depend on https://gerrit.wikimedia.org/r/c/operations/puppet/+/867551?" [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:02:40] !log derick@deploy1002 Finished scap: Backport for [[gerrit:867274|Log linter data while parsing full pages (T246403)]] (duration: 10m 28s) [15:02:44] T246403: Lint error counts on "Page information" page do not update, even with null edit - https://phabricator.wikimedia.org/T246403 [15:03:15] Func, on your patch now [15:03:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by derick@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867237 (https://phabricator.wikimedia.org/T228431) (owner: 10Func) [15:03:25] thanks [15:05:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, but let's wait with merging until David or someone else from WMCS had a chance to review." 
[puppet] - 10https://gerrit.wikimedia.org/r/867594 (owner: 10Slyngshede) [15:06:46] 10SRE-OnFire, 10Data-Engineering, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10Clement_Goubert) >>! In T324994#8463585, @Ottomata wrote: >> reverting to the state before the inci... [15:07:03] (03PS1) 10Bking: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/867646 (https://phabricator.wikimedia.org/T323096) [15:07:13] (03CR) 10CI reject: [V: 04-1] Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/867646 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [15:07:21] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10MoritzMuehlenhoff) Happy to provide input, but the task description could use a little more context :-) [15:07:27] (03CR) 10Hnowlan: [C: 03+2] api-gateway: add restbase routing, enable in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/865683 (https://phabricator.wikimedia.org/T322152) (owner: 10Hnowlan) [15:08:20] Func, we're waiting for the patch to merge now. 
will take a little while ;) [15:08:36] ack [15:11:35] (03PS2) 10Jbond: cumin::master: WIP/PoC make profile a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/867602 [15:11:55] (03CR) 10CI reject: [V: 04-1] cumin::master: WIP/PoC make profile a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/867602 (owner: 10Jbond) [15:12:26] (03Merged) 10jenkins-bot: api-gateway: add restbase routing, enable in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/865683 (https://phabricator.wikimedia.org/T322152) (owner: 10Hnowlan) [15:13:21] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38757/console" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [15:16:08] (03CR) 10Majavah: [V: 03+1 C: 04-1] cloudlb: introduce role skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [15:16:09] Func, zuul says ~2 mins left [15:16:25] (03PS2) 10Bking: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/867646 (https://phabricator.wikimedia.org/T323096) [15:17:16] (03CR) 10Muehlenhoff: cumin::master: WIP/PoC make profile a bit more DRY (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867602 (owner: 10Jbond) [15:18:24] (03CR) 10DCausse: [C: 03+1] Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/867646 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [15:18:27] (03Merged) 10jenkins-bot: RangeChronologicalPager: Restore the compatibility with derived classes [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867237 (https://phabricator.wikimedia.org/T228431) (owner: 10Func) [15:18:53] !log derick@deploy1002 Started scap: Backport for [[gerrit:867237|RangeChronologicalPager: Restore 
the compatibility with derived classes (T228431 T325034)]] [15:18:59] T228431: On Special:Contributions, choosing to display a large number of results (e.g. 500) ignores the chosen time frame - https://phabricator.wikimedia.org/T228431 [15:18:59] T325034: Removal of RangeChronologicalPager::rangeConds breaks Special:CheckUser's period selection and causes CheckUser pipeline failures - https://phabricator.wikimedia.org/T325034 [15:19:01] (03CR) 10Klausman: [C: 03+1] kserve-inference: fix dependencies in Chart.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/867600 (https://phabricator.wikimedia.org/T303279) (owner: 10Elukey) [15:19:38] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [15:19:46] (03CR) 10Jbond: [C: 03+1] "lgtm few optional nits" [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [15:19:54] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [15:20:38] !log derick@deploy1002 derick and func: Backport for [[gerrit:867237|RangeChronologicalPager: Restore the compatibility with derived classes (T228431 T325034)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [15:20:55] Func, your patch is on the debug hosts now, you can try with: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [15:21:07] I don't have sufficient permission to test check user, but reviewers have confirmed it works. [15:21:20] so should I go ahead and sync? [15:21:36] I think you can [15:21:40] Okay! 
[15:21:53] Syncing now [15:23:45] (03CR) 10Hnowlan: [C: 03+2] conftool: add kubernetes nodes as thumbor nodes [puppet] - 10https://gerrit.wikimedia.org/r/866445 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [15:24:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867646 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [15:24:37] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:29] 10SRE, 10Infrastructure-Foundations, 10Sustainability (Incident Followup): A puppet run should not start if a box is under abnormal load. - https://phabricator.wikimedia.org/T84183 (10LSobanski) [15:26:55] (03PS1) 10Hashar: wm-checks-api: show processor prototype name on error [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/867648 [15:27:33] (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2040 & mc2041 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867649 (https://phabricator.wikimedia.org/T293012) [15:27:53] !log derick@deploy1002 Finished scap: Backport for [[gerrit:867237|RangeChronologicalPager: Restore the compatibility with derived classes (T228431 T325034)]] (duration: 08m 59s) [15:27:58] T228431: On Special:Contributions, choosing to display a large number of results (e.g. 
500) ignores the chosen time frame - https://phabricator.wikimedia.org/T228431 [15:27:58] T325034: Removal of RangeChronologicalPager::rangeConds breaks Special:CheckUser's period selection and causes CheckUser pipeline failures - https://phabricator.wikimedia.org/T325034 [15:28:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10BTullis) I've tried a reinstall of kafka-stretch2002 with slightly different RAID controller settings, but that di... [15:28:10] 10SRE, 10serviceops: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (10LSobanski) [15:28:32] Func, patch is live now. [15:28:34] 10SRE-Access-Requests: Grant ssh access to analytics-admins to mnz - https://phabricator.wikimedia.org/T325072 (10Milimetric) [15:28:43] I think we can call this window a close :) [15:28:48] xSavitar: Thank you! [15:28:55] Func, you're welcome! [15:29:40] thank you for the fix Func [15:30:02] thanks for taking care of the deployments xSavitar! [15:30:10] \o/ [15:31:34] MatmaRex, sorry for my thoughtless at first patch [15:32:00] (03PS6) 10Muehlenhoff: Add initial support for webauthn [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) [15:32:02] heh, no problem, i should have reviewed more carefully [15:32:52] 10SRE, 10Infrastructure-Foundations: upgrade ping offload servers to bullseye (was: ping servers running out of disk) - https://phabricator.wikimedia.org/T273509 (10LSobanski) Looks like the disk space problem was addressed in T295767 but the upgrade to Bullseye is still pending. 
[15:33:13] (03PS2) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2040 & mc2041 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867649 (https://phabricator.wikimedia.org/T293012) [15:34:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [15:34:51] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: Add mc2040 & mc2041 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867649 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli) [15:34:57] 10SRE, 10Gerrit: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631 (10LSobanski) @Dzahn other than the outdated host names, is this task still relevant? [15:35:16] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 20804 [15:35:40] (03CR) 10Roman Stolar: [C: 03+1] Add ability to specify a DPI value for PDF [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T256959) (owner: 10Vlad.shapik) [15:36:37] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/861806/38755/" [puppet] - 10https://gerrit.wikimedia.org/r/861806 (https://phabricator.wikimedia.org/T277183) (owner: 10Effie Mouzeli) [15:36:44] 10SRE, 10SRE-Access-Requests: Grant ssh access to analytics-admins to mnz - https://phabricator.wikimedia.org/T325072 (10Miriam) Approved on my end, thank you! 
[15:37:01] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [15:40:07] (03CR) 10Volans: "addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:40:20] (03PS3) 10Volans: cumin::cloud_master: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) [15:42:01] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:08] (03CR) 10Vgutierrez: [C: 03+1] hiera: unify eqsin LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/865799 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [15:44:01] (03CR) 10Volans: "Replies inline" [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:44:37] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: unify eqsin LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/865799 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [15:45:56] (03PS3) 10Bking: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/867646 (https://phabricator.wikimedia.org/T323096) [15:47:42] 10SRE, 10Infrastructure-Foundations: upgrade ping offload servers to bullseye (was: ping servers running out of disk) - https://phabricator.wikimedia.org/T273509 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [15:49:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867646 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [15:51:50] !log hnowlan@puppetmaster1001 
conftool action : set/weight=2:pooled=yes; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet [15:54:11] !log hnowlan@puppetmaster1001 conftool action : set/weight=2:pooled=yes; selector: service=thumbor,name=kubernetes101[234].eqiad.wmnet [15:58:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 20804 [16:01:58] !log hnowlan@puppetmaster1001 conftool action : set/weight=4:pooled=yes; selector: service=thumbor,name=kubernetes101[1234].eqiad.wmnet [16:02:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Ottomata) @BTullis, I think you meant to post these comments on {T314156}? :) [16:02:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10BTullis) I think it might be OK on kafka-stretch2002 now. It's successfully run the installer and booted. I've run... 
[16:03:32] !log installing ruby-tzinfo security updates [16:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:30] (03PS2) 10AikoChou: ml-services: remove translatewiki and frwikisource isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/867601 (https://phabricator.wikimedia.org/T324567) [16:05:35] (03PS41) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [16:05:37] (03PS4) 10David Caro: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 [16:05:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10BTullis) @Ottomata - Many thanks. [16:05:58] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=kubernetes101[01234].eqiad.wmnet [16:06:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:07:20] ack, looking [16:07:37] possibly related to depooling k8s nodes [16:08:07] (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [16:11:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:47] (03CR) 
10Herron: "Good feedback here thanks! please see comments inline" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [16:12:22] (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2038 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867653 (https://phabricator.wikimedia.org/T293012) [16:12:34] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-stretch2002.codfw.wmnet with OS bullseye [16:12:39] (03CR) 10Dzahn: [V: 03+1] scap: add stanza for jenkins deploy, new keyholder identity (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867294 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [16:12:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch2002.c... [16:12:44] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/863255 (owner: 10Muehlenhoff) [16:12:49] (03PS2) 10Dzahn: scap: add stanzas for jenkins-ci and jenkins-releases deploy [puppet] - 10https://gerrit.wikimedia.org/r/867294 (https://phabricator.wikimedia.org/T324014) [16:14:10] (03PS2) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2038 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867653 (https://phabricator.wikimedia.org/T293012) [16:14:53] 10SRE, 10Gerrit: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631 (10Dzahn) @LSobanski good find! Hmm.. probably not but maybe let's talk about it for a minute before declining it. 
[16:15:10] 10SRE, 10Gerrit, 10serviceops-collab: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631 (10Dzahn) [16:15:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10BTullis) I tried a reinstall of kafka-stretch2002 with slightly different RAID controller settings, but that didn't work either. This is capture... [16:15:25] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: Add mc2038 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867653 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli) [16:15:45] !log hnowlan@puppetmaster1001 conftool action : set/weight=10:pooled=yes; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet [16:17:20] !log hnowlan@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet [16:21:02] (03CR) 10Dzahn: "Jaime, fair comment but no worries. Either way I will merge this after decom is done and the +1 I got on https://gerrit.wikimedia.org/r/c/" [puppet] - 10https://gerrit.wikimedia.org/r/867231 (owner: 10Jcrespo) [16:21:07] 10SRE-OnFire, 10Data-Engineering, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10Ladsgroup) Hi, The flood of logs is still incoming, the revert of logspam has not been deployed yet... 
[16:23:31] (03PS1) 10Isabelle Hurbain-Palatin: Adding ihurbain to parsoid-test-root [puppet] - 10https://gerrit.wikimedia.org/r/867657 [16:23:35] (03CR) 10Eevans: Configure new cassandra-dev cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans) [16:25:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Investigate disk errors on wcqs1003.eqiad.wmnet - https://phabricator.wikimedia.org/T323380 (10Jclark-ctr) Reseated hard drive and performed hardware test, which showed no errors at this time. Resubmitted TSR report to Dell [16:25:34] (03CR) 10Arlolra: Adding ihurbain to parsoid-test-root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867657 (owner: 10Isabelle Hurbain-Palatin) [16:27:34] (03PS2) 10Isabelle Hurbain-Palatin: Adding ihurbain to parsoid-test-roots [puppet] - 10https://gerrit.wikimedia.org/r/867657 [16:28:19] (03CR) 10Arlolra: [C: 03+1] Adding ihurbain to parsoid-test-roots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867657 (owner: 10Isabelle Hurbain-Palatin) [16:31:13] (03CR) 10Dzahn: [C: 03+1] "this all looks good to me and I realize this might seem trivial but I'm afraid I still should not just merge admin data changes without a " [puppet] - 10https://gerrit.wikimedia.org/r/867657 (owner: 10Isabelle Hurbain-Palatin) [16:33:30] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10fnegri) It's definitely a bit vague at the moment :) I just wanted to split it from the main task, as it can be discussed and addressed... 
[16:35:02] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch1001.eqiad.wmnet with OS bullseye [16:35:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch10... [16:35:10] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-stretch1001.eqiad.wmnet with OS bullseye [16:35:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1001.e... [16:36:30] 10SRE-Access-Requests: Add ihurbain to parsoid-test-roots - https://phabricator.wikimedia.org/T325080 (10Dzahn) [16:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:36:53] (03PS3) 10Dzahn: Adding ihurbain to parsoid-test-roots [puppet] - 10https://gerrit.wikimedia.org/r/867657 (https://phabricator.wikimedia.org/T325080) (owner: 10Isabelle Hurbain-Palatin) [16:42:11] (03CR) 10Dzahn: "I made and linked https://phabricator.wikimedia.org/T325080 for this" [puppet] - 10https://gerrit.wikimedia.org/r/867657 (https://phabricator.wikimedia.org/T325080) (owner: 10Isabelle Hurbain-Palatin) [16:44:20] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404): /api (Zotero and citoid alive) is CRITICAL: 
Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [16:44:37] 10SRE-Access-Requests, 10Patch-For-Review: Add ihurbain to parsoid-test-roots - https://phabricator.wikimedia.org/T325080 (10Dzahn) a:03akosiaris [16:45:46] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:47:30] (03PS1) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) [16:47:50] (03CR) 10CI reject: [V: 04-1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: 10Aqu) [16:49:25] (03PS2) 10Dzahn: rm hieradata/hosts/contint1001.yaml [puppet] - 10https://gerrit.wikimedia.org/r/867286 (https://phabricator.wikimedia.org/T324698) [16:49:27] (03PS3) 10Dzahn: scap: add stanza for jenkins deploy, new keyholder identity [puppet] - 10https://gerrit.wikimedia.org/r/867294 (https://phabricator.wikimedia.org/T324014) [16:49:29] (03PS1) 10Dzahn: scap: add contint2002 to ci-docroot, jenkins, zuul deploy [puppet] - 10https://gerrit.wikimedia.org/r/867670 (https://phabricator.wikimedia.org/T324659) [16:50:21] (03PS2) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) [16:50:47] (03CR) 10CI reject: [V: 04-1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: 10Aqu) [16:51:09] (03CR) 10Dzahn: "Not entirely sure about the order of things yet. But we will want to move this host into production and start deploying to it.. at least a" [puppet] - 10https://gerrit.wikimedia.org/r/867670 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [16:51:29] (03CR) 10Dzahn: "accidentally rebased.. 
the PS before this was what I wanted" [puppet] - 10https://gerrit.wikimedia.org/r/867294 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [16:52:04] (03CR) 10Ottomata: Backing up HDFS FSImage to HDFS on Monday morning (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [16:53:10] jouncebot: nowandnext [16:53:11] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [16:53:11] In 0 hour(s) and 6 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T1700) [16:53:11] (03PS4) 10Dzahn: scap: add stanza for jenkins-ci and jenkins-releases deploy [puppet] - 10https://gerrit.wikimedia.org/r/867294 (https://phabricator.wikimedia.org/T324014) [16:55:30] (03PS1) 10Dzahn: site: add contint2002 to ci::master role [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) [16:55:36] (03CR) 10Jforrester: [C: 03+1] "Good to deploy whenever." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866574 (https://phabricator.wikimedia.org/T324808) (owner: 10Reedy) [17:00:02] (03PS1) 10Dzahn: cloud: allow VMs to connect to contint1002 and contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/867675 [17:00:04] jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T1700). [17:00:04] Lucas_WMDE: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:13] o/ [17:00:18] Lucas_WMDE: hello! looking [17:00:27] thanks! 
[17:00:43] * jbond is here if needed [17:01:09] jbond: actually, I have to take off in a few minutes, if you wouldn't mind doing this one that'd be great [17:01:16] just in case anything goes sideways [17:01:44] sure no probs [17:01:48] looking now [17:01:51] thanks <3 [17:01:55] np [17:02:53] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10BTullis) Hello, just FYI I reimaged kafka-stretch2002 because the `/dev/sda` and `/dev/sdb` were the wrong way around. {F35861835,width=60%} I'v... [17:03:30] (03CR) 10Isabelle Hurbain-Palatin: Adding ihurbain to parsoid-test-roots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867657 (https://phabricator.wikimedia.org/T325080) (owner: 10Isabelle Hurbain-Palatin) [17:03:53] (03CR) 10Jbond: [C: 03+2] query_service: support downloads in query builder [puppet] - 10https://gerrit.wikimedia.org/r/867142 (https://phabricator.wikimedia.org/T323451) (owner: 10Lucas Werkmeister (WMDE)) [17:05:15] Lucas_WMDE: that's been merged, would you like me to deploy it anywhere? [17:06:02] ahh never mind, I see it's only on two servers, will deploy now [17:06:11] I’m not sure where the microsites are served from [17:06:21] otherwise it would’ve been deployed within half an hour, right? [17:06:24] (03PS42) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:06:26] (03PS5) 10David Caro: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 [17:06:33] (if I remember the puppet run interval correctly) [17:06:45] Lucas_WMDE: yes that's right [17:06:47] ok [17:06:54] I’m ready to test once it’s deployed :) [17:06:56] fyi they are served from miscweb2002.codfw.wmnet, miscweb1002.eqiad.wmnet [17:07:00] I'm running puppet there now [17:07:05] ok, thanks! 
[17:07:25] ok should be done, please ping me if you see any issues [17:07:28] Lucas_WMDE: ^^ [17:07:33] checking [17:07:50] yay, it works now! [17:07:53] thanks a lot jbond [17:08:13] awesome and no probs [17:08:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10BTullis) I should have written the comment above on the kafka-stretch ticket for codfw(T314160) despite the fact that it was resolved.. [17:09:11] (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:15:27] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:15:31] (03PS43) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:15:33] (03PS6) 10David Caro: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 [17:17:58] (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:20:13] 10SRE, 10CX-cxserver, 10Language-Team (Language-2022-October-December): cxserver: Update Flores/NLLB-200 MT secret in Production - https://phabricator.wikimedia.org/T324534 (10akosiaris) 05Open→03Resolved a:03akosiaris Done by Amir on `Date: Tue Dec 13 06:35:24 2022 +0000`. Resolving per last task mo... 
[17:20:17] (03PS1) 10Ladsgroup: ParserCache: fix metrics keys [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867608 [17:20:30] (03PS1) 10Ladsgroup: ParserCache: fix metrics keys [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867609 [17:20:52] jouncebot: nowandnext [17:20:52] For the next 0 hour(s) and 39 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T1700) [17:20:53] In 1 hour(s) and 39 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T1900) [17:21:02] (03CR) 10Ladsgroup: [C: 03+2] ParserCache: fix metrics keys [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867608 (owner: 10Ladsgroup) [17:21:05] (03CR) 10Ladsgroup: [C: 03+2] ParserCache: fix metrics keys [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867609 (owner: 10Ladsgroup) [17:21:09] (03PS44) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:21:13] (03PS7) 10David Caro: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 [17:22:16] (03PS45) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:22:18] (03PS8) 10David Caro: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 [17:22:23] !log btullis@install1003:/etc/dhcp/automation/ttyS1-115200$ sudo systemctl restart isc-dhcp-server.service T314156 [17:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:27] T314156: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 [17:22:29] (03CR) 10David Caro: wmcs: changes to 
api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:22:45] (03PS1) 10Ladsgroup: Don't write to parser cache from maintenance script [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867610 [17:22:56] (03PS1) 10Ladsgroup: Don't write to parser cache from maintenance script [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867611 [17:23:58] 10SRE, 10Gerrit, 10serviceops-collab: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631 (10LSobanski) 05Open→03Declined [17:25:29] (03CR) 10Ladsgroup: [C: 03+2] Don't write to parser cache from maintenance script [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867610 (owner: 10Ladsgroup) [17:25:33] (03CR) 10Ladsgroup: [C: 03+2] Don't write to parser cache from maintenance script [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867611 (owner: 10Ladsgroup) [17:25:35] (03CR) 10David Caro: "Finally passed the tests xd, took more than expected." [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:26:34] !log edited automation/proxies/ttyS1-115200.conf to remove `include "/etc/dhcp/automation/ttyS1-115200/kafka-stretch1001.conf";`and restarted isc-dhc-server [17:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:02] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Andrew) We don't have a great way to safely downtime this box at the moment. We're in the process of moving load off of it entirely but that won't be complete until January at the earliest. Can we live with this in its precarious state... 
[17:27:34] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch1001.eqiad.wmnet with OS bullseye [17:27:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye [17:29:32] (03PS16) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [17:33:42] (03CR) 10CI reject: [V: 04-1] flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:33:46] (03CR) 10CI reject: [V: 04-1] Don't write to parser cache from maintenance script [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867610 (owner: 10Ladsgroup) [17:37:27] (03Merged) 10jenkins-bot: ParserCache: fix metrics keys [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867608 (owner: 10Ladsgroup) [17:40:04] (03PS1) 10Hnowlan: kubernetes: add thumbor to lvs pools for workers [puppet] - 10https://gerrit.wikimedia.org/r/867681 (https://phabricator.wikimedia.org/T233196) [17:43:35] (03CR) 10Giuseppe Lavagetto: [C: 03+1] kubernetes: add thumbor to lvs pools for workers [puppet] - 10https://gerrit.wikimedia.org/r/867681 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [17:43:38] 10SRE-OnFire, 10Data-Engineering, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10Clement_Goubert) Yes, the commit message of the above changelog makes it very clear it is not to be... 
[17:44:58] (03PS2) 10Hnowlan: kubernetes: add thumbor to lvs pools for workers [puppet] - 10https://gerrit.wikimedia.org/r/867681 (https://phabricator.wikimedia.org/T233196) [17:46:10] (03CR) 10Ladsgroup: [C: 03+2] "again" [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867610 (owner: 10Ladsgroup) [17:46:31] jouncebot: nowandnext [17:46:32] For the next 0 hour(s) and 13 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T1700) [17:46:32] In 1 hour(s) and 13 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T1900) [17:49:31] (03PS1) 10Ladsgroup: Fix brittle test [extensions/PageImages] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867613 [17:49:41] (03CR) 10Ladsgroup: [C: 03+2] Fix brittle test [extensions/PageImages] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867613 (owner: 10Ladsgroup) [17:51:00] (03CR) 10Hnowlan: [C: 03+2] kubernetes: add thumbor to lvs pools for workers [puppet] - 10https://gerrit.wikimedia.org/r/867681 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [17:52:11] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10EWilfong_WMF) Hi all, Noting that I added a [[ https://phabricator.wikimedia.org/T316573#8461381 | comment with an action request to make... 
[17:58:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867611 (owner: 10Ladsgroup) [17:58:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867610 (owner: 10Ladsgroup) [17:58:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867608 (owner: 10Ladsgroup) [17:58:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867609 (owner: 10Ladsgroup) [18:01:53] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10colewhite) [18:02:03] duesen: sorry I just saw your message. Is there a phab task for the problem you encountered? 
[18:03:26] lovely the backport jenkins job is stuck [18:03:54] (03PS17) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [18:06:52] (03CR) 10CI reject: [V: 04-1] ParserCache: fix metrics keys [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867609 (owner: 10Ladsgroup) [18:07:14] (03CR) 10Ladsgroup: [C: 03+2] "sigh" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867609 (owner: 10Ladsgroup) [18:08:23] (03CR) 10CI reject: [V: 04-1] flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:10:25] (03PS2) 10DLynch: Deployment of DiscussionTools reply visual enhancements for more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867311 (https://phabricator.wikimedia.org/T323537) [18:10:53] (03PS3) 10DLynch: Deployment of DiscussionTools reply visual enhancements for more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867311 (https://phabricator.wikimedia.org/T323537) [18:11:10] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10DBu-WMF) @Vgutierrez is there anything left to do so that we can move forward on this task? Please let me know as we are trying to get thi... 
[18:13:20] (03Merged) 10jenkins-bot: Don't write to parser cache from maintenance script [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867610 (owner: 10Ladsgroup) [18:13:22] (03Merged) 10jenkins-bot: Don't write to parser cache from maintenance script [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867611 (owner: 10Ladsgroup) [18:14:27] (03PS18) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [18:18:56] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10DBu-WMF) @KOfori can you inform me as to where this falls in the list of your teams priorities? Any ETA on completion? I know this task, in... [18:21:31] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-stretch1001.eqiad.wmnet with OS bullseye [18:21:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye ex... 
[18:22:03] (03Merged) 10jenkins-bot: Fix brittle test [extensions/PageImages] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867613 (owner: 10Ladsgroup) [18:30:22] (03PS1) 10Cwhite: logstash: reformat python escape sequences before parsing k8s logs [puppet] - 10https://gerrit.wikimedia.org/r/867629 (https://phabricator.wikimedia.org/T325085) [18:36:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867610 (owner: 10Ladsgroup) [18:36:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867608 (owner: 10Ladsgroup) [18:38:21] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:867610|Don't write to parser cache from maintenance script]], [[gerrit:867608|ParserCache: fix metrics keys]] [18:40:20] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:867610|Don't write to parser cache from maintenance script]], [[gerrit:867608|ParserCache: fix metrics keys]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [18:47:47] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:867610|Don't write to parser cache from maintenance script]], [[gerrit:867608|ParserCache: fix metrics keys]] (duration: 09m 25s) [18:48:33] jeena: yes, xSavitar filed one: https://phabricator.wikimedia.org/T325074 [18:50:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867609 (owner: 10Ladsgroup) [18:50:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867611 (owner: 10Ladsgroup) [18:50:56] (03CR) 10TrainBranchBot: [C: 03+2] 
"Approved by ladsgroup@deploy1002 using scap backport" [extensions/PageImages] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867613 (owner: 10Ladsgroup) [18:51:29] !log decom'ing contint1001 (formerly prod CI) server, replaced by contint1002 T324698 [18:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:33] T324698: contint1001 hardware failures (remove contint1001 from production) - https://phabricator.wikimedia.org/T324698 [18:51:50] (03Merged) 10jenkins-bot: ParserCache: fix metrics keys [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867609 (owner: 10Ladsgroup) [18:51:52] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts contint1001.wikimedia.org [18:52:19] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:867609|ParserCache: fix metrics keys]], [[gerrit:867611|Don't write to parser cache from maintenance script]], [[gerrit:867613|Fix brittle test]] [18:54:05] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:867609|ParserCache: fix metrics keys]], [[gerrit:867611|Don't write to parser cache from maintenance script]], [[gerrit:867613|Fix brittle test]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [18:55:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10BTullis) OK, I cleaned up the failed bit of DHCP automation that was causing the cookbook to fail on kafka-stretch2001. Now we're back to the si... 
[18:57:19] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:59:43] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: contint1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002" [19:00:05] hashar and ^demon: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T1900) [19:00:12] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:867609|ParserCache: fix metrics keys]], [[gerrit:867611|Don't write to parser cache from maintenance script]], [[gerrit:867613|Fix brittle test]] (duration: 07m 53s) [19:14:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Dzahn) the state of parse1002 was manually changed in netbox from "active" to "failed" but there was no sync / cookbook run. This meant at next unrelated deco... [19:16:19] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:17:53] 10SRE, 10ops-eqiad, 10DC-Ops: Broken disk on ganeti1011 - https://phabricator.wikimedia.org/T301240 (10Dzahn) The state of ganeti1011 was manually changed from "active" to "failed" in netbox but there was no netbox data sync after that. That meant at next decom cookbook run on unrelated hosts we got unexpec... 
[19:18:03] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: contint1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002" [19:18:03] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:18:04] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts contint1001.wikimedia.org [19:20:19] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:25:18] (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:26:19] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:29:54] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [19:31:19] (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:36:44] (03CR) 10Subramanya Sastry: [C: 03+1] Adding ihurbain to parsoid-test-roots [puppet] - 10https://gerrit.wikimedia.org/r/867657 (https://phabricator.wikimedia.org/T325080) (owner: 10Isabelle Hurbain-Palatin) [19:43:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add ihurbain to parsoid-test-roots - https://phabricator.wikimedia.org/T325080 (10ssastry) In the puppet repo, I am tagged as the approver from when I was the manager. But, I approve, if that counts! :) I am not the manager anymore, so added @dr0ptp4kt to ap... [19:49:57] 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Planning, 10LDAP-Access-Requests, 10WMF-Communications: Grant Access to 'wmf' LDAP group for 'Sbenchagra' - https://phabricator.wikimedia.org/T324696 (10Varnent) [19:50:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Cmjohnson) @btullis yes, if you want to recreate the raid manually then please do. 
[19:52:46] (03CR) 10Dzahn: [C: 03+2] rm hieradata/hosts/contint1001.yaml [puppet] - 10https://gerrit.wikimedia.org/r/867286 (https://phabricator.wikimedia.org/T324698) (owner: 10Dzahn) [19:54:25] (03CR) 10Dzahn: [C: 03+2] "decom cookbook has run now - host is gone for real" [puppet] - 10https://gerrit.wikimedia.org/r/867231 (owner: 10Jcrespo) [19:55:13] (03CR) 10Dzahn: [C: 03+2] Revert "bacula: Ignore the backup check of contint1001 jobs" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867231 (owner: 10Jcrespo) [19:56:58] (03PS1) 10Ryan Kemper: wdqs: fix request request error ratio sli pane [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/867695 (https://phabricator.wikimedia.org/T323064) [19:58:02] (03PS1) 10Dzahn: site: remove contint1001 after decom [puppet] - 10https://gerrit.wikimedia.org/r/867696 (https://phabricator.wikimedia.org/T324698) [19:58:49] (03CR) 10Dzahn: [C: 03+2] site: remove contint1001 after decom [puppet] - 10https://gerrit.wikimedia.org/r/867696 (https://phabricator.wikimedia.org/T324698) (owner: 10Dzahn) [20:01:12] (03CR) 10Gehel: [C: 03+1] "This makes sense to me (but I'm far from an expert)" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/867695 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [20:02:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:03:57] (03PS8) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [20:04:48] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [20:06:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add ihurbain to parsoid-test-roots - https://phabricator.wikimedia.org/T325080 (10dr0ptp4kt) Approved [20:07:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:10:12] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [20:10:16] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [20:10:51] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [20:11:38] (03CR) 10Dzahn: "note there is also a file like this for eqiad, hieradata/eqiad/profile/openstack/eqiad1/cloudgw.yaml but this did not pop up for me when s" [puppet] - 10https://gerrit.wikimedia.org/r/867675 (owner: 10Dzahn) [20:12:46] (03PS2) 10Dzahn: cloud: allow VMs to connect to contint1002 and contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/867675 
(https://phabricator.wikimedia.org/T313832) [20:12:57] (03CR) 10CI reject: [V: 04-1] cloud: allow VMs to connect to contint1002 and contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/867675 (https://phabricator.wikimedia.org/T313832) (owner: 10Dzahn) [20:15:14] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:15:22] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:16:00] ryankemper: I supposed that the alert is related to the failed cookbook above? [20:16:27] gehel: yes, I'll set some downtime [20:16:59] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on relforge[1003-1004].eqiad.wmnet with reason: Rolling restart [20:17:13] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on relforge[1003-1004].eqiad.wmnet with reason: Rolling restart [20:17:30] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [20:17:34] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [20:18:48] (03PS3) 10Dzahn: cloud: allow VMs to connect to contint1002 and contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/867675 (https://phabricator.wikimedia.org/T313832) [20:19:12] (03CR) 10Andrew Bogott: [C: 03+1] "nit: the dumps servers are now called 'clouddumps100x'" [puppet] - 
10https://gerrit.wikimedia.org/r/867646 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [20:20:02] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [20:20:38] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 159, active_shards: 318, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:20:48] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:23:22] (03CR) 10Raymond Ndibe: [C: 03+2] webservice cli: allow for deployment of custom harbor images [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [20:24:12] (03Merged) 10jenkins-bot: webservice cli: allow for deployment of custom harbor images [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 
(https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [20:33:23] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) firing: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [20:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [20:38:00] (03PS2) 10Ryan Kemper: wdqs: fix request request error ratio sli pane [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/867695 (https://phabricator.wikimedia.org/T323064) [20:38:23] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) resolved: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [20:41:40] (03PS1) 10Dzahn: ci: add contint2002 to firewall, jenkins and zuul-merger [puppet] - 10https://gerrit.wikimedia.org/r/867703 (https://phabricator.wikimedia.org/T324659) [20:44:41] (03PS1) 10Dzahn: ci/zuul: switch gearman server from contint2001 to contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) [20:45:12] (03CR) 10Dzahn: "For later when we are ready for this, of course." 
[puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [20:45:21] (03PS1) 10JHathaway: Add Sondes Ben Chagra to wmf group [puppet] - 10https://gerrit.wikimedia.org/r/867706 (https://phabricator.wikimedia.org/T324696) [20:46:19] 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Planning, 10LDAP-Access-Requests, and 2 others: Grant Access to 'wmf' LDAP group for 'Sbenchagra' - https://phabricator.wikimedia.org/T324696 (10jhathaway) added! [20:48:13] (03PS15) 10Aqu: Backing up HDFS FSImage to HDFS on Monday morning [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) [20:49:06] (03PS16) 10Aqu: Backing up HDFS FSImage to HDFS on Monday morning [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) [20:49:21] (03PS1) 10Effie Mouzeli: mediawiki-common: Replace redis_session servers with rdb* [deployment-charts] - 10https://gerrit.wikimedia.org/r/867707 (https://phabricator.wikimedia.org/T267581) [20:49:33] (03PS1) 10Dzahn: docker_registry_ha: add contint2002 to image builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/867708 (https://phabricator.wikimedia.org/T324659) [20:51:46] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38758/console" [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [20:52:40] (03PS1) 10Andrew Bogott: Fix rsyslogd $cert_file when using acme certs [puppet] - 10https://gerrit.wikimedia.org/r/867709 (https://phabricator.wikimedia.org/T127717) [20:52:59] (03CR) 10CI reject: [V: 04-1] Fix rsyslogd $cert_file when using acme certs [puppet] - 10https://gerrit.wikimedia.org/r/867709 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [20:54:09] (03PS1) 10Dzahn: ci: add contint2002 to zuul_merger firewall, ferm_srange [puppet] - 
10https://gerrit.wikimedia.org/r/867710 (https://phabricator.wikimedia.org/T324659) [20:54:53] (03PS17) 10Aqu: Backing up HDFS FSImage to HDFS on Monday morning [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) [20:55:22] (03PS2) 10Zabe: Start writing to cul_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863228 (https://phabricator.wikimedia.org/T233004) [20:55:36] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) one topic branch for an overview of these changes: https://gerrit.wikimedia.org/r/q/topic:contint2002 [20:55:42] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) [20:56:10] (03PS2) 10Andrew Bogott: rsyslog::receiver: fix cert_file when used with acme certs [puppet] - 10https://gerrit.wikimedia.org/r/867709 (https://phabricator.wikimedia.org/T127717) [20:58:48] (03PS1) 10Dzahn: ci: add contint2002 as an migration rsync source host [puppet] - 10https://gerrit.wikimedia.org/r/867711 (https://phabricator.wikimedia.org/T324659) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221213T2100). [21:00:05] jan_drewniak and zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:37] * TheresNoTime can deploy [21:00:46] O/ I’m a few minutes late… [21:00:58] (03PS1) 10Dzahn: ci: make contint2002 the new rsync source, remove contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/867712 [21:01:01] (03PS18) 10Aqu: Backing up HDFS FSImage to HDFS on Monday morning [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) [21:01:06] jouncebot: right on time :) [21:01:09] oops [21:01:12] jan_drewniak: ^ [21:01:15] hey [21:01:49] (03PS1) 10Dzahn: scap: remove contint2001 from "dsh groups" [puppet] - 10https://gerrit.wikimedia.org/r/867713 [21:02:00] I'll start with the popups backport jan_drewniak [21:02:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/Popups] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867233 (https://phabricator.wikimedia.org/T325007) (owner: 10Jdlrobson) [21:02:17] Ok [21:03:01] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38760/console" [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [21:03:15] (03CR) 10Raymond Ndibe: "hello David, thanks for making these final changes. reviewing soon!" [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [21:05:17] (03PS2) 10Dzahn: ci: make contint2002 the new rsync source, remove contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/867712 (https://phabricator.wikimedia.org/T324659) [21:05:36] (03CR) 10Aqu: Backing up HDFS FSImage to HDFS on Monday morning (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [21:05:58] I can deploy. 
[21:06:15] (03PS1) 10Dzahn: elasticsearch/relforge: add contint2002 to cirrus::ferm_srange [puppet] - 10https://gerrit.wikimedia.org/r/867714 (https://phabricator.wikimedia.org/T324659) [21:06:19] Oops [21:06:27] (Wasn't scrolled down. :) ) [21:06:41] kindrobot: I'm midway through 867233, but you're welcome to take the config patch after? :) [21:07:15] Sure, sounds good. [21:07:26] I'll ping you :) [21:07:31] Great. [21:07:37] (03Merged) 10jenkins-bot: Child elements also trigger previews [extensions/Popups] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867233 (https://phabricator.wikimedia.org/T325007) (owner: 10Jdlrobson) [21:08:06] !log samtar@deploy1002 Started scap: Backport for [[gerrit:867233|Child elements also trigger previews (T325007)]] [21:08:10] T325007: [regression,subtask] page preview doesn't display when mouse hovers over wikilinked text that has italic and / or span (and perhaps other html) markup - https://phabricator.wikimedia.org/T325007 [21:09:53] !log samtar@deploy1002 samtar and jdlrobson: Backport for [[gerrit:867233|Child elements also trigger previews (T325007)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [21:10:08] jan_drewniak: that's live on mwdebug, can you test? :) [21:11:34] TheresNoTime: yup, I see the fix on mwdebug1001 [21:11:43] great, syncing [21:13:24] (03PS3) 10Andrew Bogott: rsyslog::receiver: fix cert_file when used with acme certs [puppet] - 10https://gerrit.wikimedia.org/r/867709 (https://phabricator.wikimedia.org/T127717) [21:14:27] (03CR) 10Hashar: "This change is ready for review." 
[software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/867656 (owner: 10Hashar) [21:17:02] (03CR) 10Southparkfan: [C: 03+1] "+1, resolves certificate chaining issues" [puppet] - 10https://gerrit.wikimedia.org/r/867709 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [21:17:44] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:867233|Child elements also trigger previews (T325007)]] (duration: 09m 38s) [21:17:48] T325007: [regression,subtask] page preview doesn't display when mouse hovers over wikilinked text that has italic and / or span (and perhaps other html) markup - https://phabricator.wikimedia.org/T325007 [21:18:16] kindrobot: all yours for zabe's 863228 [21:18:27] Great. You ready zabe? [21:18:31] yep [21:19:09] Great, just a moment. [21:20:02] (03CR) 10Andrew Bogott: [C: 03+2] rsyslog::receiver: fix cert_file when used with acme certs [puppet] - 10https://gerrit.wikimedia.org/r/867709 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [21:20:22] (03CR) 10Southparkfan: "For historical purposes, the reasoning behind choosing 'alt.chained.crt' instead of 'chained.crt':" [puppet] - 10https://gerrit.wikimedia.org/r/867709 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [21:20:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863228 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:21:32] (03Merged) 10jenkins-bot: Start writing to cul_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863228 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:21:57] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:863228|Start writing to cul_actor everywhere (T233004)]] [21:22:01] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:23:29] !log cwhite@cumin2002 START - Cookbook 
sre.hosts.reimage for host logstash2035.codfw.wmnet with OS bullseye [21:23:42] !log kindrobot@deploy1002 kindrobot and zabe: Backport for [[gerrit:863228|Start writing to cul_actor everywhere (T233004)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:24:20] zabe: it's live on mwdebug; can you test? [21:24:21] 10ops-eqiad, 10Continuous-Integration-Infrastructure, 10decommission-hardware, 10serviceops-collab: decommission contint1001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T325102 (10Dzahn) [21:24:35] yes [21:24:45] could you do a query for me? [21:24:52] Sure [21:26:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add ihurbain to parsoid-test-roots - https://phabricator.wikimedia.org/T325080 (10Dzahn) Thanks, technically "role approver" and "manager" can indeed be separate people and separate approvals. [21:26:13] select * from cu_log where cul_user_text="Zabe"; [21:26:20] ^ could you run that in enwiki [21:26:55] and post it to a wmf-nda protected paste [21:28:18] Oh actually, I've never done a production db query before, and that table is not in the replica. I'm not sure if I have access. TheresNoTime could you please help? [21:29:22] kindrobot: you can literally just `sql enwiki` [21:29:25] and then run the query [21:29:40] Ah, OK. [21:29:44] if the table has many columns, you might not want to use ; termination though, as the format can be pants [21:30:23] (might want to use \G) [21:30:44] (still needed?) [21:30:57] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2034.codfw.wmnet with OS bullseye [21:31:26] Great. Query successful. Is there something I need to do special for a wmf-nda paste? [21:31:53] As long as you use the... 
special starting link, no [21:32:20] ala https://phabricator.wikimedia.org/paste/edit/form/36/ [21:32:35] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2033.codfw.wmnet with OS bullseye [21:32:48] Thank you! [21:33:07] (03PS10) 10Eevans: Configure new cassandra-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) [21:33:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:00] zabe: https://phabricator.wikimedia.org/P42690 [21:34:18] (03CR) 10Herron: [C: 03+1] logstash: reformat python escape sequences before parsing k8s logs [puppet] - 10https://gerrit.wikimedia.org/r/867629 (https://phabricator.wikimedia.org/T325085) (owner: 10Cwhite) [21:34:34] (03CR) 10Eevans: [C: 03+2] Configure new cassandra-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans) [21:34:37] kindrobot, thanks, lgtm [21:34:59] Great, continuing the deploy. 
[21:35:44] !log Deploying analytics/refinery (HDFS FSImage conversion to XML script) [21:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:58] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2035.codfw.wmnet with reason: host reimage [21:40:23] !log aqu@deploy1002 Started deploy [analytics/refinery@66736e1]: HDFS FSImage conversion to XML script [analytics/refinery@66736e1] [21:40:44] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:863228|Start writing to cul_actor everywhere (T233004)]] (duration: 18m 47s) [21:40:47] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:41:42] !log Finishing UTC late backport window [21:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:58] Thanks Reedy :) [21:43:05] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2035.codfw.wmnet with reason: host reimage [21:43:25] (03PS3) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) [21:43:45] (03CR) 10CI reject: [V: 04-1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: 10Aqu) [21:43:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:44:38] (03PS4) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) [21:44:43] (03PS3) 10Reedy: CommonSettings.php: ExtensionDistributor updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866574 
(https://phabricator.wikimedia.org/T324808) [21:44:48] (03CR) 10Reedy: [C: 03+2] CommonSettings.php: ExtensionDistributor updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866574 (https://phabricator.wikimedia.org/T324808) (owner: 10Reedy) [21:45:04] (03CR) 10CI reject: [V: 04-1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: 10Aqu) [21:45:26] (03Merged) 10jenkins-bot: CommonSettings.php: ExtensionDistributor updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866574 (https://phabricator.wikimedia.org/T324808) (owner: 10Reedy) [21:47:18] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2034.codfw.wmnet with reason: host reimage [21:48:53] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2033.codfw.wmnet with reason: host reimage [21:48:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:50:25] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2034.codfw.wmnet with reason: host reimage [21:52:37] !log reedy@deploy1002 Synchronized wmf-config/CommonSettings.php: extension distributor updates (duration: 06m 50s) [21:53:01] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2033.codfw.wmnet with reason: host reimage [21:55:11] PROBLEM - cassandra-b service on cassandra-dev2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:56:05] PROBLEM - cassandra-a CQL 10.192.32.84:9042 on cassandra-dev2002 is CRITICAL: connect to address 10.192.32.84 and port 9042: 
Connection refused https://phabricator.wikimedia.org/T93886 [21:56:21] RECOVERY - cassandra-b service on cassandra-dev2001 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:57:51] PROBLEM - cassandra-a SSL 10.192.32.84:7001 on cassandra-dev2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:59:37] PROBLEM - cassandra-a service on cassandra-dev2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:00:01] PROBLEM - cassandra-b service on cassandra-dev2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:00:17] PROBLEM - Check systemd state on cassandra-dev2001 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,cassandra-b.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:35] (03CR) 10Cwhite: [C: 03+2] logstash: reformat python escape sequences before parsing k8s logs [puppet] - 10https://gerrit.wikimedia.org/r/867629 (https://phabricator.wikimedia.org/T325085) (owner: 10Cwhite) [22:00:54] (03PS4) 10Bking: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/867646 (https://phabricator.wikimedia.org/T323096) [22:00:59] PROBLEM - cassandra-b CQL 10.192.32.85:9042 on cassandra-dev2002 is CRITICAL: connect to address 10.192.32.85 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [22:02:23] PROBLEM - cassandra-b SSL 10.192.32.85:7001 on cassandra-dev2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:03:10] (03PS5) 10Bking: Mount labstore to wcqs/wdqs 
instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/867646 (https://phabricator.wikimedia.org/T323096) [22:03:41] PROBLEM - cassandra-b service on cassandra-dev2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:05:29] RECOVERY - cassandra-b service on cassandra-dev2001 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:05:45] RECOVERY - Check systemd state on cassandra-dev2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:05:59] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867646 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [22:06:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:06:56] !log aqu@deploy1002 Finished deploy [analytics/refinery@66736e1]: HDFS FSImage conversion to XML script [analytics/refinery@66736e1] (duration: 26m 32s) [22:08:33] !log aqu@deploy1002 Started deploy [analytics/refinery@66736e1] (thin): HDFS FSImage conversion to XML script THIN [analytics/refinery@66736e1] [22:08:40] !log aqu@deploy1002 Finished deploy [analytics/refinery@66736e1] (thin): HDFS FSImage conversion to XML script THIN [analytics/refinery@66736e1] (duration: 00m 07s) [22:09:03] !log aqu@deploy1002 Started deploy [analytics/refinery@66736e1] (hadoop-test): HDFS FSImage conversion to XML script TEST [analytics/refinery@66736e1] [22:10:14] !log aqu@deploy1002 Finished deploy [analytics/refinery@66736e1] (hadoop-test): HDFS FSImage conversion to XML script TEST [analytics/refinery@66736e1] (duration: 01m 11s) [22:11:15] o/ hello, is anything being deployed right now? seems the fix for a UBN (popups extension) that was just deployed needs to be reverted. 
[22:11:55] jouncebot: nowandnext [22:11:55] No deployments scheduled for the next 9 hour(s) and 48 minute(s) [22:11:55] In 9 hour(s) and 48 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T0800) [22:12:04] jan_drewniak: looks clear [22:12:24] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2034.codfw.wmnet with OS bullseye [22:13:39] PROBLEM - cassandra-b service on cassandra-dev2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:13:59] PROBLEM - Check systemd state on cassandra-dev2001 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-b.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:14:06] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2033.codfw.wmnet with OS bullseye [22:16:14] (03CR) 10Bking: [C: 03+2] Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/867646 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [22:17:49] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on cassandra-dev2003 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [22:18:55] PROBLEM - cassandra-a CQL 10.192.16.14:9042 on cassandra-dev2001 is CRITICAL: connect to address 10.192.16.14 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [22:19:23] PROBLEM - cassandra-a SSL 10.192.48.54:7001 on cassandra-dev2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:20:33] RECOVERY - cassandra-a CQL 10.192.16.14:9042 on cassandra-dev2001 is OK: TCP OK - 0.032 second response time on 10.192.16.14 port 9042 
https://phabricator.wikimedia.org/T93886 [22:20:35] PROBLEM - cassandra-a SSL 10.192.16.14:7001 on cassandra-dev2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:20:45] RECOVERY - cassandra-a service on cassandra-dev2002 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:21:07] PROBLEM - cassandra-a service on cassandra-dev2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:21:14] (PS1) Jdlrobson: Account for syntax errors in closest selector [extensions/Popups] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/867617 (https://phabricator.wikimedia.org/T325113) [22:21:35] (PS1) Jdlrobson: Account for syntax errors in closest selector [extensions/Popups] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/867618 (https://phabricator.wikimedia.org/T325113) [22:22:35] RECOVERY - cassandra-a CQL 10.192.32.84:9042 on cassandra-dev2002 is OK: TCP OK - 0.033 second response time on 10.192.32.84 port 9042 https://phabricator.wikimedia.org/T93886 [22:22:45] PROBLEM - cassandra-b CQL 10.192.48.55:9042 on cassandra-dev2003 is CRITICAL: connect to address 10.192.48.55 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [22:23:15] RECOVERY - cassandra-b service on cassandra-dev2001 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:23:39] RECOVERY - Check systemd state on cassandra-dev2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:23:49] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2035.codfw.wmnet with OS bullseye [22:24:27] PROBLEM - cassandra-b SSL 10.192.48.55:7001 on 
cassandra-dev2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:25:01] PROBLEM - cassandra-b SSL 10.192.16.15:7001 on cassandra-dev2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:25:45] RECOVERY - cassandra-b service on cassandra-dev2002 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:26:05] PROBLEM - cassandra-b service on cassandra-dev2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:26:37] RECOVERY - cassandra-b CQL 10.192.32.85:9042 on cassandra-dev2002 is OK: TCP OK - 0.033 second response time on 10.192.32.85 port 9042 https://phabricator.wikimedia.org/T93886 [22:27:55] SRE, DNS, Phabricator, Traffic-Icebox, Patch-For-Review: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252 (Dzahn) @MC8 Hi! Given how you, as the ticket creator, even originally said "might be nice", the time that has passed since then... 
[22:28:09] RECOVERY - cassandra-a CQL 10.192.48.54:9042 on cassandra-dev2003 is OK: TCP OK - 0.033 second response time on 10.192.48.54 port 9042 https://phabricator.wikimedia.org/T93886 [22:28:12] (CR) TrainBranchBot: [C: +2] "Approved by jdrewniak@deploy1002 using scap backport" [extensions/Popups] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/867617 (https://phabricator.wikimedia.org/T325113) (owner: Jdlrobson) [22:28:15] RECOVERY - cassandra-a service on cassandra-dev2003 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:28:33] SRE, DNS, Phabricator, Traffic-Icebox, and 2 others: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252 (Dzahn) [22:29:14] Ok, I'm backporting an extension/Popup patch to wmf/1.40.0-wmf.13 and wmf/1.40.0-wmf.14 [22:29:43] RECOVERY - cassandra-b service on cassandra-dev2003 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:31:04] (PS46) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [22:31:14] (CR) CI reject: [V: -1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [22:31:45] RECOVERY - cassandra-b CQL 10.192.48.55:9042 on cassandra-dev2003 is OK: TCP OK - 0.033 second response time on 10.192.48.55 port 9042 https://phabricator.wikimedia.org/T93886 [22:32:40] SRE, DNS, Phabricator, Traffic-Icebox, and 2 others: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252 (MC8) Yes, I think that's sensible. 
[22:32:47] (Merged) jenkins-bot: Account for syntax errors in closest selector [extensions/Popups] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/867617 (https://phabricator.wikimedia.org/T325113) (owner: Jdlrobson) [22:33:11] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:867617|Account for syntax errors in closest selector (T325113)]] [22:33:15] T325113: [subtask] SyntaxError: Element.closest: '#mw-content-text a[href][title]:not(.extiw, .mw-selflink, .image, .new, .internal, .external, .mw-cite-backlink a, .oo-ui-buttonedElement-button, .ve-ce-surface a, .cancelLink a), #mw-content-text .reference a[ href*="#" ]' is not a valid selector - https://phabricator.wikimedia.org/T325113 [22:34:18] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [22:34:32] (PS4) DLynch: Deployment of DiscussionTools reply visual enhancements for more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/867311 (https://phabricator.wikimedia.org/T323537) [22:34:58] !log jdrewniak@deploy1002 jdrewniak and jdlrobson: Backport for [[gerrit:867617|Account for syntax errors in closest selector (T325113)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [22:35:30] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:38:55] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "parse1002: failed -> active - dzahn@cumin2002" [22:39:01] (PS1) DLynch: VisualEnhancements: in some languages put an arrow by the reply button [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/867619 (https://phabricator.wikimedia.org/T323537) [22:40:14] (PS1) DLynch: VisualEnhancements: in some languages put an arrow by the reply button [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/867620 (https://phabricator.wikimedia.org/T323537) [22:40:15] !log netbox: set 
parse1002 status: failed -> active in web UI; ran cookbook 'sre.puppet.sync-netbox-hiera' to get data in sync - T324949 [22:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:18] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "parse1002: failed -> active - dzahn@cumin2002" [22:40:19] T324949: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 [22:42:31] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:867617|Account for syntax errors in closest selector (T325113)]] (duration: 09m 20s) [22:42:35] T325113: [subtask] SyntaxError: Element.closest: '#mw-content-text a[href][title]:not(.extiw, .mw-selflink, .image, .new, .internal, .external, .mw-cite-backlink a, .oo-ui-buttonedElement-button, .ve-ce-surface a, .cancelLink a), #mw-content-text .reference a[ href*="#" ]' is not a valid selector - https://phabricator.wikimedia.org/T325113 [22:43:01] (CR) TrainBranchBot: [C: +2] "Approved by jdrewniak@deploy1002 using scap backport" [extensions/Popups] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/867618 (https://phabricator.wikimedia.org/T325113) (owner: Jdlrobson) [22:48:32] (Merged) jenkins-bot: Account for syntax errors in closest selector [extensions/Popups] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/867618 (https://phabricator.wikimedia.org/T325113) (owner: Jdlrobson) [22:48:57] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:867618|Account for syntax errors in closest selector (T325113)]] [22:49:00] T325113: [subtask] SyntaxError: Element.closest: '#mw-content-text a[href][title]:not(.extiw, .mw-selflink, .image, .new, .internal, .external, .mw-cite-backlink a, .oo-ui-buttonedElement-button, .ve-ce-surface a, .cancelLink a), #mw-content-text .reference a[ href*="#" ]' is not a valid selector - https://phabricator.wikimedia.org/T325113 [22:50:06] (PS1) 
Sbailey: enable Linter extension maintNamespace.php in Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/867724 (https://phabricator.wikimedia.org/T299612) [22:50:49] !log jdrewniak@deploy1002 jdrewniak and jdlrobson: Backport for [[gerrit:867618|Account for syntax errors in closest selector (T325113)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [22:56:29] (CR) Subramanya Sastry: [C: +2] enable Linter extension maintNamespace.php in Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/867724 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey) [22:57:09] (Merged) jenkins-bot: enable Linter extension maintNamespace.php in Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/867724 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey) [22:57:26] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:867618|Account for syntax errors in closest selector (T325113)]] (duration: 08m 29s) [22:57:31] T325113: [subtask] SyntaxError: Element.closest: '#mw-content-text a[href][title]:not(.extiw, .mw-selflink, .image, .new, .internal, .external, .mw-cite-backlink a, .oo-ui-buttonedElement-button, .ve-ce-surface a, .cancelLink a), #mw-content-text .reference a[ href*="#" ]' is not a valid selector - https://phabricator.wikimedia.org/T325113 [23:13:56] SRE, DNS, Phabricator, Traffic-Icebox, and 2 others: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252 (Dzahn) @MC8 Ok, thank you. I really wanted to check in with you first because I know how it can be frustrating to file a ticket and the... 
[23:14:05] SRE, DNS, Phabricator, Traffic-Icebox, and 2 others: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252 (Dzahn) Open→Declined [23:21:28] (PS3) Andrea Denisse: librenms: Increase the TTL for LibreNMS [dns] - https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) [23:22:27] (CR) Andrea Denisse: librenms: Increase the TTL for LibreNMS (1 comment) [dns] - https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: Andrea Denisse) [23:22:55] (CR) Andrea Denisse: librenms: Increase the TTL for LibreNMS (1 comment) [dns] - https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: Andrea Denisse) [23:23:36] (PS2) Andrea Denisse: netmon: Remove the netmon1002 instance as passive node [puppet] - https://gerrit.wikimedia.org/r/866526 (https://phabricator.wikimedia.org/T322321) [23:24:11] (CR) Andrea Denisse: [C: +2] netmon: Remove the netmon1002 instance as passive node [puppet] - https://gerrit.wikimedia.org/r/866526 (https://phabricator.wikimedia.org/T322321) (owner: Andrea Denisse) [23:36:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:41:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown