[00:17:19] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:37] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:37] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:27] SRE, SRE-OnFire, Observability-Alerting: vopsbot's home directory doesn't get created - https://phabricator.wikimedia.org/T315568 (Dzahn) I can confirm this behaviour. When using systemd::sysuser on a new host it does not create the home dir. (I just started using this for phab hosts and the phd user...
[00:52:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:54:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:56:21] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48534 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:56:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.295 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:00:11] (PS1) Tim Starling: Factor out x2 per-host hieradata into an objectstash role [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427)
[01:00:47] (CR) CI reject: [V: -1] Factor out x2 per-host hieradata into an objectstash role [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427) (owner: Tim Starling)
[01:02:09] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:07:48] (PS1) Tim Starling: SqlBagOStuff: use cancelAtomic() [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824434 (https://phabricator.wikimedia.org/T315274)
[01:08:31] (CR) Tim Starling: [C: +2] SqlBagOStuff: use cancelAtomic() [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824434 (https://phabricator.wikimedia.org/T315274) (owner: Tim Starling)
[01:24:45] (Merged) jenkins-bot: SqlBagOStuff: use cancelAtomic() [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824434 (https://phabricator.wikimedia.org/T315274) (owner: Tim Starling)
[01:29:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:30:59] !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.25/includes/libs/rdbms/database/DBConnRef.php: fix potential mainstash exception file 1 T315274 (duration: 03m 21s)
[01:31:03] T315274: Wikimedia\Rdbms\DBTransactionError: Explicit transaction still active; a caller might have failed to call endAtomic() or cancelAtomic(). - https://phabricator.wikimedia.org/T315274
[01:31:59] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:36:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:37:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:37:59] !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.25/includes/objectcache/SqlBagOStuff.php: fix potential mainstash exception file 2 T315274 (duration: 03m 30s)
[01:38:03] T315274: Wikimedia\Rdbms\DBTransactionError: Explicit transaction still active; a caller might have failed to call endAtomic() or cancelAtomic(). - https://phabricator.wikimedia.org/T315274
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:08] (PS2) Tim Starling: Factor out x2 per-host hieradata into an objectstash role [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427)
[01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:54:01] PROBLEM - puppet last run on gitlab2002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:56:05] (CR) Tim Starling: "Puppet compiler result: https://puppet-compiler.wmflabs.org/pcc-worker1001/36833/" [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427) (owner: Tim Starling)
[02:03:19] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:04:09] SRE-OnFire, Performance-Team, MW-1.39-notes (1.39.0-wmf.25; 2022-08-15), Wikimedia-Incident, Wikimedia-production-error: Wikimedia\Rdbms\DBTransactionError: Explicit transaction still active; a caller might have failed to call endAtomic() or cancelAtomi... - https://phabricator.wikimedia.org/T315274
[02:06:43] RECOVERY - puppet last run on gitlab2002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:29] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:16:07] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:47] (PS6) Ori: Incremental roll-out of query-sorting (0%) [puppet] - https://gerrit.wikimedia.org/r/822434
[02:23:23] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:37:21] (CR) Ori: "PS 6: moved normalize_request_nonmisc below vcl_init. Otherwise we get a 'Symbol not found: cache_local' error from VCC-Compiler. This is " [puppet] - https://gerrit.wikimedia.org/r/822434 (owner: Ori)
[03:23:49] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:28:55] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:01] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:53] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:17] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:03] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: search-drop-query-clicks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:43:05] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:27] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:55:05] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[03:57:27] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 14 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:26:02] !log hashar@deploy1002 Started deploy [integration/docroot@09eb565]: zuul: Fix/remove links to non-existent Grafana graphs - T307405
[04:26:07] T307405: Broken dashboard links on Zuul Status page - https://phabricator.wikimedia.org/T307405
[04:26:15] !log hashar@deploy1002 Finished deploy [integration/docroot@09eb565]: zuul: Fix/remove links to non-existent Grafana graphs - T307405 (duration: 00m 13s)
[04:39:39] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:47] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:11] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:53:53] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:59] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:13] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:15:09] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:20:01] !log Install 10.6.9 on db2122 and db2146
[05:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:07] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:24:09] (PS1) Marostegui: install_server: Do not reimage db2177-db2180 [puppet] - https://gerrit.wikimedia.org/r/824583 (https://phabricator.wikimedia.org/T311494)
[05:25:04] (CR) Marostegui: [C: +2] install_server: Do not reimage db2177-db2180 [puppet] - https://gerrit.wikimedia.org/r/824583 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[05:27:21] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:28:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[05:28:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[05:29:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T312972)', diff saved to https://phabricator.wikimedia.org/P32543 and previous config saved to /var/cache/conftool/dbconfig/20220819-052900-marostegui.json
[05:29:04] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[05:31:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312972)', diff saved to https://phabricator.wikimedia.org/P32544 and previous config saved to /var/cache/conftool/dbconfig/20220819-053110-marostegui.json
[05:31:35] (PS1) Marostegui: db2164: Not future sanitarium master [puppet] - https://gerrit.wikimedia.org/r/824584
[05:33:01] (PS2) Marostegui: db2152: Not future sanitarium master [puppet] - https://gerrit.wikimedia.org/r/824584
[05:34:09] (CR) Marostegui: [C: +2] db2152: Not future sanitarium master [puppet] - https://gerrit.wikimedia.org/r/824584 (owner: Marostegui)
[05:36:40] (PS1) Marostegui: mariadb: Productionize db2181 [puppet] - https://gerrit.wikimedia.org/r/824585 (https://phabricator.wikimedia.org/T311494)
[05:37:57] (CR) Marostegui: [C: +2] mariadb: Productionize db2181 [puppet] - https://gerrit.wikimedia.org/r/824585 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[05:46:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P32546 and previous config saved to /var/cache/conftool/dbconfig/20220819-054616-marostegui.json
[05:48:25] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:54:06] (PS1) Marostegui: mariadb: Bump version to 10.4.26 [software] - https://gerrit.wikimedia.org/r/824586 (https://phabricator.wikimedia.org/T315411)
[05:57:41] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:07] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P32547 and previous config saved to /var/cache/conftool/dbconfig/20220819-060122-marostegui.json
[06:06:02] (CR) Marostegui: [C: +2] mariadb: Bump version to 10.4.26 [software] - https://gerrit.wikimedia.org/r/824586 (https://phabricator.wikimedia.org/T315411) (owner: Marostegui)
[06:06:33] (Merged) jenkins-bot: mariadb: Bump version to 10.4.26 [software] - https://gerrit.wikimedia.org/r/824586 (https://phabricator.wikimedia.org/T315411) (owner: Marostegui)
[06:12:37] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:15:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P32548 and previous config saved to /var/cache/conftool/dbconfig/20220819-061515-root.json
[06:16:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312972)', diff saved to https://phabricator.wikimedia.org/P32549 and previous config saved to /var/cache/conftool/dbconfig/20220819-061628-marostegui.json
[06:16:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[06:16:33] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[06:16:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[06:16:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T312972)', diff saved to https://phabricator.wikimedia.org/P32550 and previous config saved to /var/cache/conftool/dbconfig/20220819-061649-marostegui.json
[06:18:11] (PS1) Giuseppe Lavagetto: admin: (oblivian) add helper functions to my bashrc [puppet] - https://gerrit.wikimedia.org/r/824588
[06:21:23] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:24:22] (CR) Hashar: [V: +1] "I have tested it locally with Gerrit 3.4 and this does not alter the rendering. There is no voteChip class and the styling is still done b" [puppet] - https://gerrit.wikimedia.org/r/824221 (https://phabricator.wikimedia.org/T315445) (owner: Hashar)
[06:30:55] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:36:53] (CR) Giuseppe Lavagetto: [C: +2] admin: (oblivian) add helper functions to my bashrc [puppet] - https://gerrit.wikimedia.org/r/824588 (owner: Giuseppe Lavagetto)
[06:39:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T312972)', diff saved to https://phabricator.wikimedia.org/P32551 and previous config saved to /var/cache/conftool/dbconfig/20220819-063903-marostegui.json
[06:39:08] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[06:54:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P32552 and previous config saved to /var/cache/conftool/dbconfig/20220819-065409-marostegui.json
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220819T0700)
[07:03:04] SRE, Infrastructure-Foundations, netops: Netbox DNS changes are not updating - https://phabricator.wikimedia.org/T315630 (cmooney) Hi @Papaul My apologies that's due to me, as soon as I can get https://gerrit.wikimedia.org/r/c/operations/dns/+/824572 merged I'll sort it.
[07:08:21] (CR) Marostegui: [C: +1] auto_schema: Treat master of all dcs like a master of active dc [software] - https://gerrit.wikimedia.org/r/820216 (https://phabricator.wikimedia.org/T314486) (owner: Ladsgroup)
[07:09:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P32553 and previous config saved to /var/cache/conftool/dbconfig/20220819-070916-marostegui.json
[07:10:06] (CR) Ladsgroup: [C: +2] auto_schema: Treat master of all dcs like a master of active dc [software] - https://gerrit.wikimedia.org/r/820216 (https://phabricator.wikimedia.org/T314486) (owner: Ladsgroup)
[07:10:54] (CR) Ladsgroup: [C: +2] auto_schema: Add tests for replicas to pick [software] - https://gerrit.wikimedia.org/r/820181 (https://phabricator.wikimedia.org/T299445) (owner: Ladsgroup)
[07:11:25] (Merged) jenkins-bot: auto_schema: Add tests for replicas to pick [software] - https://gerrit.wikimedia.org/r/820181 (https://phabricator.wikimedia.org/T299445) (owner: Ladsgroup)
[07:11:28] (Merged) jenkins-bot: auto_schema: Treat master of all dcs like a master of active dc [software] - https://gerrit.wikimedia.org/r/820216 (https://phabricator.wikimedia.org/T314486) (owner: Ladsgroup)
[07:12:22] SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (Ladsgroup)
[07:18:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[07:19:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[07:19:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[07:19:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[07:19:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T314041)', diff saved to https://phabricator.wikimedia.org/P32555 and previous config saved to /var/cache/conftool/dbconfig/20220819-071934-ladsgroup.json
[07:19:38] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[07:20:33] !log killing cswiki's refreshlinksrecom script T299021
[07:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:36] T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021
[07:24:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T312972)', diff saved to https://phabricator.wikimedia.org/P32556 and previous config saved to /var/cache/conftool/dbconfig/20220819-072422-marostegui.json
[07:24:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:24:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:24:27] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[07:24:39] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:28:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T312972)', diff saved to https://phabricator.wikimedia.org/P32557 and previous config saved to /var/cache/conftool/dbconfig/20220819-072800-marostegui.json
[07:32:23] (PS1) Marostegui: mariadb: Productionize db2182 [puppet] - https://gerrit.wikimedia.org/r/824682 (https://phabricator.wikimedia.org/T311494)
[07:34:14] (CR) Marostegui: [C: +2] mariadb: Productionize db2182 [puppet] - https://gerrit.wikimedia.org/r/824682 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[07:43:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P32558 and previous config saved to /var/cache/conftool/dbconfig/20220819-074306-marostegui.json
[07:46:29] (CR) Marostegui: "This looks good, but I would like to ask Jaime for his thoughts too" [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427) (owner: Tim Starling)
[07:58:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P32559 and previous config saved to /var/cache/conftool/dbconfig/20220819-075812-marostegui.json
[08:04:56] (CR) Jcrespo: [C: +1] Factor out x2 per-host hieradata into an objectstash role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427) (owner: Tim Starling)
[08:06:16] (PS1) Phuedx: Remove $wgWMESearchRelevancePages [mediawiki-config] - https://gerrit.wikimedia.org/r/824685
[08:11:54] (CR) MVernon: [C: +2] Add user eevans to ops group [puppet] - https://gerrit.wikimedia.org/r/824567 (owner: Eevans)
[08:12:45] (PS2) Phuedx: Remove $wgWMESearchRelevancePages [mediawiki-config] - https://gerrit.wikimedia.org/r/824685
[08:13:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T312972)', diff saved to https://phabricator.wikimedia.org/P32561 and previous config saved to /var/cache/conftool/dbconfig/20220819-081317-marostegui.json
[08:13:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[08:13:22] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[08:13:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[08:13:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:13:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:13:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T312972)', diff saved to https://phabricator.wikimedia.org/P32562 and previous config saved to /var/cache/conftool/dbconfig/20220819-081356-marostegui.json
[08:15:30] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:15:50] PROBLEM - SSH on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:16:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312972)', diff saved to https://phabricator.wikimedia.org/P32563 and previous config saved to /var/cache/conftool/dbconfig/20220819-081606-marostegui.json
[08:16:43] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be2067.codfw.wmnet
[08:16:44] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2067.codfw.wmnet
[08:18:12] PROBLEM - Check systemd state on wdqs1015 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:25] (PS1) MVernon: swift: ms-be2067/sdc1 has failed [puppet] - https://gerrit.wikimedia.org/r/824686 (https://phabricator.wikimedia.org/T314049)
[08:26:24] PROBLEM - SSH on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:31:06] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:31:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P32564 and previous config saved to /var/cache/conftool/dbconfig/20220819-083112-marostegui.json
[08:31:24] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:33:59] (PS2) MMandere: utils: Add latency measurement program [dns] - https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536)
[08:34:20] PROBLEM - SSH on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:34:49] (CR) CI reject: [V: -1] utils: Add latency measurement program [dns] - https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) (owner: MMandere)
[08:35:39] (CR) Filippo Giunchedi: [C: +1] swift: ms-be2067/sdc1 has failed [puppet] - https://gerrit.wikimedia.org/r/824686 (https://phabricator.wikimedia.org/T314049) (owner: MVernon)
[08:38:20] (PS1) Marostegui: site.pp: Remove insetup from db2181 and db2182 [puppet] - https://gerrit.wikimedia.org/r/824687 (https://phabricator.wikimedia.org/T311494)
[08:38:43] (CR) MMandere: utils: Add latency measurement program (1 comment) [dns] - https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) (owner: MMandere)
[08:38:56] (CR) Giuseppe Lavagetto: [C: +1] "I would probably add a TODO to go back to the current form once we've got rid of buster, but LGTM" [puppet] - https://gerrit.wikimedia.org/r/824450 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[08:40:04] (CR) Filippo Giunchedi: [C: +2] docker: use ExecStartPre to implement --pull=always [puppet] - https://gerrit.wikimedia.org/r/824450 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[08:40:06] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36835/console" [puppet] - https://gerrit.wikimedia.org/r/824567 (owner: Eevans)
[08:40:16] (CR) Marostegui: [C: +2] site.pp: Remove insetup from db2181 and db2182 [puppet] - https://gerrit.wikimedia.org/r/824687 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[08:40:25] (CR) Filippo Giunchedi: [C: +2] service: use --env-file for docker [puppet] - https://gerrit.wikimedia.org/r/824451 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[08:40:35] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36835/console" [puppet] - https://gerrit.wikimedia.org/r/824567 (owner: Eevans)
[08:40:52] (CR) MVernon: [C: +2] swift: ms-be2067/sdc1 has failed [puppet] - https://gerrit.wikimedia.org/r/824686 (https://phabricator.wikimedia.org/T314049) (owner: MVernon)
[08:40:54] (PS3) Filippo Giunchedi: docker: use ExecStartPre to implement --pull=always [puppet] - https://gerrit.wikimedia.org/r/824450 (https://phabricator.wikimedia.org/T313229)
[08:42:44] (PS3) Filippo Giunchedi: service: use --env-file for docker [puppet] - https://gerrit.wikimedia.org/r/824451 (https://phabricator.wikimedia.org/T313229)
[08:43:32] (CR) Filippo Giunchedi: [C: +2] postgresql: default to autodetecting pg version [cookbooks] - https://gerrit.wikimedia.org/r/824486 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[08:43:35] (PS2) Filippo Giunchedi: postgresql: default to autodetecting pg version [cookbooks] - https://gerrit.wikimedia.org/r/824486 (https://phabricator.wikimedia.org/T313229)
[08:44:00] PROBLEM - Check systemd state on wdqs1016 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:44:01] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade.
[08:44:05] (PS1) Clément Goubert: icinga: add cgoubert to the right groups in icinga [puppet] - https://gerrit.wikimedia.org/r/824689
[08:44:50] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:45:52] RECOVERY - SSH on wdqs1016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:46:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P32565 and previous config saved to /var/cache/conftool/dbconfig/20220819-084618-marostegui.json
[08:46:46] RECOVERY - Check systemd state on wdqs1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:47:37] (CR) Jbond: [C: +1] "np, response inline" [puppet] - https://gerrit.wikimedia.org/r/824299 (https://phabricator.wikimedia.org/T314936) (owner: Andrea Denisse)
[08:47:44] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36836/console" [puppet] - https://gerrit.wikimedia.org/r/824689 (owner: Clément Goubert)
[08:50:00] (CR) Jbond: [C: +1] bullseye: add thirdparty/elasticsearch-curator5 [puppet] - https://gerrit.wikimedia.org/r/824568 (https://phabricator.wikimedia.org/T315604) (owner: Ryan Kemper)
[08:51:31] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/824689 (owner: Clément Goubert)
[08:51:35] (CR) Jbond: "duplicat https://gerrit.wikimedia.org/r/c/operations/puppet/+/824568" [puppet] - https://gerrit.wikimedia.org/r/824569 (https://phabricator.wikimedia.org/T315604) (owner: Bking)
[08:52:18] (CR) Jbond: [V: +1] O:phabricator: move common settings to role hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824412 (owner: Jbond)
[08:52:49] (CR) Clément Goubert: [C: +2] pcc: Encode jenkins username to utf-8 [puppet] - https://gerrit.wikimedia.org/r/824209 (owner: Clément Goubert)
[08:52:52] (CR) Jbond: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/824572 (https://phabricator.wikimedia.org/T315429) (owner: Cathal Mooney)
[08:53:14] (CR) Jbond: phabricator: move lvs::realserver inclusion to profile, create use_lvs parameter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[08:54:45] (PS1) ArielGlenn: add php7.4 install to the snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/824690 (https://phabricator.wikimedia.org/T271736)
[08:55:48] (CR) Jbond: [C: +1] postgresql: default to autodetecting pg version [cookbooks] - https://gerrit.wikimedia.org/r/824486 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[08:56:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade.
[08:56:55] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons.
[08:57:50] (03CR) 10FNegri: [C: 03+2] ceph: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823667 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [08:59:09] (03CR) 10FNegri: [C: 03+2] global: add inventory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:01:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312972)', diff saved to https://phabricator.wikimedia.org/P32566 and previous config saved to /var/cache/conftool/dbconfig/20220819-090124-marostegui.json [09:01:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:01:29] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [09:01:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:01:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32567 and previous config saved to /var/cache/conftool/dbconfig/20220819-090146-marostegui.json [09:02:03] (03CR) 10Hashar: doc: properly redirect back compat URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar) [09:02:28] RECOVERY - SSH on wdqs1014 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:03:34] PROBLEM - Check systemd state on wdqs1014 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:04:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32568 and previous config saved to /var/cache/conftool/dbconfig/20220819-090456-marostegui.json [09:05:43] (03CR) 10FNegri: [C: 03+2] Openstack: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823666 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:07:30] RECOVERY - Check systemd state on wdqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:50] 10SRE, 10Infrastructure-Foundations, 10netops: Occasional high ICMP probe response from codfw to cr2-drmrs - https://phabricator.wikimedia.org/T315645 (10cmooney) p:05Triage→03Low [09:09:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:09:15] 10SRE, 10Infrastructure-Foundations, 10netops: Occasional high ICMP probe response from codfw to cr2-drmrs - https://phabricator.wikimedia.org/T315645 (10cmooney) a:03cmooney [09:10:03] (03CR) 10Cathal Mooney: [C: 03+2] Add include statement for 2001:df2:e500:fe07::/64 reverse entries [dns] - 10https://gerrit.wikimedia.org/r/824572 (https://phabricator.wikimedia.org/T315429) (owner: 10Cathal Mooney) [09:11:43] !log running authdns-update on auth1001 to add new include to 0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa. 
zone [09:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:20] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:14:26] (03CR) 10CI reject: [V: 04-1] Openstack: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823666 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:14:28] (03CR) 10CI reject: [V: 04-1] ceph: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823667 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:16:14] (03CR) 10Filippo Giunchedi: "LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [09:16:58] (03PS5) 10FNegri: global: add inventory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:17:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:18:56] (03PS4) 10FNegri: Openstack: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823666 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:19:52] (03CR) 10FNegri: [C: 03+2] global: add inventory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:20:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P32569 and previous config saved to /var/cache/conftool/dbconfig/20220819-092002-marostegui.json [09:20:29] (03PS5) 10FNegri: ceph: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823667 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:20:43] (03PS5) 10FNegri: ceph: use human-readable names for ceph clusters [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/823668 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:20:50] (03PS5) 10FNegri: ceph: use the correct codfw ceph mon hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823669 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:20:57] (03PS5) 10FNegri: ceph,opensatck: use the inventory to get the nodes domain [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823670 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:21:03] (03PS5) 10FNegri: ceph: add roll_restart_osd_daemons cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823671 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:21:09] (03PS7) 10FNegri: ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824149 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:21:27] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [09:21:53] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[09:22:05] (03CR) 10FNegri: [C: 03+2] ceph: use human-readable names for ceph clusters [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823668 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:22:25] (03CR) 10FNegri: [C: 03+2] ceph: use the correct codfw ceph mon hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823669 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:22:36] (03CR) 10FNegri: [C: 03+2] ceph,opensatck: use the inventory to get the nodes domain [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823670 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:22:47] (03CR) 10FNegri: [C: 03+2] ceph: add roll_restart_osd_daemons cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823671 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:23:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. [09:26:05] (03CR) 10FNegri: [C: 03+2] ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824149 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:26:11] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox DNS changes are not updating - https://phabricator.wikimedia.org/T315630 (10cmooney) 05Open→03Resolved Ok got the +1 and merged. All is good now. ` cmooney@cumin1001:~$ dig +short A kubernetes2024.codfw.wmnet @ns0.wikimedia.org 10.192.48.87 cmooney... 
[09:26:21] (03Merged) 10jenkins-bot: global: add inventory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:27:08] RECOVERY - Check systemd state on wdqs1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:24] (03Merged) 10jenkins-bot: Openstack: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823666 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:28:24] (03CR) 10FNegri: [C: 03+2] cloud: reformat cloud.yaml with prettier [puppet] - 10https://gerrit.wikimedia.org/r/824421 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:29:55] 10SRE, 10Infrastructure-Foundations, 10netops: Occasional high ICMP probe response from codfw to cr2-drmrs - https://phabricator.wikimedia.org/T315645 (10cmooney) [09:33:07] (03Merged) 10jenkins-bot: ceph: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823667 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:33:09] (03Merged) 10jenkins-bot: ceph: use human-readable names for ceph clusters [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823668 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:34:49] (03Merged) 10jenkins-bot: ceph: use the correct codfw ceph mon hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823669 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:34:51] (03Merged) 10jenkins-bot: ceph,opensatck: use the inventory to get the nodes domain [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823670 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:35:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P32570 and previous config saved to 
/var/cache/conftool/dbconfig/20220819-093508-marostegui.json [09:36:40] (03PS1) 10Vgutierrez: mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692 [09:39:20] (03Merged) 10jenkins-bot: ceph: add roll_restart_osd_daemons cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823671 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:40:38] (03CR) 10CI reject: [V: 04-1] mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692 (owner: 10Vgutierrez) [09:41:26] (03Merged) 10jenkins-bot: ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824149 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:45:14] (03PS1) 10Btullis: Add the necessary configuration to enable the dse-k8s control plane [puppet] - 10https://gerrit.wikimedia.org/r/824694 (https://phabricator.wikimedia.org/T310196) [09:46:58] (03PS1) 10Btullis: Add dummy tokens for dse_k8s cluster [labs/private] - 10https://gerrit.wikimedia.org/r/824695 (https://phabricator.wikimedia.org/T310196) [09:47:00] (03PS1) 10Jbond: R:systemd::sysuser: drop managehome parameter as it dosn;t work [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) [09:47:19] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy tokens for dse_k8s cluster [labs/private] - 10https://gerrit.wikimedia.org/r/824695 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:49:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36838/console" [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond) [09:50:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32571 and 
previous config saved to /var/cache/conftool/dbconfig/20220819-095014-marostegui.json [09:50:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [09:50:19] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [09:50:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [09:50:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32572 and previous config saved to /var/cache/conftool/dbconfig/20220819-095035-marostegui.json [09:51:24] (03PS2) 10Vgutierrez: mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692 [09:53:00] (03CR) 10JMeybohm: [C: 04-1] Add new admin_ng values for the dse-k8s-eqiad cluster (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:55:23] (03CR) 10CI reject: [V: 04-1] mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692 (owner: 10Vgutierrez) [09:55:57] uhl... that's passing locally [09:56:41] of course... I offended pep8 with a 102 chars long line..
I should be executed right now [09:57:56] (03PS3) 10Vgutierrez: mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692 [10:00:58] (03PS1) 10Btullis: get_cert [puppet] - 10https://gerrit.wikimedia.org/r/824697 [10:02:59] (03PS3) 10Filippo Giunchedi: WIP dispatch: add database role [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) [10:03:01] (03PS3) 10Filippo Giunchedi: WIP: add profile::dispatch [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) [10:03:08] (03CR) 10Filippo Giunchedi: WIP dispatch: add database role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [10:05:05] (03PS2) 10Btullis: Add the necessary configuration to enable the dse-k8s control plane [puppet] - 10https://gerrit.wikimedia.org/r/824694 (https://phabricator.wikimedia.org/T310196) [10:05:31] (03Abandoned) 10Btullis: get_cert [puppet] - 10https://gerrit.wikimedia.org/r/824697 (owner: 10Btullis) [10:06:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T314041)', diff saved to https://phabricator.wikimedia.org/P32573 and previous config saved to /var/cache/conftool/dbconfig/20220819-100633-ladsgroup.json [10:06:38] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:13:08] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:13:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32574 and previous config saved to /var/cache/conftool/dbconfig/20220819-101348-marostegui.json [10:13:53] T312972: Rename index su_normalized on table spoofuser on wmf wikis - 
https://phabricator.wikimedia.org/T312972 [10:21:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P32575 and previous config saved to /var/cache/conftool/dbconfig/20220819-102139-ladsgroup.json [10:27:52] (03PS1) 10Btullis: Add dummy infrastructure_users for dse-k8s cluster [labs/private] - 10https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196) [10:28:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P32576 and previous config saved to /var/cache/conftool/dbconfig/20220819-102854-marostegui.json [10:32:44] (03PS2) 10Btullis: Add dummy infrastructure_users for dse-k8s cluster [labs/private] - 10https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196) [10:34:00] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:36:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P32577 and previous config saved to /var/cache/conftool/dbconfig/20220819-103645-ladsgroup.json [10:37:10] (03CR) 10Btullis: "I'm not sure exactly which infrastructure_users I should add the to dse_k8s block here, so I have taken best guess." 
[labs/private] - 10https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [10:44:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P32578 and previous config saved to /var/cache/conftool/dbconfig/20220819-104400-marostegui.json [10:48:44] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] icinga: add cgoubert to the right groups in icinga [puppet] - 10https://gerrit.wikimedia.org/r/824689 (owner: 10Clément Goubert) [10:51:37] (03PS1) 10Hnowlan: api-gateway: disable shipping logs to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/824703 [10:51:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T314041)', diff saved to https://phabricator.wikimedia.org/P32579 and previous config saved to /var/cache/conftool/dbconfig/20220819-105151-ladsgroup.json [10:51:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:51:55] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:52:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:52:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T314041)', diff saved to https://phabricator.wikimedia.org/P32580 and previous config saved to /var/cache/conftool/dbconfig/20220819-105212-ladsgroup.json [10:54:40] (03PS1) 10Btullis: Add etcd data for dse-k8s kubeserver-api backend selection. 
[puppet] - 10https://gerrit.wikimedia.org/r/824705 (https://phabricator.wikimedia.org/T310172) [10:54:48] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:56:21] (03CR) 10Giuseppe Lavagetto: Add dummy infrastructure_users for dse-k8s cluster (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [10:59:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32581 and previous config saved to /var/cache/conftool/dbconfig/20220819-105906-marostegui.json [10:59:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:59:10] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [10:59:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:59:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:59:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:59:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32582 and previous config saved to /var/cache/conftool/dbconfig/20220819-105934-marostegui.json [11:01:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32583 and previous config saved to /var/cache/conftool/dbconfig/20220819-110145-marostegui.json [11:02:40] (03CR) 10Btullis: Add dummy 
infrastructure_users for dse-k8s cluster (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [11:11:36] (03PS2) 10Jbond: R:systemd::sysuser: drop managehome parameter as it dosn;t work [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) [11:15:41] (03PS3) 10Jbond: R:systemd::sysuser: drop managehome parameter as it dosn;t work [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) [11:16:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P32584 and previous config saved to /var/cache/conftool/dbconfig/20220819-111651-marostegui.json [11:17:14] PROBLEM - Check systemd state on db2114 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:10] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] icinga: add cgoubert to the right groups in icinga (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824689 (owner: 10Clément Goubert) [11:20:00] (03CR) 10Jbond: [C: 03+2] R:systemd::sysuser: drop managehome parameter as it dosn;t work [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond) [11:31:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P32586 and previous config saved to /var/cache/conftool/dbconfig/20220819-113157-marostegui.json [11:32:54] (03PS1) 10Jbond: C:vopsbot: correct data dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/824713 [11:34:33] (03CR) 10Jbond: [C: 03+2] C:vopsbot: correct data dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/824713 (owner: 10Jbond) [11:42:23] 10SRE, 10SRE-OnFire, 
10Observability-Alerting: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10jbond) [11:43:55] (03CR) 10Jbond: P:systemd::timesyncd: allow overriding the protectsystem systemd param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: 10Jbond) [11:47:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32587 and previous config saved to /var/cache/conftool/dbconfig/20220819-114703-marostegui.json [11:47:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:47:08] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [11:47:12] 10SRE, 10Infrastructure-Foundations, 10netops: Occasional high ICMP probe response from codfw to cr2-drmrs - https://phabricator.wikimedia.org/T315645 (10cmooney) Ok so I got some results back. Firstly 10,000 pings to cr1-drmrs from bast2002, starting at 08:14 UTC. Average RTT was 118ms, worst was 154ms: `... 
[11:47:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:47:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [11:47:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [11:47:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 11 hosts with reason: Maintenance [11:48:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 11 hosts with reason: Maintenance [11:49:20] (03PS4) 10David Caro: p:ceph::osd: get the os disks by size [puppet] - 10https://gerrit.wikimedia.org/r/824422 (https://phabricator.wikimedia.org/T314870) [11:49:46] (03CR) 10David Caro: p:ceph::osd: get the os disks by size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824422 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:50:04] RECOVERY - Check systemd state on db2114 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:52] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Allow jumbo frames between cloud hosts in production realm - https://phabricator.wikimedia.org/T315446 (10dcaro) That seemed to do the trick yes! Thanks! [11:56:02] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:59:40] 10SRE, 10SRE-OnFire, 10Observability-Alerting: vopsbot's home directory doesn't get created - https://phabricator.wikimedia.org/T315568 (10fgiunchedi) Thank you for the followup! 
LGTM and working as expected now [12:21:26] (03PS1) 10Btullis: Add a new signing profile for the dse_k8s cfssl-issuer [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196) [12:23:33] (03PS2) 10Krinkle: Switch $wgChronologyProtectorStash to "mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824556 (https://phabricator.wikimedia.org/T314453) (owner: 10Aaron Schulz) [12:24:02] (03CR) 10Btullis: "I have updated the documentation a little here: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#File_cfssl-issuer-values.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [12:28:20] (03PS1) 10Btullis: Add a dummy auth_key for the dse_k8s cluster cfssl-issuer [labs/private] - 10https://gerrit.wikimedia.org/r/824725 (https://phabricator.wikimedia.org/T310196) [12:32:43] (03PS3) 10Btullis: Add the necessary configuration to enable the dse-k8s control plane [puppet] - 10https://gerrit.wikimedia.org/r/824694 (https://phabricator.wikimedia.org/T310196) [12:35:58] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [12:37:41] (03CR) 10Krinkle: [C: 03+2] Switch $wgChronologyProtectorStash to "mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824556 (https://phabricator.wikimedia.org/T314453) (owner: 10Aaron Schulz) [12:39:11] (03Merged) 10jenkins-bot: Switch $wgChronologyProtectorStash to "mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824556 (https://phabricator.wikimedia.org/T314453) (owner: 10Aaron Schulz) [12:43:01] (03PS1) 10Jelto: gitlab: use actual backup name instead of latest on replica [puppet] - 10https://gerrit.wikimedia.org/r/824730 (https://phabricator.wikimedia.org/T274463) [12:44:46] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I0c45b657d9ee7efe (duration: 03m 24s) [12:45:11] 
!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:46:43] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36842/console" [puppet] - 10https://gerrit.wikimedia.org/r/824730 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [12:46:45] (03PS2) 10Btullis: Add new admin_ng values for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) [12:49:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:49:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:50:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:55:16] 10SRE, 10Platform Engineering, 10serviceops, 10Performance-Team (Radar): Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 (10Krinkle) [12:59:30] 10SRE, 10serviceops: Move "redis_sessions" to "redis_misc" cluster - https://phabricator.wikimedia.org/T280586 (10Krinkle) 05Open→03Declined Declining as it has been obsoleted. With T314453 done, the last consumer is gone from this. There are now no references left in wmf-config to the redis_sessions clust... 
[12:59:34] 10SRE, 10Platform Engineering, 10serviceops, 10Performance-Team (Radar): Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 (10Krinkle) [13:05:03] (03PS1) 10Krinkle: redis: Remove references to nutcracker and redis_sessions cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824734 (https://phabricator.wikimedia.org/T267581) [13:05:24] (03PS2) 10Krinkle: redis: Remove references to nutcracker and redis_sessions cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824734 (https://phabricator.wikimedia.org/T267581) [13:08:42] 10SRE, 10Commons, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, and 6 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Lydia_Pintscher) [13:09:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [13:09:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [13:10:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 9 hosts with reason: Maintenance [13:10:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 9 hosts with reason: Maintenance [13:10:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:11:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:11:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [13:11:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [13:11:40] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Depooling db1110 (T312972)', diff saved to https://phabricator.wikimedia.org/P32588 and previous config saved to /var/cache/conftool/dbconfig/20220819-131139-marostegui.json [13:11:44] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [13:11:58] (03PS1) 10Krinkle: Remove references to now-empty redis.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824736 (https://phabricator.wikimedia.org/T267581) [13:12:00] (03PS1) 10Krinkle: redis: Remove now-empty and unreferenced redis.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824737 (https://phabricator.wikimedia.org/T267581) [13:14:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T312972)', diff saved to https://phabricator.wikimedia.org/P32589 and previous config saved to /var/cache/conftool/dbconfig/20220819-131359-marostegui.json [13:16:05] (03CR) 10Ssingh: [C: 03+1] mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692 (owner: 10Vgutierrez) [13:19:21] (03PS1) 10Jelto: gitlab: rotate backups on replica [puppet] - 10https://gerrit.wikimedia.org/r/824739 (https://phabricator.wikimedia.org/T274463) [13:21:43] (03CR) 10Vgutierrez: [C: 03+2] mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692 (owner: 10Vgutierrez) [13:25:04] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:13] (03CR) 10Andrew Bogott: [C: 03+1] P:systemd::timesyncd: allow overriding the protectsystem systemd param [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: 10Jbond) [13:27:32] (03PS2) 10Jelto: gitlab: rotate backups on replica [puppet] - 
10https://gerrit.wikimedia.org/r/824739 (https://phabricator.wikimedia.org/T274463) [13:28:36] (03CR) 10FNegri: [C: 03+1] "Looks reasonably safe, and probably safer than what we have now. ;)" [puppet] - 10https://gerrit.wikimedia.org/r/824422 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:29:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P32590 and previous config saved to /var/cache/conftool/dbconfig/20220819-132905-marostegui.json [13:30:34] (03CR) 10FNegri: [C: 03+1] ceph::osd: add new disks model to disable write caches for [puppet] - 10https://gerrit.wikimedia.org/r/824423 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:31:20] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T315344 (10Cmjohnson) 05Open→03Declined well aware of this [13:32:06] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:57] 10SRE, 10ops-eqiad, 10DC-Ops: dbprov1002 lost power redundancy - https://phabricator.wikimedia.org/T315439 (10Cmjohnson) this is a loose power cable, I was in this rack trying to adjust power because it's alerting. I will fix this today. 
[13:34:10] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36844/console" [puppet] - 10https://gerrit.wikimedia.org/r/824739 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [13:37:22] (03PS1) 10Jdrewniak: Add back fixed width to main content [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824441 (https://phabricator.wikimedia.org/T315653) [13:44:05] 👋 happy Friday folks, it looks like the Web team has a bit of an emergency deploy situation on our hands https://phabricator.wikimedia.org/T315653 [13:44:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P32591 and previous config saved to /var/cache/conftool/dbconfig/20220819-134411-marostegui.json [13:45:06] !log Install 10.4.26 on db2111 db2148 db2124 [13:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:31] jouncebot: now [13:45:31] For the next 17 hour(s) and 14 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220819T0700) [13:46:35] thcipriani, dancy, ^demon: please see jan_drewniak's comment as Deployment/Emergencies requires releng [13:55:57] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox DNS changes are not updating - https://phabricator.wikimedia.org/T315630 (10Papaul) @cmooney thank you my changes are now on the DNS server. 
[13:56:35] (03PS1) 10Jforrester: TranslatableBundleLogFormatter: Cast reason to string before passing it [extensions/Translate] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824442 (https://phabricator.wikimedia.org/T315657) [13:57:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2024.mgmt.codfw.wmnet with reboot policy FORCED [13:59:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T312972)', diff saved to https://phabricator.wikimedia.org/P32592 and previous config saved to /var/cache/conftool/dbconfig/20220819-135917-marostegui.json [13:59:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [13:59:22] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [13:59:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [13:59:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:59:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:59:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T312972)', diff saved to https://phabricator.wikimedia.org/P32593 and previous config saved to /var/cache/conftool/dbconfig/20220819-135956-marostegui.json [14:01:14] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T312972)', diff saved to https://phabricator.wikimedia.org/P32594 and previous config saved to 
/var/cache/conftool/dbconfig/20220819-140216-marostegui.json [14:04:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2024.mgmt.codfw.wmnet with reboot policy FORCED [14:13:00] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2024'] [14:17:20] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:17:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P32595 and previous config saved to /var/cache/conftool/dbconfig/20220819-141722-marostegui.json [14:19:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2024'] [14:21:24] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10Papaul) [14:22:55] (03PS1) 10Andrew Bogott: wmcs_backup_instances.yaml: move all VM backups to new cloudbackup hosts [puppet] - 10https://gerrit.wikimedia.org/r/824747 (https://phabricator.wikimedia.org/T302535) [14:23:55] (03CR) 10Herron: "Very nice! 
LGTM pending followup on Filippo's comments" [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [14:25:53] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Metrics, 10observability, and 4 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10ori) [14:27:11] (03CR) 10Andrew Bogott: [C: 03+2] wmcs_backup_instances.yaml: move all VM backups to new cloudbackup hosts [puppet] - 10https://gerrit.wikimedia.org/r/824747 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [14:28:36] (03PS1) 10Papaul: Add kubernetes202[34] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/824748 (https://phabricator.wikimedia.org/T313870) [14:29:08] (03PS3) 10Btullis: Add new admin_ng values for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) [14:30:36] (03CR) 10Papaul: [C: 03+2] Add kubernetes202[34] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/824748 (https://phabricator.wikimedia.org/T313870) (owner: 10Papaul) [14:32:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P32596 and previous config saved to /var/cache/conftool/dbconfig/20220819-143228-marostegui.json [14:32:33] (03CR) 10Herron: WIP dispatch: add database role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:33:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2023.codfw.wmnet with OS bullseye [14:34:00] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2023.codfw.wmnet with OS 
bullseye [14:40:50] jan_drewniak: I am available to help w/ the emergency deployment [14:45:13] dancy: that would be super great, it's a one line CSS fix but our layout is kinda broken without it https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/824441 we didn't notice earlier because of the train rollback [14:45:26] ok.. If you're ready we can do it now. [14:45:44] dancy: that would be great! [14:45:59] Starting [14:46:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824441 (https://phabricator.wikimedia.org/T315653) (owner: 10Jdrewniak) [14:47:10] (03CR) 10Btullis: Add new admin_ng values for the dse-k8s-eqiad cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [14:47:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T312972)', diff saved to https://phabricator.wikimedia.org/P32597 and previous config saved to /var/cache/conftool/dbconfig/20220819-144734-marostegui.json [14:47:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:47:39] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [14:47:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:47:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32598 and previous config saved to /var/cache/conftool/dbconfig/20220819-144755-marostegui.json [14:48:43] (03CR) 10Btullis: Add new admin_ng values for the dse-k8s-eqiad cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [14:48:46] 
!log dancy@deploy1002 backport aborted: (duration: 03m 01s) [14:48:51] ^(ignore that) [14:50:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824441 (https://phabricator.wikimedia.org/T315653) (owner: 10Jdrewniak) [14:50:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32599 and previous config saved to /var/cache/conftool/dbconfig/20220819-145015-marostegui.json [14:52:26] (03PS1) 10Cwhite: beta-logs: dlq use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824751 (https://phabricator.wikimedia.org/T305175) [14:52:28] (03PS1) 10Cwhite: logstash: dlq use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824752 (https://phabricator.wikimedia.org/T305175) [14:52:30] (03PS1) 10Cwhite: beta-logs: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824753 (https://phabricator.wikimedia.org/T305175) [14:52:32] (03PS1) 10Cwhite: logstash: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824754 (https://phabricator.wikimedia.org/T305175) [14:52:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2024.codfw.wmnet with OS bullseye [14:52:44] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2024.codfw.wmnet with OS bullseye [14:53:51] (03CR) 10CI reject: [V: 04-1] beta-logs: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824753 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [14:54:32] (03CR) 10CI reject: [V: 04-1] logstash: w3creportingapi to 
use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824754 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [14:54:42] 10SRE, 10ops-codfw: Recycling Pickup for CODFW - https://phabricator.wikimedia.org/T307694 (10Papaul) 05Open→03Resolved All the Netbox entries deleted [14:55:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2023.codfw.wmnet with reason: host reimage [14:56:16] (03PS2) 10Cwhite: beta-logs: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824753 (https://phabricator.wikimedia.org/T305175) [14:56:41] (03PS2) 10Cwhite: logstash: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824754 (https://phabricator.wikimedia.org/T305175) [14:59:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2023.codfw.wmnet with reason: host reimage [15:00:14] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10Krinkle) [15:00:31] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10Krinkle) [15:04:04] (03PS4) 10Cwhite: tcpircbot: send !log events to log stream [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) [15:04:24] (03Merged) 10jenkins-bot: Add back fixed width to main content [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824441 (https://phabricator.wikimedia.org/T315653) (owner: 10Jdrewniak) [15:04:30] (03CR) 10Cwhite: tcpircbot: send !log events to log stream (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [15:04:53] !log 
dancy@deploy1002 Started scap: Backport for [[gerrit:824441|Add back fixed width to main content (T315653)]] [15:04:57] T315653: Regression: fixed width broken on Vector (2022) - https://phabricator.wikimedia.org/T315653 [15:05:05] dancy: 15 min later, finally merged :P [15:05:21] yeah.. :-/ [15:05:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P32600 and previous config saved to /var/cache/conftool/dbconfig/20220819-150521-marostegui.json [15:05:52] jan_drewniak: Alright. Your change is on mwdebug. Test is out [15:05:54] *it [15:07:12] (03PS3) 10Cwhite: logstash: add support for rsyslog-namespaced fields [puppet] - 10https://gerrit.wikimedia.org/r/824314 (https://phabricator.wikimedia.org/T315500) [15:07:58] dancy: yup! that's definitely fixed! good to sync :) [15:08:08] Excellent. Proceeding [15:10:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T314041)', diff saved to https://phabricator.wikimedia.org/P32601 and previous config saved to /var/cache/conftool/dbconfig/20220819-151053-ladsgroup.json [15:10:58] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [15:11:02] 10SRE, 10Performance-Team, 10Thumbor, 10Sustainability (Incident Followup): Lower per-IP PoolCounter throttling Thumbor settings - https://phabricator.wikimedia.org/T252426 (10Krinkle) [15:11:52] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:824441|Add back fixed width to main content (T315653)]] (duration: 06m 59s) [15:11:56] T315653: Regression: fixed width broken on Vector (2022) - https://phabricator.wikimedia.org/T315653 [15:12:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:12:09] jan_drewniak: Done [15:12:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2024.codfw.wmnet with reason: host reimage [15:12:56] 
!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:12:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:13:26] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10Krinkle) [15:13:44] 10SRE, 10Wikimedia-Logstash, 10observability, 10Sustainability (Incident Followup), 10User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (10Krinkle) [15:14:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2023.codfw.wmnet with OS bullseye [15:14:25] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2023.codfw.wmnet with OS bullseye completed: - kub... [15:14:26] dancy: excellent! thank you so much! 
and sorry for bugging you on a friday :P [15:14:31] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2009 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [15:14:33] No problem [15:16:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2024.codfw.wmnet with reason: host reimage [15:16:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:20:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P32602 and previous config saved to /var/cache/conftool/dbconfig/20220819-152027-marostegui.json [15:23:06] !log dancy@deploy1002 Installing scap version "4.14.0" for 556 hosts [15:24:23] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10Papaul) [15:25:02] !log dancy@deploy1002 Installation of scap version "4.14.0" completed for 556 hosts [15:25:52] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites - https://phabricator.wikimedia.org/T260449 (10Krinkle) [15:26:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P32603 and previous config saved to /var/cache/conftool/dbconfig/20220819-152559-ladsgroup.json [15:26:03] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Sustainability (Incident Followup): clean up workaround and measurements put in place during Jio RPKI error - https://phabricator.wikimedia.org/T260452 (10Krinkle) [15:27:59] !log pt1979@cumin2002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host kubernetes2024.codfw.wmnet with OS bullseye [15:28:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2024.codfw.wmnet with OS bullseye completed: - kub... [15:35:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32604 and previous config saved to /var/cache/conftool/dbconfig/20220819-153533-marostegui.json [15:35:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [15:35:38] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [15:35:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [15:35:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32605 and previous config saved to /var/cache/conftool/dbconfig/20220819-153554-marostegui.json [15:36:54] thanks for handling the emergency deploy dancy ! 
[15:37:10] 👍🏾 [15:37:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32606 and previous config saved to /var/cache/conftool/dbconfig/20220819-153714-marostegui.json [15:37:43] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10Papaul) [15:38:28] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10Papaul) 05Open→03Resolved @akosiaris all yours [15:41:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P32607 and previous config saved to /var/cache/conftool/dbconfig/20220819-154105-ladsgroup.json [15:43:54] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2009 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [15:52:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P32608 and previous config saved to /var/cache/conftool/dbconfig/20220819-155220-marostegui.json [15:56:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T314041)', diff saved to https://phabricator.wikimedia.org/P32609 and previous config saved to /var/cache/conftool/dbconfig/20220819-155611-ladsgroup.json [15:56:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:56:16] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [15:56:27] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:07:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P32610 and previous config saved to /var/cache/conftool/dbconfig/20220819-160726-marostegui.json [16:10:30] 10SRE, 10Infrastructure-Foundations, 10netops: Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (10cmooney) p:05High→03Low Thanks yep case opened with JTAC now will keep it open to document any information they may provide. [16:22:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32611 and previous config saved to /var/cache/conftool/dbconfig/20220819-162232-marostegui.json [16:22:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [16:22:37] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [16:22:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [16:22:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32612 and previous config saved to /var/cache/conftool/dbconfig/20220819-162253-marostegui.json [16:25:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32613 and previous config saved to /var/cache/conftool/dbconfig/20220819-162513-marostegui.json [16:29:22] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:30:23] (03CR) 10Klausman: [C: 03+1] Add etcd data for dse-k8s kubeserver-api backend selection. [puppet] - 10https://gerrit.wikimedia.org/r/824705 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [16:31:36] (03CR) 10Klausman: [C: 03+1] Add dummy infrastructure_users for dse-k8s cluster (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [16:35:27] (03CR) 10Herron: [C: 03+1] tcpircbot: send !log events to log stream [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [16:38:55] (03PS1) 10Majavah: P:openstack::codfw1dev::db: support TLS [puppet] - 10https://gerrit.wikimedia.org/r/824764 (https://phabricator.wikimedia.org/T310795) [16:38:57] (03PS1) 10Majavah: P:openstack::codfw1dev::db: cleanup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/824765 [16:40:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P32614 and previous config saved to /var/cache/conftool/dbconfig/20220819-164019-marostegui.json [16:41:43] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy infrastructure_users for dse-k8s cluster (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [16:46:48] (03PS1) 10Btullis: Add a dummy certificate for dse_k8s [labs/private] - 10https://gerrit.wikimedia.org/r/824767 (https://phabricator.wikimedia.org/T310196) [16:51:52] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add a dummy certificate for dse_k8s [labs/private] - 10https://gerrit.wikimedia.org/r/824767 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [16:52:47] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36848/console" [puppet] - 10https://gerrit.wikimedia.org/r/824694 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [16:55:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P32615 and previous config saved to /var/cache/conftool/dbconfig/20220819-165525-marostegui.json [16:59:41] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::codfw1dev::db: support TLS [puppet] - 10https://gerrit.wikimedia.org/r/824764 (https://phabricator.wikimedia.org/T310795) (owner: 10Majavah) [17:10:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32616 and previous config saved to /var/cache/conftool/dbconfig/20220819-171031-marostegui.json [17:10:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [17:10:37] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [17:10:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [17:10:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T312972)', diff saved to https://phabricator.wikimedia.org/P32617 and previous config saved to /var/cache/conftool/dbconfig/20220819-171052-marostegui.json [17:15:20] (03PS1) 10Isaac Johnson: Addition of Varnish logic for setting include_pv cookie on x-analytics and WMF-DP on client. [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) [17:15:57] (03CR) 10CI reject: [V: 04-1] Addition of Varnish logic for setting include_pv cookie on x-analytics and WMF-DP on client. 
[puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [17:16:54] (03CR) 10Isaac Johnson: "Differential Privacy patch" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [17:22:51] (03PS2) 10Isaac Johnson: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) [17:23:27] (03CR) 10CI reject: [V: 04-1] Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [17:34:45] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (10RobH) @Papaul, Please note we've ordered (2) replacement SSDs, and the one sourced from Amazon will ship directly to your home (due to Amazon/USPS not being able to deliver to datacenters.) Please test out the... [17:36:20] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:42:29] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Ottomata) Thank you! We'll need {T314156} before we can proceed. Thanks! 
[17:43:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T312972)', diff saved to https://phabricator.wikimedia.org/P32618 and previous config saved to /var/cache/conftool/dbconfig/20220819-174317-marostegui.json [17:43:22] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [17:44:56] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (10CDanis) Looks like this is resolved...? [17:50:36] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [17:58:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P32619 and previous config saved to /var/cache/conftool/dbconfig/20220819-175823-marostegui.json [18:07:58] 10SRE, 10SRE-OnFire, 10Observability-Alerting: vopsbot's home directory doesn't get created - https://phabricator.wikimedia.org/T315568 (10Dzahn) thanks @jbond, ack. 
works for me:) [18:13:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P32620 and previous config saved to /var/cache/conftool/dbconfig/20220819-181329-marostegui.json [18:14:22] (03CR) 10Dzahn: [C: 03+2] gerrit: update style for Gerrit 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/824221 (https://phabricator.wikimedia.org/T315445) (owner: 10Hashar) [18:21:25] (03CR) 10Dzahn: [C: 03+1] gitlab: use actual backup name instead of latest on replica [puppet] - 10https://gerrit.wikimedia.org/r/824730 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [18:27:48] (03PS1) 10CDanis: bump excess concurrency tracking threshold [puppet] - 10https://gerrit.wikimedia.org/r/824775 (https://phabricator.wikimedia.org/T306580) [18:28:08] (03CR) 10CDanis: [C: 03+2] bump excess concurrency tracking threshold [puppet] - 10https://gerrit.wikimedia.org/r/824775 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [18:28:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T312972)', diff saved to https://phabricator.wikimedia.org/P32621 and previous config saved to /var/cache/conftool/dbconfig/20220819-182835-marostegui.json [18:28:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:28:40] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [18:29:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:29:49] (03CR) 10Dzahn: [C: 03+1] gitlab: rotate backups on replica [puppet] - 10https://gerrit.wikimedia.org/r/824739 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [18:37:26] 10SRE, 10Znuny, 10serviceops: Move VTRS db passwords to a different hiera location - 
https://phabricator.wikimedia.org/T303272 (10Arnoldokoth) Hey @Kormat I was wondering if this was resolved. I noticed the file now references some vrts passwords. ` # less modules/profile/templates/mariadb/grants/production-... [18:37:38] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:37:48] (03PS3) 10Dzahn: Revert "Revert "site: add phabricator role to phab2002"" [puppet] - 10https://gerrit.wikimedia.org/r/823636 [18:38:26] (03PS4) 10Majavah: puppetmaster: remove 'allow_from' [puppet] - 10https://gerrit.wikimedia.org/r/799859 [18:40:03] (03CR) 10BBlack: "Nice work, and I apologize on behalf of both the VCL and C languages!" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [18:43:20] (03CR) 10BBlack: Varnish analytics: support differential privacy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [19:05:37] (03PS1) 10Eevans: eevans: replace 2048bit RSA key with new ed25519 one [puppet] - 10https://gerrit.wikimedia.org/r/824776 [19:18:52] (03CR) 10Dzahn: "let's try again after previous changes now removed the LVS setup" [puppet] - 10https://gerrit.wikimedia.org/r/823636 (owner: 10Dzahn) [19:22:20] (03CR) 10Dzahn: [C: 04-1] "nope. now new issue from using systemd::sysuser. 
Duplicate declaration: Group[phd] is already declared" [puppet] - 10https://gerrit.wikimedia.org/r/823636 (owner: 10Dzahn) [19:39:32] (03CR) 10BCornwall: [C: 03+1] utils: Add latency measurement program [dns] - 10https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) (owner: 10MMandere) [19:40:05] (03CR) 10BCornwall: [C: 03+1] "Assuming the linting will be fixed, +1" [dns] - 10https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) (owner: 10MMandere) [19:43:30] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (10andrea.denisse) 05Open→03Resolved [19:44:45] (03PS1) 10Dzahn: phabricator: avoid duplicate declaration mixing group{} and systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/824782 (https://phabricator.wikimedia.org/T313360) [19:49:38] 10SRE-OnFire, 10Beta-Cluster-Infrastructure, 10Sustainability (Incident Followup): Add basic alerting to the Beta Cluster - https://phabricator.wikimedia.org/T315695 (10TheresNoTime) [20:18:41] (03PS1) 10Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 [20:19:10] (03CR) 10AOkoth: [C: 03+1] gitlab: use actual backup name instead of latest on replica [puppet] - 10https://gerrit.wikimedia.org/r/824730 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [20:19:56] (03CR) 10AOkoth: [C: 03+1] gitlab: rotate backups on replica [puppet] - 10https://gerrit.wikimedia.org/r/824739 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [20:20:07] (03CR) 10CI reject: [V: 04-1] cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [20:22:02] (03Abandoned) 10Bking: bullseye: add thirdparty/elasticsearch-curator5 [puppet] - 
10https://gerrit.wikimedia.org/r/824569 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking) [20:22:59] (03PS2) 10Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 [20:23:42] (03CR) 10CI reject: [V: 04-1] cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [20:23:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [20:23:50] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:23:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [20:24:06] (03CR) 10Bking: [C: 03+2] bullseye: add thirdparty/elasticsearch-curator5 [puppet] - 10https://gerrit.wikimedia.org/r/824568 (https://phabricator.wikimedia.org/T315604) (owner: 10Ryan Kemper) [20:24:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:26:26] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48534 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:27:42] (03CR) 10Andrew Bogott: OpenStack nova.conf: set reclaim_instance_interval to half an hour (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798772 (owner: 10Andrew Bogott) [20:34:44] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:50:07] (03PS1) 10Bking: bullseye: apt component update [puppet] - 10https://gerrit.wikimedia.org/r/824791 (https://phabricator.wikimedia.org/T315604) [21:04:29] (03CR) 10Dzahn: [C: 03+2] 
"https://puppet-compiler.wmflabs.org/pcc-worker1003/36851/" [puppet] - 10https://gerrit.wikimedia.org/r/824782 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [21:05:32] (03CR) 10Dzahn: [C: 04-1] "https://gerrit.wikimedia.org/r/824782" [puppet] - 10https://gerrit.wikimedia.org/r/823636 (owner: 10Dzahn) [21:05:55] (03PS4) 10Dzahn: Revert "Revert "site: add phabricator role to phab2002"" [puppet] - 10https://gerrit.wikimedia.org/r/823636 [21:06:39] (03CR) 10Dzahn: [C: 03+2] "noop on prod hosts" [puppet] - 10https://gerrit.wikimedia.org/r/824782 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [21:07:40] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/36852/" [puppet] - 10https://gerrit.wikimedia.org/r/823636 (owner: 10Dzahn) [21:08:41] (03PS3) 10Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 [21:09:30] (03CR) 10CI reject: [V: 04-1] cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [21:14:27] (03PS4) 10Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 [21:15:27] (03CR) 10CI reject: [V: 04-1] cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [21:16:36] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10Jclark-ctr) Confirmed: Service Request 149361152 was successfully submitted. 
[21:16:41] (03PS5) 10Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 [21:25:08] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:25:53] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Papaul) [21:26:42] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Papaul) 05Open→03Resolved Complete [21:29:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:34:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:43:56] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [21:46:18] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 19 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [22:02:04] (03PS1) 10BCornwall: WIP: No idea what I'm doing [puppet] - 10https://gerrit.wikimedia.org/r/824793 [22:05:20] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 
(10BCornwall) @BBlack I've uploaded a patch that vaguely resembles what this ticket wants, but I have a few questions: 1. Am I even in the right galaxy with t... [22:06:42] (03CR) 10Dzahn: [C: 03+2] "finally..it seems: https://puppet-compiler.wmflabs.org/pcc-worker1002/36852/phab2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/823636 (owner: 10Dzahn) [22:11:39] (03CR) 10Dzahn: [C: 03+2] "unbelievable.. yet another problem and you can't even find it in compiler. "Found 1 dependency cycle"" [puppet] - 10https://gerrit.wikimedia.org/r/823636 (owner: 10Dzahn) [22:12:07] (03CR) 10Dzahn: [C: 03+2] "Error: Found 1 dependency cycle:" [puppet] - 10https://gerrit.wikimedia.org/r/823636 (owner: 10Dzahn) [22:13:58] Error: Found 1 dependency cycle: [22:14:00] (Exec[Refresh sysusers] => User[scap] => Exec[bootstrap-scap-target] => Class[Scap] => Scap::Target[phabricator/deployment] => Package[phabricator/deployment] => Class[Phabricator::Phd] => Systemd::Sysuser[phd] => File[/etc/sysusers.d/phd.conf] => Exec[Refresh sysusers]) [22:14:05] "great" [22:23:46] (03CR) 10Dzahn: R:systemd::sysuser: drop managehome parameter as it dosn;t work (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond) [22:24:22] (03CR) 10Dzahn: "see inline comment. 
I am getting a circular dependency issue when I have 2 system users created in one role on a new host:" [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond) [22:31:04] (03PS1) 10Dzahn: phabricator: don't use systemd::sysuser on phab2002 for now [puppet] - 10https://gerrit.wikimedia.org/r/824796 (https://phabricator.wikimedia.org/T313360) [22:34:55] 10SRE, 10SRE-OnFire, 10Observability-Alerting: vopsbot's home directory doesn't get created - https://phabricator.wikimedia.org/T315568 (10Dzahn) I am applying a role on a new node and it creates 2 separate system users with systemd::sysuser (scap and phd) now. In the compiler everything seemed fine but t... [22:42:55] (03CR) 10Dzahn: [C: 03+2] "we still want this and UID 920 but we also don't want to leave puppet broken and we do want to know if the rest of the role works or there" [puppet] - 10https://gerrit.wikimedia.org/r/824796 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [22:48:48] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:51:05] (03CR) 10Dzahn: [C: 03+2] "after https://gerrit.wikimedia.org/r/c/operations/puppet/+/824796 it mostly works but still an issue to be solved with the sshd config. it" [puppet] - 10https://gerrit.wikimedia.org/r/823636 (owner: 10Dzahn) [22:53:06] PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: ssh.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:31] ACKNOWLEDGEMENT - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: ssh.service daniel_zahn new host, bug, debugging https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:55:57] ^ very annoyed but it's not in prod yet and i'll debug it [22:56:21] it's because phab hosts have more than one sshd .. 
.. [22:56:27] or had [22:56:56] just saying because "ssh.service" looks kind of like the worst that could cause the systemd state alert [22:59:58] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:02] PROBLEM - SSH on phab2002 is CRITICAL: connect to address 10.192.32.54 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:02:20] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:02:51] !log phab2002 - disable puppet, fix sshd_config, restart sshd [23:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:24] RECOVERY - SSH on phab2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:04:12] RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:26] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:32] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:43] (03PS1) 10Dzahn: phabricator: fix sshd listen address for phab codfw [puppet] - 10https://gerrit.wikimedia.org/r/824797 (https://phabricator.wikimedia.org/T280597) [23:21:00] RECOVERY - Check systemd state on mw2397 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:26] (03PS1) 10Dzahn: phabricator: move vcs and LVS settings from common to 
phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597) [23:24:36] (03CR) 10Dzahn: "also fixes the TODO to de-duplicate this stuff.. better.. just remove it all" [puppet] - 10https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:29:45] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36853/" [puppet] - 10https://gerrit.wikimedia.org/r/824797 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:32:31] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop on phab1001 (prod), 2001. fixed / re-enabled puppet and sshd on 2002" [puppet] - 10https://gerrit.wikimedia.org/r/824797 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:33:48] !log phab2002 - re-enabled puppet, sshd config ListenAddress fixed by puppet gerrit:824797 - now has phabricator prod role but without LVS/git-ssh - no more error in puppet run - T280597 [23:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:52] T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 [23:35:37] !log phab2002 - service phd: stopped phabricator_logmail: disabled, phabricator dumps: disabled, systemd::sysuser: not used (all via Hiera switches) - T280597 [23:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:37] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on phab2002.codfw.wmnet with reason: new host in setup [23:37:52] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on phab2002.codfw.wmnet with reason: new host in setup [23:41:00] (03CR) 10Dzahn: [C: 04-1] "this exposes there is still work to do to remove remnants of VCS in a clean way:" [puppet] - 10https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:43:36] (03CR) 10Dzahn: [C: 04-1] "it's 
because in phabricator::main profile we say the IPs have a default value of "undef" but ALSO the type is an IP and that's not optiona" [puppet] - 10https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:47:02] (03PS1) 10Dzahn: phabricator: make IP addresses for vcs optional parameters [puppet] - 10https://gerrit.wikimedia.org/r/824800 (https://phabricator.wikimedia.org/T280597) [23:49:54] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:49:54] (03CR) 10Dzahn: [C: 03+2] "noop https://puppet-compiler.wmflabs.org/pcc-worker1002/36855/" [puppet] - 10https://gerrit.wikimedia.org/r/824800 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:52:31] jouncebot: poke jenkins [23:54:05] (03CR) 10Dzahn: [C: 03+2] "noop in prod" [puppet] - 10https://gerrit.wikimedia.org/r/824800 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:56:08] (03CR) 10Dzahn: [C: 03+2] "works after https://gerrit.wikimedia.org/r/c/operations/puppet/+/824800" [puppet] - 10https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:56:45] (03CR) 10Dzahn: [C: 03+2] "removes phab2001-vcs.codfw.wmnet., git-ssh.codfw.wikimedia.org IPv4 and IPv6 IP parameters from new phab host 2002" [puppet] - 10https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:58:32] (03CR) 10Dzahn: [C: 03+2] "noop on all" [puppet] - 10https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)