[00:17:19] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:37] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:37] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:27] <wikibugs>	 SRE, SRE-OnFire, Observability-Alerting: vopsbot's home directory doesn't get created - https://phabricator.wikimedia.org/T315568 (Dzahn) I can confirm this behaviour. When using systemd::sysuser on a new host it does not create the home dir. (I just started using this for phab hosts and the phd user...
[00:52:23] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:54:09] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:56:21] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48534 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:56:57] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.295 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:00:11] <wikibugs>	 (PS1) Tim Starling: Factor out x2 per-host hieradata into an objectstash role [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427)
[01:00:47] <wikibugs>	 (CR) CI reject: [V: -1] Factor out x2 per-host hieradata into an objectstash role [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427) (owner: Tim Starling)
[01:02:09] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:07:48] <wikibugs>	 (PS1) Tim Starling: SqlBagOStuff: use cancelAtomic() [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824434 (https://phabricator.wikimedia.org/T315274)
[01:08:31] <wikibugs>	 (CR) Tim Starling: [C: +2] SqlBagOStuff: use cancelAtomic() [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824434 (https://phabricator.wikimedia.org/T315274) (owner: Tim Starling)
[01:24:45] <wikibugs>	 (Merged) jenkins-bot: SqlBagOStuff: use cancelAtomic() [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824434 (https://phabricator.wikimedia.org/T315274) (owner: Tim Starling)
[01:29:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:30:59] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.25/includes/libs/rdbms/database/DBConnRef.php: fix potential mainstash exception file 1 T315274 (duration: 03m 21s)
[01:31:03] <stashbot>	 T315274: Wikimedia\Rdbms\DBTransactionError: Explicit transaction still active; a caller might have failed to call endAtomic() or cancelAtomic(). - https://phabricator.wikimedia.org/T315274
[01:31:59] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:36:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:37:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:37:59] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.25/includes/objectcache/SqlBagOStuff.php: fix potential mainstash exception file 2 T315274 (duration: 03m 30s)
[01:38:03] <stashbot>	 T315274: Wikimedia\Rdbms\DBTransactionError: Explicit transaction still active; a caller might have failed to call endAtomic() or cancelAtomic(). - https://phabricator.wikimedia.org/T315274
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:08] <wikibugs>	 (PS2) Tim Starling: Factor out x2 per-host hieradata into an objectstash role [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427)
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:54:01] <icinga-wm>	 PROBLEM - puppet last run on gitlab2002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:56:05] <wikibugs>	 (CR) Tim Starling: "Puppet compiler result: https://puppet-compiler.wmflabs.org/pcc-worker1001/36833/" [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427) (owner: Tim Starling)
[02:03:19] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:04:09] <wikibugs>	 SRE-OnFire, Performance-Team, MW-1.39-notes (1.39.0-wmf.25; 2022-08-15), Wikimedia-Incident, Wikimedia-production-error: Wikimedia\Rdbms\DBTransactionError: Explicit transaction still active; a caller might have failed to call endAtomic() or cancelAtomi... - https://phabricator.wikimedia.org/T315274
[02:06:43] <icinga-wm>	 RECOVERY - puppet last run on gitlab2002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:07:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:29] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:16:07] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:47] <wikibugs>	 (PS6) Ori: Incremental roll-out of query-sorting (0%) [puppet] - https://gerrit.wikimedia.org/r/822434
[02:23:23] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:37:21] <wikibugs>	 (CR) Ori: "PS 6: moved normalize_request_nonmisc below vcl_init. Otherwise we get a 'Symbol not found: cache_local' error from VCC-Compiler. This is " [puppet] - https://gerrit.wikimedia.org/r/822434 (owner: Ori)
[03:23:49] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:28:55] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:01] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:53] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:17] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:03] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: search-drop-query-clicks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:43:05] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:27] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:55:05] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[03:57:27] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 14 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:26:02] <logmsgbot>	 !log hashar@deploy1002 Started deploy [integration/docroot@09eb565]: zuul: Fix/remove links to non-existent Grafana graphs - T307405
[04:26:07] <stashbot>	 T307405: Broken dashboard links on Zuul Status page - https://phabricator.wikimedia.org/T307405
[04:26:15] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [integration/docroot@09eb565]: zuul: Fix/remove links to non-existent Grafana graphs - T307405 (duration: 00m 13s)
[04:39:39] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:47] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:11] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:53:53] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:59] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:13] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:15:09] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:20:01] <marostegui>	 !log Install 10.6.9 on db2122 and db2146
[05:20:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:07] <icinga-wm>	 PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:24:09] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db2177-db2180 [puppet] - https://gerrit.wikimedia.org/r/824583 (https://phabricator.wikimedia.org/T311494)
[05:25:04] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db2177-db2180 [puppet] - https://gerrit.wikimedia.org/r/824583 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[05:27:21] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:28:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[05:28:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[05:29:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T312972)', diff saved to https://phabricator.wikimedia.org/P32543 and previous config saved to /var/cache/conftool/dbconfig/20220819-052900-marostegui.json
[05:29:04] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[05:31:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312972)', diff saved to https://phabricator.wikimedia.org/P32544 and previous config saved to /var/cache/conftool/dbconfig/20220819-053110-marostegui.json
[05:31:35] <wikibugs>	 (PS1) Marostegui: db2164: Not future sanitarium master [puppet] - https://gerrit.wikimedia.org/r/824584
[05:33:01] <wikibugs>	 (PS2) Marostegui: db2152: Not future sanitarium master [puppet] - https://gerrit.wikimedia.org/r/824584
[05:34:09] <wikibugs>	 (CR) Marostegui: [C: +2] db2152: Not future sanitarium master [puppet] - https://gerrit.wikimedia.org/r/824584 (owner: Marostegui)
[05:36:40] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db2181 [puppet] - https://gerrit.wikimedia.org/r/824585 (https://phabricator.wikimedia.org/T311494)
[05:37:57] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize db2181 [puppet] - https://gerrit.wikimedia.org/r/824585 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[05:46:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P32546 and previous config saved to /var/cache/conftool/dbconfig/20220819-054616-marostegui.json
[05:48:25] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:54:06] <wikibugs>	 (PS1) Marostegui: mariadb: Bump version to 10.4.26 [software] - https://gerrit.wikimedia.org/r/824586 (https://phabricator.wikimedia.org/T315411)
[05:57:41] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:07] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P32547 and previous config saved to /var/cache/conftool/dbconfig/20220819-060122-marostegui.json
[06:06:02] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Bump version to 10.4.26 [software] - https://gerrit.wikimedia.org/r/824586 (https://phabricator.wikimedia.org/T315411) (owner: Marostegui)
[06:06:33] <wikibugs>	 (Merged) jenkins-bot: mariadb: Bump version to 10.4.26 [software] - https://gerrit.wikimedia.org/r/824586 (https://phabricator.wikimedia.org/T315411) (owner: Marostegui)
[06:12:37] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:15:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P32548 and previous config saved to /var/cache/conftool/dbconfig/20220819-061515-root.json
[06:16:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312972)', diff saved to https://phabricator.wikimedia.org/P32549 and previous config saved to /var/cache/conftool/dbconfig/20220819-061628-marostegui.json
[06:16:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[06:16:33] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[06:16:44] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[06:16:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T312972)', diff saved to https://phabricator.wikimedia.org/P32550 and previous config saved to /var/cache/conftool/dbconfig/20220819-061649-marostegui.json
[06:18:11] <wikibugs>	 (PS1) Giuseppe Lavagetto: admin: (oblivian) add helper functions to my bashrc [puppet] - https://gerrit.wikimedia.org/r/824588
[06:21:23] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:24:22] <wikibugs>	 (CR) Hashar: [V: +1] "I have tested it locally with Gerrit 3.4 and this does not alter the rendering. There is no voteChip class and the styling is still done b" [puppet] - https://gerrit.wikimedia.org/r/824221 (https://phabricator.wikimedia.org/T315445) (owner: Hashar)
[06:30:55] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:36:53] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] admin: (oblivian) add helper functions to my bashrc [puppet] - https://gerrit.wikimedia.org/r/824588 (owner: Giuseppe Lavagetto)
[06:39:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T312972)', diff saved to https://phabricator.wikimedia.org/P32551 and previous config saved to /var/cache/conftool/dbconfig/20220819-063903-marostegui.json
[06:39:08] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[06:54:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P32552 and previous config saved to /var/cache/conftool/dbconfig/20220819-065409-marostegui.json
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220819T0700)
[07:03:04] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Netbox DNS changes are not updating - https://phabricator.wikimedia.org/T315630 (cmooney) Hi @Papaul   My apologies that's due to me, as soon as I can get https://gerrit.wikimedia.org/r/c/operations/dns/+/824572 merged I'll sort it.
[07:08:21] <wikibugs>	 (CR) Marostegui: [C: +1] auto_schema: Treat master of all dcs like a master of active dc [software] - https://gerrit.wikimedia.org/r/820216 (https://phabricator.wikimedia.org/T314486) (owner: Ladsgroup)
[07:09:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P32553 and previous config saved to /var/cache/conftool/dbconfig/20220819-070916-marostegui.json
[07:10:06] <wikibugs>	 (CR) Ladsgroup: [C: +2] auto_schema: Treat master of all dcs like a master of active dc [software] - https://gerrit.wikimedia.org/r/820216 (https://phabricator.wikimedia.org/T314486) (owner: Ladsgroup)
[07:10:54] <wikibugs>	 (CR) Ladsgroup: [C: +2] auto_schema: Add tests for replicas to pick [software] - https://gerrit.wikimedia.org/r/820181 (https://phabricator.wikimedia.org/T299445) (owner: Ladsgroup)
[07:11:25] <wikibugs>	 (Merged) jenkins-bot: auto_schema: Add tests for replicas to pick [software] - https://gerrit.wikimedia.org/r/820181 (https://phabricator.wikimedia.org/T299445) (owner: Ladsgroup)
[07:11:28] <wikibugs>	 (Merged) jenkins-bot: auto_schema: Treat master of all dcs like a master of active dc [software] - https://gerrit.wikimedia.org/r/820216 (https://phabricator.wikimedia.org/T314486) (owner: Ladsgroup)
[07:12:22] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (Ladsgroup)
[07:18:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[07:19:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[07:19:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[07:19:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[07:19:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T314041)', diff saved to https://phabricator.wikimedia.org/P32555 and previous config saved to /var/cache/conftool/dbconfig/20220819-071934-ladsgroup.json
[07:19:38] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[07:20:33] <Amir1>	 !log killing cswiki's refreshlinksrecom script T299021
[07:20:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:36] <stashbot>	 T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021
[07:24:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T312972)', diff saved to https://phabricator.wikimedia.org/P32556 and previous config saved to /var/cache/conftool/dbconfig/20220819-072422-marostegui.json
[07:24:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:24:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:24:27] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[07:24:39] <icinga-wm>	 RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:28:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T312972)', diff saved to https://phabricator.wikimedia.org/P32557 and previous config saved to /var/cache/conftool/dbconfig/20220819-072800-marostegui.json
[07:32:23] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db2182 [puppet] - https://gerrit.wikimedia.org/r/824682 (https://phabricator.wikimedia.org/T311494)
[07:34:14] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize db2182 [puppet] - https://gerrit.wikimedia.org/r/824682 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[07:43:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P32558 and previous config saved to /var/cache/conftool/dbconfig/20220819-074306-marostegui.json
[07:46:29] <wikibugs>	 (CR) Marostegui: "This looks good, but I would like to ask Jaime for his thoughts too" [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427) (owner: Tim Starling)
[07:58:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P32559 and previous config saved to /var/cache/conftool/dbconfig/20220819-075812-marostegui.json
[08:04:56] <wikibugs>	 (CR) Jcrespo: [C: +1] Factor out x2 per-host hieradata into an objectstash role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427) (owner: Tim Starling)
[08:06:16] <wikibugs>	 (PS1) Phuedx: Remove $wgWMESearchRelevancePages [mediawiki-config] - https://gerrit.wikimedia.org/r/824685
[08:11:54] <wikibugs>	 (CR) MVernon: [C: +2] Add user eevans to ops group [puppet] - https://gerrit.wikimedia.org/r/824567 (owner: Eevans)
[08:12:45] <wikibugs>	 (PS2) Phuedx: Remove $wgWMESearchRelevancePages [mediawiki-config] - https://gerrit.wikimedia.org/r/824685
[08:13:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T312972)', diff saved to https://phabricator.wikimedia.org/P32561 and previous config saved to /var/cache/conftool/dbconfig/20220819-081317-marostegui.json
[08:13:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[08:13:22] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[08:13:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[08:13:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:13:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:13:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T312972)', diff saved to https://phabricator.wikimedia.org/P32562 and previous config saved to /var/cache/conftool/dbconfig/20220819-081356-marostegui.json
[08:15:30] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:15:50] <icinga-wm>	 PROBLEM - SSH on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:16:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312972)', diff saved to https://phabricator.wikimedia.org/P32563 and previous config saved to /var/cache/conftool/dbconfig/20220819-081606-marostegui.json
[08:16:43] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be2067.codfw.wmnet
[08:16:44] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2067.codfw.wmnet
[08:18:12] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1015 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:25] <wikibugs>	 (PS1) MVernon: swift: ms-be2067/sdc1 has failed [puppet] - https://gerrit.wikimedia.org/r/824686 (https://phabricator.wikimedia.org/T314049)
[08:26:24] <icinga-wm>	 PROBLEM - SSH on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:31:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:31:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P32564 and previous config saved to /var/cache/conftool/dbconfig/20220819-083112-marostegui.json
[08:31:24] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:33:59] <wikibugs>	 (PS2) MMandere: utils: Add latency measurement program [dns] - https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536)
[08:34:20] <icinga-wm>	 PROBLEM - SSH on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:34:49] <wikibugs>	 (CR) CI reject: [V: -1] utils: Add latency measurement program [dns] - https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) (owner: MMandere)
[08:35:39] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] swift: ms-be2067/sdc1 has failed [puppet] - https://gerrit.wikimedia.org/r/824686 (https://phabricator.wikimedia.org/T314049) (owner: MVernon)
[08:38:20] <wikibugs>	 (PS1) Marostegui: site.pp: Remove insetup from db2181 and db2182 [puppet] - https://gerrit.wikimedia.org/r/824687 (https://phabricator.wikimedia.org/T311494)
[08:38:43] <wikibugs>	 (CR) MMandere: utils: Add latency measurement program (1 comment) [dns] - https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) (owner: MMandere)
[08:38:56] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "I would probably add a TODO to go back to the current form once we've got rid of buster, but LGTM" [puppet] - https://gerrit.wikimedia.org/r/824450 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[08:40:04] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] docker: use ExecStartPre to implement --pull=always [puppet] - https://gerrit.wikimedia.org/r/824450 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[08:40:06] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36835/console" [puppet] - https://gerrit.wikimedia.org/r/824567 (owner: Eevans)
[08:40:16] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Remove insetup from db2181 and db2182 [puppet] - https://gerrit.wikimedia.org/r/824687 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[08:40:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] service: use --env-file for docker [puppet] - https://gerrit.wikimedia.org/r/824451 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[08:40:35] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36835/console" [puppet] - https://gerrit.wikimedia.org/r/824567 (owner: Eevans)
[08:40:52] <wikibugs>	 (CR) MVernon: [C: +2] swift: ms-be2067/sdc1 has failed [puppet] - https://gerrit.wikimedia.org/r/824686 (https://phabricator.wikimedia.org/T314049) (owner: MVernon)
[08:40:54] <wikibugs>	 (PS3) Filippo Giunchedi: docker: use ExecStartPre to implement --pull=always [puppet] - https://gerrit.wikimedia.org/r/824450 (https://phabricator.wikimedia.org/T313229)
[08:42:44] <wikibugs>	 (PS3) Filippo Giunchedi: service: use --env-file for docker [puppet] - https://gerrit.wikimedia.org/r/824451 (https://phabricator.wikimedia.org/T313229)
[08:43:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] postgresql: default to autodetecting pg version [cookbooks] - https://gerrit.wikimedia.org/r/824486 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[08:43:35] <wikibugs>	 (PS2) Filippo Giunchedi: postgresql: default to autodetecting pg version [cookbooks] - https://gerrit.wikimedia.org/r/824486 (https://phabricator.wikimedia.org/T313229)
[08:44:00] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1016 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:44:01] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade.
[08:44:05] <wikibugs>	 (03PS1) 10Clément Goubert: icinga: add cgoubert to the right groups in icinga [puppet] - 10https://gerrit.wikimedia.org/r/824689
[08:44:50] <icinga-wm>	 RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:45:52] <icinga-wm>	 RECOVERY - SSH on wdqs1016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:46:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P32565 and previous config saved to /var/cache/conftool/dbconfig/20220819-084618-marostegui.json
[08:46:46] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:47:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "np, response inline" [puppet] - 10https://gerrit.wikimedia.org/r/824299 (https://phabricator.wikimedia.org/T314936) (owner: 10Andrea Denisse)
[08:47:44] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36836/console" [puppet] - 10https://gerrit.wikimedia.org/r/824689 (owner: 10Clément Goubert)
[08:50:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] bullseye: add thirdparty/elasticsearch-curator5 [puppet] - 10https://gerrit.wikimedia.org/r/824568 (https://phabricator.wikimedia.org/T315604) (owner: 10Ryan Kemper)
[08:51:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/824689 (owner: 10Clément Goubert)
[08:51:35] <wikibugs>	 (03CR) 10Jbond: "duplicat https://gerrit.wikimedia.org/r/c/operations/puppet/+/824568" [puppet] - 10https://gerrit.wikimedia.org/r/824569 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking)
[08:52:18] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] O:phabricator: move common settings to role hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824412 (owner: 10Jbond)
[08:52:49] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] pcc: Encode jenkins username to utf-8 [puppet] - 10https://gerrit.wikimedia.org/r/824209 (owner: 10Clément Goubert)
[08:52:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/824572 (https://phabricator.wikimedia.org/T315429) (owner: 10Cathal Mooney)
[08:53:14] <wikibugs>	 (03CR) 10Jbond: phabricator: move lvs::realserver inclusion to profile, create use_lvs parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[08:54:45] <wikibugs>	 (03PS1) 10ArielGlenn: add php7.4 install to the snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/824690 (https://phabricator.wikimedia.org/T271736)
[08:55:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] postgresql: default to autodetecting pg version [cookbooks] - 10https://gerrit.wikimedia.org/r/824486 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi)
[08:56:13] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade.
[08:56:55] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons.
[08:57:50] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] ceph: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823667 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[08:59:09] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] global: add inventory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:01:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312972)', diff saved to https://phabricator.wikimedia.org/P32566 and previous config saved to /var/cache/conftool/dbconfig/20220819-090124-marostegui.json
[09:01:26] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[09:01:29] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[09:01:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[09:01:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32567 and previous config saved to /var/cache/conftool/dbconfig/20220819-090146-marostegui.json
[09:02:03] <wikibugs>	 (03CR) 10Hashar: doc: properly redirect back compat URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar)
[09:02:28] <icinga-wm>	 RECOVERY - SSH on wdqs1014 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:03:34] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1014 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:04:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:04:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32568 and previous config saved to /var/cache/conftool/dbconfig/20220819-090456-marostegui.json
[09:05:43] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] Openstack: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823666 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:07:30] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:50] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Occasional high ICMP probe response from codfw to cr2-drmrs - https://phabricator.wikimedia.org/T315645 (10cmooney) p:05Triage→03Low
[09:09:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:09:15] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Occasional high ICMP probe response from codfw to cr2-drmrs - https://phabricator.wikimedia.org/T315645 (10cmooney) a:03cmooney
[09:10:03] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add include statement for 2001:df2:e500:fe07::/64 reverse entries [dns] - 10https://gerrit.wikimedia.org/r/824572 (https://phabricator.wikimedia.org/T315429) (owner: 10Cathal Mooney)
[09:11:43] <topranks>	 !log running authdns-update on auth1001 to add new include to 0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa. zone
[09:11:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:20] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[09:14:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Openstack: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823666 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:14:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ceph: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823667 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:16:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite)
[09:16:58] <wikibugs>	 (03PS5) 10FNegri: global: add inventory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:17:14] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:18:56] <wikibugs>	 (03PS4) 10FNegri: Openstack: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823666 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:19:52] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] global: add inventory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:20:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P32569 and previous config saved to /var/cache/conftool/dbconfig/20220819-092002-marostegui.json
[09:20:29] <wikibugs>	 (03PS5) 10FNegri: ceph: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823667 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:20:43] <wikibugs>	 (03PS5) 10FNegri: ceph: use human-readable names for ceph clusters [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823668 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:20:50] <wikibugs>	 (03PS5) 10FNegri: ceph: use the correct codfw ceph mon hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823669 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:20:57] <wikibugs>	 (03PS5) 10FNegri: ceph,opensatck: use the inventory to get the nodes domain [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823670 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:21:03] <wikibugs>	 (03PS5) 10FNegri: ceph: add roll_restart_osd_daemons cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823671 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:21:09] <wikibugs>	 (03PS7) 10FNegri: ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824149 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:21:27] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:21:53] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:22:05] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] ceph: use human-readable names for ceph clusters [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823668 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:22:25] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] ceph: use the correct codfw ceph mon hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823669 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:22:36] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] ceph,opensatck: use the inventory to get the nodes domain [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823670 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:22:47] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] ceph: add roll_restart_osd_daemons cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823671 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:23:24] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons.
[09:26:05] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824149 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:26:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Netbox DNS changes are not updating - https://phabricator.wikimedia.org/T315630 (10cmooney) 05Open→03Resolved Ok got the +1 and merged.  All is good now.  ` cmooney@cumin1001:~$ dig  +short A kubernetes2024.codfw.wmnet @ns0.wikimedia.org 10.192.48.87 cmooney...
[09:26:21] <wikibugs>	 (03Merged) 10jenkins-bot: global: add inventory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:27:08] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:27:24] <wikibugs>	 (03Merged) 10jenkins-bot: Openstack: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823666 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:28:24] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] cloud: reformat cloud.yaml with prettier [puppet] - 10https://gerrit.wikimedia.org/r/824421 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:29:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Occasional high ICMP probe response from codfw to cr2-drmrs - https://phabricator.wikimedia.org/T315645 (10cmooney)
[09:33:07] <wikibugs>	 (03Merged) 10jenkins-bot: ceph: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823667 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:33:09] <wikibugs>	 (03Merged) 10jenkins-bot: ceph: use human-readable names for ceph clusters [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823668 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:34:49] <wikibugs>	 (03Merged) 10jenkins-bot: ceph: use the correct codfw ceph mon hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823669 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:34:51] <wikibugs>	 (03Merged) 10jenkins-bot: ceph,opensatck: use the inventory to get the nodes domain [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823670 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:35:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P32570 and previous config saved to /var/cache/conftool/dbconfig/20220819-093508-marostegui.json
[09:36:40] <wikibugs>	 (03PS1) 10Vgutierrez: mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692
[09:39:20] <wikibugs>	 (03Merged) 10jenkins-bot: ceph: add roll_restart_osd_daemons cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823671 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:40:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692 (owner: 10Vgutierrez)
[09:41:26] <wikibugs>	 (03Merged) 10jenkins-bot: ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824149 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[09:45:14] <wikibugs>	 (03PS1) 10Btullis: Add the necessary configuration to enable the dse-k8s control plane [puppet] - 10https://gerrit.wikimedia.org/r/824694 (https://phabricator.wikimedia.org/T310196)
[09:46:58] <wikibugs>	 (03PS1) 10Btullis: Add dummy tokens for dse_k8s cluster [labs/private] - 10https://gerrit.wikimedia.org/r/824695 (https://phabricator.wikimedia.org/T310196)
[09:47:00] <wikibugs>	 (03PS1) 10Jbond: R:systemd::sysuser: drop managehome parameter as it dosn;t work [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568)
[09:47:19] <wikibugs>	 (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy tokens for dse_k8s cluster [labs/private] - 10https://gerrit.wikimedia.org/r/824695 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[09:49:03] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36838/console" [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond)
[09:50:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32571 and previous config saved to /var/cache/conftool/dbconfig/20220819-095014-marostegui.json
[09:50:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[09:50:19] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[09:50:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[09:50:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32572 and previous config saved to /var/cache/conftool/dbconfig/20220819-095035-marostegui.json
[09:51:24] <wikibugs>	 (03PS2) 10Vgutierrez: mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692
[09:53:00] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] Add new admin_ng values for the dse-k8s-eqiad cluster (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[09:55:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692 (owner: 10Vgutierrez)
[09:55:57] <vgutierrez>	 uhl... that's passing locally
[09:56:41] <vgutierrez>	 of course... I offended pep8 with a 102 chars long line.. I should be executed right now
[09:57:56] <wikibugs>	 (03PS3) 10Vgutierrez: mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692
[10:00:58] <wikibugs>	 (03PS1) 10Btullis: get_cert [puppet] - 10https://gerrit.wikimedia.org/r/824697
[10:02:59] <wikibugs>	 (03PS3) 10Filippo Giunchedi: WIP dispatch: add database role [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229)
[10:03:01] <wikibugs>	 (03PS3) 10Filippo Giunchedi: WIP: add profile::dispatch [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229)
[10:03:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: WIP dispatch: add database role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi)
[10:05:05] <wikibugs>	 (03PS2) 10Btullis: Add the necessary configuration to enable the dse-k8s control plane [puppet] - 10https://gerrit.wikimedia.org/r/824694 (https://phabricator.wikimedia.org/T310196)
[10:05:31] <wikibugs>	 (03Abandoned) 10Btullis: get_cert [puppet] - 10https://gerrit.wikimedia.org/r/824697 (owner: 10Btullis)
[10:06:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T314041)', diff saved to https://phabricator.wikimedia.org/P32573 and previous config saved to /var/cache/conftool/dbconfig/20220819-100633-ladsgroup.json
[10:06:38] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[10:13:08] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:13:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32574 and previous config saved to /var/cache/conftool/dbconfig/20220819-101348-marostegui.json
[10:13:53] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[10:21:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P32575 and previous config saved to /var/cache/conftool/dbconfig/20220819-102139-ladsgroup.json
[10:27:52] <wikibugs>	 (03PS1) 10Btullis: Add dummy infrastructure_users for dse-k8s cluster [labs/private] - 10https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196)
[10:28:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P32576 and previous config saved to /var/cache/conftool/dbconfig/20220819-102854-marostegui.json
[10:32:44] <wikibugs>	 (03PS2) 10Btullis: Add dummy infrastructure_users for dse-k8s cluster [labs/private] - 10https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196)
[10:34:00] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:36:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P32577 and previous config saved to /var/cache/conftool/dbconfig/20220819-103645-ladsgroup.json
[10:37:10] <wikibugs>	 (03CR) 10Btullis: "I'm not sure exactly which infrastructure_users I should add the to dse_k8s block here, so I have taken best guess." [labs/private] - 10https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[10:44:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P32578 and previous config saved to /var/cache/conftool/dbconfig/20220819-104400-marostegui.json
[10:48:44] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] icinga: add cgoubert to the right groups in icinga [puppet] - 10https://gerrit.wikimedia.org/r/824689 (owner: 10Clément Goubert)
[10:51:37] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: disable shipping logs to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/824703
[10:51:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T314041)', diff saved to https://phabricator.wikimedia.org/P32579 and previous config saved to /var/cache/conftool/dbconfig/20220819-105151-ladsgroup.json
[10:51:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[10:51:55] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[10:52:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[10:52:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T314041)', diff saved to https://phabricator.wikimedia.org/P32580 and previous config saved to /var/cache/conftool/dbconfig/20220819-105212-ladsgroup.json
[10:54:40] <wikibugs>	 (03PS1) 10Btullis: Add etcd data for dse-k8s kubeserver-api backend selection. [puppet] - 10https://gerrit.wikimedia.org/r/824705 (https://phabricator.wikimedia.org/T310172)
[10:54:48] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:56:21] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add dummy infrastructure_users for dse-k8s cluster (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[10:59:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32581 and previous config saved to /var/cache/conftool/dbconfig/20220819-105906-marostegui.json
[10:59:08] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[10:59:10] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[10:59:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[10:59:26] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[10:59:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[10:59:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32582 and previous config saved to /var/cache/conftool/dbconfig/20220819-105934-marostegui.json
[11:01:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32583 and previous config saved to /var/cache/conftool/dbconfig/20220819-110145-marostegui.json
[11:02:40] <wikibugs>	 (03CR) 10Btullis: Add dummy infrastructure_users for dse-k8s cluster (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[11:11:36] <wikibugs>	 (03PS2) 10Jbond: R:systemd::sysuser: drop managehome parameter as it dosn;t work [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568)
[11:15:41] <wikibugs>	 (03PS3) 10Jbond: R:systemd::sysuser: drop managehome parameter as it dosn;t work [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568)
[11:16:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P32584 and previous config saved to /var/cache/conftool/dbconfig/20220819-111651-marostegui.json
[11:17:14] <icinga-wm>	 PROBLEM - Check systemd state on db2114 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:19:10] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] icinga: add cgoubert to the right groups in icinga (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824689 (owner: 10Clément Goubert)
[11:20:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] R:systemd::sysuser: drop managehome parameter as it dosn;t work [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond)
[11:31:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P32586 and previous config saved to /var/cache/conftool/dbconfig/20220819-113157-marostegui.json
[11:32:54] <wikibugs>	 (03PS1) 10Jbond: C:vopsbot: correct data dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/824713
[11:34:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:vopsbot: correct data dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/824713 (owner: 10Jbond)
[11:42:23] <wikibugs>	 10SRE, 10SRE-OnFire, 10Observability-Alerting: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10jbond)
[11:43:55] <wikibugs>	 (03CR) 10Jbond: P:systemd::timesyncd: allow overriding the protectsystem systemd param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: 10Jbond)
[11:47:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312972)', diff saved to https://phabricator.wikimedia.org/P32587 and previous config saved to /var/cache/conftool/dbconfig/20220819-114703-marostegui.json
[11:47:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[11:47:08] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[11:47:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Occasional high ICMP probe response from codfw to cr2-drmrs - https://phabricator.wikimedia.org/T315645 (10cmooney) Ok so I got some results back.  Firstly 10,000 pings to cr1-drmrs from bast2002, starting at 08:14 UTC.  Average RTT was 118ms, worst was 154ms: `...
[11:47:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[11:47:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[11:47:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[11:47:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 11 hosts with reason: Maintenance
[11:48:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 11 hosts with reason: Maintenance
[11:49:20] <wikibugs>	 (03PS4) 10David Caro: p:ceph::osd: get the os disks by size [puppet] - 10https://gerrit.wikimedia.org/r/824422 (https://phabricator.wikimedia.org/T314870)
[11:49:46] <wikibugs>	 (03CR) 10David Caro: p:ceph::osd: get the os disks by size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824422 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[11:50:04] <icinga-wm>	 RECOVERY - Check systemd state on db2114 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:52] <wikibugs>	 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Allow jumbo frames between cloud hosts in production realm - https://phabricator.wikimedia.org/T315446 (10dcaro) That seemed to do the trick yes! Thanks!
[11:56:02] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:59:40] <wikibugs>	 10SRE, 10SRE-OnFire, 10Observability-Alerting: vopsbot's home directory doesn't get created - https://phabricator.wikimedia.org/T315568 (10fgiunchedi) Thank you for the followup! LGTM and working as expected now
[12:21:26] <wikibugs>	 (03PS1) 10Btullis: Add a new signing profile for the dse_k8s cfssl-issuer [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196)
[12:23:33] <wikibugs>	 (03PS2) 10Krinkle: Switch $wgChronologyProtectorStash to "mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824556 (https://phabricator.wikimedia.org/T314453) (owner: 10Aaron Schulz)
[12:24:02] <wikibugs>	 (03CR) 10Btullis: "I have updated the documentation a little here: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#File_cfssl-issuer-values.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[12:28:20] <wikibugs>	 (03PS1) 10Btullis: Add a dummy auth_key for the dse_k8s cluster cfssl-issuer [labs/private] - 10https://gerrit.wikimedia.org/r/824725 (https://phabricator.wikimedia.org/T310196)
[12:32:43] <wikibugs>	 (03PS3) 10Btullis: Add the necessary configuration to enable the dse-k8s control plane [puppet] - 10https://gerrit.wikimedia.org/r/824694 (https://phabricator.wikimedia.org/T310196)
[12:35:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[12:37:41] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] Switch $wgChronologyProtectorStash to "mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824556 (https://phabricator.wikimedia.org/T314453) (owner: 10Aaron Schulz)
[12:39:11] <wikibugs>	 (03Merged) 10jenkins-bot: Switch $wgChronologyProtectorStash to "mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824556 (https://phabricator.wikimedia.org/T314453) (owner: 10Aaron Schulz)
[12:43:01] <wikibugs>	 (03PS1) 10Jelto: gitlab: use actual backup name instead of latest on replica [puppet] - 10https://gerrit.wikimedia.org/r/824730 (https://phabricator.wikimedia.org/T274463)
[12:44:46] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I0c45b657d9ee7efe (duration: 03m 24s)
[12:45:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[12:46:43] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36842/console" [puppet] - 10https://gerrit.wikimedia.org/r/824730 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto)
[12:46:45] <wikibugs>	 (03PS2) 10Btullis: Add new admin_ng values for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196)
[12:49:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[12:49:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[12:50:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[12:55:16] <wikibugs>	 10SRE, 10Platform Engineering, 10serviceops, 10Performance-Team (Radar): Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 (10Krinkle)
[12:59:30] <wikibugs>	 10SRE, 10serviceops: Move "redis_sessions" to "redis_misc" cluster - https://phabricator.wikimedia.org/T280586 (10Krinkle) 05Open→03Declined Declining as it has been obsoleted. With T314453 done, the last consumer is gone from this. There are now no references left in wmf-config to the redis_sessions clust...
[12:59:34] <wikibugs>	 10SRE, 10Platform Engineering, 10serviceops, 10Performance-Team (Radar): Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 (10Krinkle)
[13:05:03] <wikibugs>	 (03PS1) 10Krinkle: redis: Remove references to nutcracker and redis_sessions cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824734 (https://phabricator.wikimedia.org/T267581)
[13:05:24] <wikibugs>	 (03PS2) 10Krinkle: redis: Remove references to nutcracker and redis_sessions cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824734 (https://phabricator.wikimedia.org/T267581)
[13:08:42] <wikibugs>	 10SRE, 10Commons, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, and 6 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Lydia_Pintscher)
[13:09:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[13:09:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[13:10:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 9 hosts with reason: Maintenance
[13:10:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 9 hosts with reason: Maintenance
[13:10:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[13:11:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[13:11:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[13:11:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[13:11:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T312972)', diff saved to https://phabricator.wikimedia.org/P32588 and previous config saved to /var/cache/conftool/dbconfig/20220819-131139-marostegui.json
[13:11:44] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[13:11:58] <wikibugs>	 (03PS1) 10Krinkle: Remove references to now-empty redis.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824736 (https://phabricator.wikimedia.org/T267581)
[13:12:00] <wikibugs>	 (03PS1) 10Krinkle: redis: Remove now-empty and unreferenced redis.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824737 (https://phabricator.wikimedia.org/T267581)
[13:14:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T312972)', diff saved to https://phabricator.wikimedia.org/P32589 and previous config saved to /var/cache/conftool/dbconfig/20220819-131359-marostegui.json
[13:16:05] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692 (owner: 10Vgutierrez)
[13:19:21] <wikibugs>	 (03PS1) 10Jelto: gitlab: rotate backups on replica [puppet] - 10https://gerrit.wikimedia.org/r/824739 (https://phabricator.wikimedia.org/T274463)
[13:21:43] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] mtail:atsbackend: Provide full histograms for cache read/write time [puppet] - 10https://gerrit.wikimedia.org/r/824692 (owner: 10Vgutierrez)
[13:25:04] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:25:13] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] P:systemd::timesyncd: allow overriding the protectsystem systemd param [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: 10Jbond)
[13:27:32] <wikibugs>	 (03PS2) 10Jelto: gitlab: rotate backups on replica [puppet] - 10https://gerrit.wikimedia.org/r/824739 (https://phabricator.wikimedia.org/T274463)
[13:28:36] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] "Looks reasonably safe, and probably safer than what we have now. ;)" [puppet] - 10https://gerrit.wikimedia.org/r/824422 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[13:29:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P32590 and previous config saved to /var/cache/conftool/dbconfig/20220819-132905-marostegui.json
[13:30:34] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] ceph::osd: add new disks model to disable write caches for [puppet] - 10https://gerrit.wikimedia.org/r/824423 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[13:31:20] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T315344 (10Cmjohnson) 05Open→03Declined well aware of this
[13:32:06] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: dbprov1002 lost power redundancy - https://phabricator.wikimedia.org/T315439 (10Cmjohnson) this is a loose power cable, I was in this rack trying to adjust power because it's alerting. I will fix this today.
[13:34:10] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36844/console" [puppet] - 10https://gerrit.wikimedia.org/r/824739 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto)
[13:37:22] <wikibugs>	 (03PS1) 10Jdrewniak: Add back fixed width to main content [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824441 (https://phabricator.wikimedia.org/T315653)
[13:44:05] <jan_drewniak>	 👋 happy Friday folks,  it looks like the Web team has a bit of an emergency deploy situation on our hands https://phabricator.wikimedia.org/T315653 
[13:44:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P32591 and previous config saved to /var/cache/conftool/dbconfig/20220819-134411-marostegui.json
[13:45:06] <marostegui>	 !log Install 10.4.26 on db2111 db2148 db2124
[13:45:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:31] <RhinosF1>	 jouncebot: now
[13:45:31] <jouncebot>	 For the next 17 hour(s) and 14 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220819T0700)
[13:46:35] <RhinosF1>	 thcipriani, dancy, ^demon: please see jan_drewniak's comment as Deployment/Emergencies requires releng
[13:55:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Netbox DNS changes are not updating - https://phabricator.wikimedia.org/T315630 (10Papaul) @cmooney thank you my changes are now on the DNS server.
[13:56:35] <wikibugs>	 (03PS1) 10Jforrester: TranslatableBundleLogFormatter: Cast reason to string before passing it [extensions/Translate] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824442 (https://phabricator.wikimedia.org/T315657)
[13:57:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2024.mgmt.codfw.wmnet with reboot policy FORCED
[13:59:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T312972)', diff saved to https://phabricator.wikimedia.org/P32592 and previous config saved to /var/cache/conftool/dbconfig/20220819-135917-marostegui.json
[13:59:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[13:59:22] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[13:59:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[13:59:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:59:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:59:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T312972)', diff saved to https://phabricator.wikimedia.org/P32593 and previous config saved to /var/cache/conftool/dbconfig/20220819-135956-marostegui.json
[14:01:14] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:02:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T312972)', diff saved to https://phabricator.wikimedia.org/P32594 and previous config saved to /var/cache/conftool/dbconfig/20220819-140216-marostegui.json
[14:04:55] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2024.mgmt.codfw.wmnet with reboot policy FORCED
[14:13:00] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2024']
[14:17:20] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:17:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P32595 and previous config saved to /var/cache/conftool/dbconfig/20220819-141722-marostegui.json
[14:19:53] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2024']
[14:21:24] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10Papaul)
[14:22:55] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs_backup_instances.yaml: move all VM backups to new cloudbackup hosts [puppet] - 10https://gerrit.wikimedia.org/r/824747 (https://phabricator.wikimedia.org/T302535)
[14:23:55] <wikibugs>	 (03CR) 10Herron: "Very nice!  LGTM pending followup on Filippo's comments" [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite)
[14:25:53] <wikibugs>	 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Metrics, 10observability, and 4 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10ori)
[14:27:11] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs_backup_instances.yaml: move all VM backups to new cloudbackup hosts [puppet] - 10https://gerrit.wikimedia.org/r/824747 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott)
[14:28:36] <wikibugs>	 (03PS1) 10Papaul: Add kubernetes202[34] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/824748 (https://phabricator.wikimedia.org/T313870)
[14:29:08] <wikibugs>	 (03PS3) 10Btullis: Add new admin_ng values for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196)
[14:30:36] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Add kubernetes202[34] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/824748 (https://phabricator.wikimedia.org/T313870) (owner: 10Papaul)
[14:32:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P32596 and previous config saved to /var/cache/conftool/dbconfig/20220819-143228-marostegui.json
[14:32:33] <wikibugs>	 (03CR) 10Herron: WIP dispatch: add database role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi)
[14:33:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2023.codfw.wmnet with OS bullseye
[14:34:00] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2023.codfw.wmnet with OS bullseye
[14:40:50] <dancy>	 jan_drewniak: I am available to help w/ the emergency deployment 
[14:45:13] <jan_drewniak>	 dancy: that would be super great, it's a one-line CSS fix but our layout is kinda broken without it https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/824441 we didn't notice earlier because of the train rollback 
[14:45:26] <dancy>	 ok.. If you're ready we can do it now.
[14:45:44] <jan_drewniak>	 dancy: that would be great!
[14:45:59] <dancy>	 Starting
[14:46:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824441 (https://phabricator.wikimedia.org/T315653) (owner: 10Jdrewniak)
[14:47:10] <wikibugs>	 (03CR) 10Btullis: Add new admin_ng values for the dse-k8s-eqiad cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[14:47:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T312972)', diff saved to https://phabricator.wikimedia.org/P32597 and previous config saved to /var/cache/conftool/dbconfig/20220819-144734-marostegui.json
[14:47:36] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[14:47:39] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[14:47:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[14:47:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32598 and previous config saved to /var/cache/conftool/dbconfig/20220819-144755-marostegui.json
[14:48:43] <wikibugs>	 (03CR) 10Btullis: Add new admin_ng values for the dse-k8s-eqiad cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[14:48:46] <logmsgbot>	 !log dancy@deploy1002 backport aborted:  (duration: 03m 01s)
[14:48:51] <dancy>	 ^(ignore that)
[14:50:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824441 (https://phabricator.wikimedia.org/T315653) (owner: 10Jdrewniak)
[14:50:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32599 and previous config saved to /var/cache/conftool/dbconfig/20220819-145015-marostegui.json
[14:52:26] <wikibugs>	 (03PS1) 10Cwhite: beta-logs: dlq use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824751 (https://phabricator.wikimedia.org/T305175)
[14:52:28] <wikibugs>	 (03PS1) 10Cwhite: logstash: dlq use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824752 (https://phabricator.wikimedia.org/T305175)
[14:52:30] <wikibugs>	 (03PS1) 10Cwhite: beta-logs: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824753 (https://phabricator.wikimedia.org/T305175)
[14:52:32] <wikibugs>	 (03PS1) 10Cwhite: logstash: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824754 (https://phabricator.wikimedia.org/T305175)
[14:52:33] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2024.codfw.wmnet with OS bullseye
[14:52:44] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2024.codfw.wmnet with OS bullseye
[14:53:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] beta-logs: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824753 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite)
[14:54:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] logstash: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824754 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite)
[14:54:42] <wikibugs>	 10SRE, 10ops-codfw: Recycling Pickup for CODFW - https://phabricator.wikimedia.org/T307694 (10Papaul) 05Open→03Resolved All the Netbox entries deleted
[14:55:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2023.codfw.wmnet with reason: host reimage
[14:56:16] <wikibugs>	 (03PS2) 10Cwhite: beta-logs: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824753 (https://phabricator.wikimedia.org/T305175)
[14:56:41] <wikibugs>	 (03PS2) 10Cwhite: logstash: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824754 (https://phabricator.wikimedia.org/T305175)
[14:59:20] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2023.codfw.wmnet with reason: host reimage
[15:00:14] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10Krinkle)
[15:00:31] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10Krinkle)
[15:04:04] <wikibugs>	 (03PS4) 10Cwhite: tcpircbot: send !log events to log stream [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861)
[15:04:24] <wikibugs>	 (03Merged) 10jenkins-bot: Add back fixed width to main content [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824441 (https://phabricator.wikimedia.org/T315653) (owner: 10Jdrewniak)
[15:04:30] <wikibugs>	 (03CR) 10Cwhite: tcpircbot: send !log events to log stream (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite)
[15:04:53] <logmsgbot>	 !log dancy@deploy1002 Started scap: Backport for [[gerrit:824441|Add back fixed width to main content (T315653)]]
[15:04:57] <stashbot>	 T315653: Regression: fixed width broken on Vector (2022) - https://phabricator.wikimedia.org/T315653
[15:05:05] <jan_drewniak>	 dancy: 15 min later, finally merged :P 
[15:05:21] <dancy>	 yeah.. :-/
[15:05:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P32600 and previous config saved to /var/cache/conftool/dbconfig/20220819-150521-marostegui.json
[15:05:52] <dancy>	 jan_drewniak: Alright.  Your change is on mwdebug.  Test is out
[15:05:54] <dancy>	 *it
[15:07:12] <wikibugs>	 (03PS3) 10Cwhite: logstash: add support for rsyslog-namespaced fields [puppet] - 10https://gerrit.wikimedia.org/r/824314 (https://phabricator.wikimedia.org/T315500)
[15:07:58] <jan_drewniak>	 dancy: yup! that's definitely fixed! good to sync :) 
[15:08:08] <dancy>	 Excellent. Proceeding
[15:10:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T314041)', diff saved to https://phabricator.wikimedia.org/P32601 and previous config saved to /var/cache/conftool/dbconfig/20220819-151053-ladsgroup.json
[15:10:58] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[15:11:02] <wikibugs>	 10SRE, 10Performance-Team, 10Thumbor, 10Sustainability (Incident Followup): Lower per-IP PoolCounter throttling Thumbor settings - https://phabricator.wikimedia.org/T252426 (10Krinkle)
[15:11:52] <logmsgbot>	 !log dancy@deploy1002 Finished scap: Backport for [[gerrit:824441|Add back fixed width to main content (T315653)]] (duration: 06m 59s)
[15:11:56] <stashbot>	 T315653: Regression: fixed width broken on Vector (2022) - https://phabricator.wikimedia.org/T315653
[15:12:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:12:09] <dancy>	 jan_drewniak: Done
[15:12:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2024.codfw.wmnet with reason: host reimage
[15:12:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:12:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:13:26] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10Krinkle)
[15:13:44] <wikibugs>	 10SRE, 10Wikimedia-Logstash, 10observability, 10Sustainability (Incident Followup), 10User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (10Krinkle)
[15:14:15] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2023.codfw.wmnet with OS bullseye
[15:14:25] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2023.codfw.wmnet with OS bullseye completed: - kub...
[15:14:26] <jan_drewniak>	 dancy: excellent! thank you so much! and sorry for bugging you on a friday :P 
[15:14:31] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2009 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[15:14:33] <dancy>	 No problem
[15:16:11] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2024.codfw.wmnet with reason: host reimage
[15:16:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:20:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P32602 and previous config saved to /var/cache/conftool/dbconfig/20220819-152027-marostegui.json
[15:23:06] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.14.0" for 556 hosts
[15:24:23] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10Papaul)
[15:25:02] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.14.0" completed for 556 hosts
[15:25:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites - https://phabricator.wikimedia.org/T260449 (10Krinkle)
[15:26:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P32603 and previous config saved to /var/cache/conftool/dbconfig/20220819-152559-ladsgroup.json
[15:26:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Sustainability (Incident Followup): clean up workaround and measurements put in place during Jio RPKI error - https://phabricator.wikimedia.org/T260452 (10Krinkle)
[15:27:59] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2024.codfw.wmnet with OS bullseye
[15:28:07] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2024.codfw.wmnet with OS bullseye completed: - kub...
[15:35:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32604 and previous config saved to /var/cache/conftool/dbconfig/20220819-153533-marostegui.json
[15:35:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[15:35:38] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[15:35:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[15:35:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32605 and previous config saved to /var/cache/conftool/dbconfig/20220819-153554-marostegui.json
[15:36:54] <thcipriani>	 thanks for handling the emergency deploy dancy !
[15:37:10] <dancy>	 👍🏾
[15:37:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32606 and previous config saved to /var/cache/conftool/dbconfig/20220819-153714-marostegui.json
[15:37:43] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (Papaul)
[15:38:28] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (Papaul) Open→Resolved @akosiaris all yours
[15:41:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P32607 and previous config saved to /var/cache/conftool/dbconfig/20220819-154105-ladsgroup.json
[15:43:54] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2009 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[15:52:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P32608 and previous config saved to /var/cache/conftool/dbconfig/20220819-155220-marostegui.json
[15:56:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T314041)', diff saved to https://phabricator.wikimedia.org/P32609 and previous config saved to /var/cache/conftool/dbconfig/20220819-155611-ladsgroup.json
[15:56:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[15:56:16] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[15:56:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[16:07:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P32610 and previous config saved to /var/cache/conftool/dbconfig/20220819-160726-marostegui.json
[16:10:30] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (cmooney) p:High→Low Thanks yep case opened with JTAC now will keep it open to document any information they may provide.
[16:22:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32611 and previous config saved to /var/cache/conftool/dbconfig/20220819-162232-marostegui.json
[16:22:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[16:22:37] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[16:22:47] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[16:22:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32612 and previous config saved to /var/cache/conftool/dbconfig/20220819-162253-marostegui.json
[16:25:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32613 and previous config saved to /var/cache/conftool/dbconfig/20220819-162513-marostegui.json
[16:29:22] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:30:23] <wikibugs>	 (CR) Klausman: [C: +1] Add etcd data for dse-k8s kubeserver-api backend selection. [puppet] - https://gerrit.wikimedia.org/r/824705 (https://phabricator.wikimedia.org/T310172) (owner: Btullis)
[16:31:36] <wikibugs>	 (CR) Klausman: [C: +1] Add dummy infrastructure_users for dse-k8s cluster (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[16:35:27] <wikibugs>	 (CR) Herron: [C: +1] tcpircbot: send !log events to log stream [puppet] - https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) (owner: Cwhite)
[16:38:55] <wikibugs>	 (PS1) Majavah: P:openstack::codfw1dev::db: support TLS [puppet] - https://gerrit.wikimedia.org/r/824764 (https://phabricator.wikimedia.org/T310795)
[16:38:57] <wikibugs>	 (PS1) Majavah: P:openstack::codfw1dev::db: cleanup firewall rules [puppet] - https://gerrit.wikimedia.org/r/824765
[16:40:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P32614 and previous config saved to /var/cache/conftool/dbconfig/20220819-164019-marostegui.json
[16:41:43] <wikibugs>	 (CR) Btullis: [V: +2 C: +2] Add dummy infrastructure_users for dse-k8s cluster (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/824699 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[16:46:48] <wikibugs>	 (PS1) Btullis: Add a dummy certificate for dse_k8s [labs/private] - https://gerrit.wikimedia.org/r/824767 (https://phabricator.wikimedia.org/T310196)
[16:51:52] <wikibugs>	 (CR) Btullis: [V: +2 C: +2] Add a dummy certificate for dse_k8s [labs/private] - https://gerrit.wikimedia.org/r/824767 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[16:52:47] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36848/console" [puppet] - https://gerrit.wikimedia.org/r/824694 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[16:55:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P32615 and previous config saved to /var/cache/conftool/dbconfig/20220819-165525-marostegui.json
[16:59:41] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P:openstack::codfw1dev::db: support TLS [puppet] - https://gerrit.wikimedia.org/r/824764 (https://phabricator.wikimedia.org/T310795) (owner: Majavah)
[17:10:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T312972)', diff saved to https://phabricator.wikimedia.org/P32616 and previous config saved to /var/cache/conftool/dbconfig/20220819-171031-marostegui.json
[17:10:33] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[17:10:37] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[17:10:47] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[17:10:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T312972)', diff saved to https://phabricator.wikimedia.org/P32617 and previous config saved to /var/cache/conftool/dbconfig/20220819-171052-marostegui.json
[17:15:20] <wikibugs>	 (PS1) Isaac Johnson: Addition of Varnish logic for setting include_pv cookie on x-analytics and WMF-DP on client. [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676)
[17:15:57] <wikibugs>	 (CR) CI reject: [V: -1] Addition of Varnish logic for setting include_pv cookie on x-analytics and WMF-DP on client. [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[17:16:54] <wikibugs>	 (CR) Isaac Johnson: "Differential Privacy patch" [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[17:22:51] <wikibugs>	 (PS2) Isaac Johnson: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676)
[17:23:27] <wikibugs>	 (CR) CI reject: [V: -1] Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[17:34:45] <wikibugs>	 SRE, ops-codfw, DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (RobH) @Papaul,  Please note we've ordered (2) replacement SSDs, and the one sourced from Amazon will ship directly to your home (due to Amazon/USPS not being able to deliver to datacenters.)  Please test out the...
[17:36:20] <icinga-wm>	 PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:42:29] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (Ottomata) Thank you!  We'll need {T314156} before we can proceed.  Thanks!
[17:43:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T312972)', diff saved to https://phabricator.wikimedia.org/P32618 and previous config saved to /var/cache/conftool/dbconfig/20220819-174317-marostegui.json
[17:43:22] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[17:44:56] <wikibugs>	 SRE, Infrastructure-Foundations, Observability-Metrics, netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (CDanis) Looks like this is resolved...?
[17:50:36] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[17:58:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P32619 and previous config saved to /var/cache/conftool/dbconfig/20220819-175823-marostegui.json
[18:07:58] <wikibugs>	 SRE, SRE-OnFire, Observability-Alerting: vopsbot's home directory doesn't get created - https://phabricator.wikimedia.org/T315568 (Dzahn) thanks @jbond, ack. works for me:)
[18:13:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P32620 and previous config saved to /var/cache/conftool/dbconfig/20220819-181329-marostegui.json
[18:14:22] <wikibugs>	 (CR) Dzahn: [C: +2] gerrit: update style for Gerrit 3.5 [puppet] - https://gerrit.wikimedia.org/r/824221 (https://phabricator.wikimedia.org/T315445) (owner: Hashar)
[18:21:25] <wikibugs>	 (CR) Dzahn: [C: +1] gitlab: use actual backup name instead of latest on replica [puppet] - https://gerrit.wikimedia.org/r/824730 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[18:27:48] <wikibugs>	 (PS1) CDanis: bump excess concurrency tracking threshold [puppet] - https://gerrit.wikimedia.org/r/824775 (https://phabricator.wikimedia.org/T306580)
[18:28:08] <wikibugs>	 (CR) CDanis: [C: +2] bump excess concurrency tracking threshold [puppet] - https://gerrit.wikimedia.org/r/824775 (https://phabricator.wikimedia.org/T306580) (owner: CDanis)
[18:28:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T312972)', diff saved to https://phabricator.wikimedia.org/P32621 and previous config saved to /var/cache/conftool/dbconfig/20220819-182835-marostegui.json
[18:28:37] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[18:28:40] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[18:29:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[18:29:49] <wikibugs>	 (CR) Dzahn: [C: +1] gitlab: rotate backups on replica [puppet] - https://gerrit.wikimedia.org/r/824739 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[18:37:26] <wikibugs>	 SRE, Znuny, serviceops: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (Arnoldokoth) Hey @Kormat I was wondering if this was resolved. I noticed the file now references some vrts passwords. ` # less modules/profile/templates/mariadb/grants/production-...
[18:37:38] <icinga-wm>	 RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:37:48] <wikibugs>	 (PS3) Dzahn: Revert "Revert "site: add phabricator role to phab2002"" [puppet] - https://gerrit.wikimedia.org/r/823636
[18:38:26] <wikibugs>	 (PS4) Majavah: puppetmaster: remove 'allow_from' [puppet] - https://gerrit.wikimedia.org/r/799859
[18:40:03] <wikibugs>	 (CR) BBlack: "Nice work, and I apologize on behalf of both the VCL and C languages!" [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[18:43:20] <wikibugs>	 (CR) BBlack: Varnish analytics: support differential privacy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[19:05:37] <wikibugs>	 (PS1) Eevans: eevans: replace 2048bit RSA key with new ed25519 one [puppet] - https://gerrit.wikimedia.org/r/824776
[19:18:52] <wikibugs>	 (CR) Dzahn: "let's try again after previous changes now removed the LVS setup" [puppet] - https://gerrit.wikimedia.org/r/823636 (owner: Dzahn)
[19:22:20] <wikibugs>	 (CR) Dzahn: [C: -1] "nope. now new issue from using systemd::sysuser. Duplicate declaration: Group[phd] is already declared" [puppet] - https://gerrit.wikimedia.org/r/823636 (owner: Dzahn)
[19:39:32] <wikibugs>	 (CR) BCornwall: [C: +1] utils: Add latency measurement program [dns] - https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) (owner: MMandere)
[19:40:05] <wikibugs>	 (CR) BCornwall: [C: +1] "Assuming the linting will be fixed, +1" [dns] - https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) (owner: MMandere)
[19:43:30] <wikibugs>	 SRE, Infrastructure-Foundations, Observability-Metrics, netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (andrea.denisse) Open→Resolved
[19:44:45] <wikibugs>	 (PS1) Dzahn: phabricator: avoid duplicate declaration mixing group{} and systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/824782 (https://phabricator.wikimedia.org/T313360)
[19:49:38] <wikibugs>	 SRE-OnFire, Beta-Cluster-Infrastructure, Sustainability (Incident Followup): Add basic alerting to the Beta Cluster - https://phabricator.wikimedia.org/T315695 (TheresNoTime)
[20:18:41] <wikibugs>	 (PS1) Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/824787
[20:19:10] <wikibugs>	 (CR) AOkoth: [C: +1] gitlab: use actual backup name instead of latest on replica [puppet] - https://gerrit.wikimedia.org/r/824730 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[20:19:56] <wikibugs>	 (CR) AOkoth: [C: +1] gitlab: rotate backups on replica [puppet] - https://gerrit.wikimedia.org/r/824739 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[20:20:07] <wikibugs>	 (CR) CI reject: [V: -1] cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/824787 (owner: Ebernhardson)
[20:22:02] <wikibugs>	 (Abandoned) Bking: bullseye: add thirdparty/elasticsearch-curator5 [puppet] - https://gerrit.wikimedia.org/r/824569 (https://phabricator.wikimedia.org/T315604) (owner: Bking)
[20:22:59] <wikibugs>	 (PS2) Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/824787
[20:23:42] <wikibugs>	 (CR) CI reject: [V: -1] cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/824787 (owner: Ebernhardson)
[20:23:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[20:23:50] <icinga-wm>	 PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:23:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[20:24:06] <wikibugs>	 (CR) Bking: [C: +2] bullseye: add thirdparty/elasticsearch-curator5 [puppet] - https://gerrit.wikimedia.org/r/824568 (https://phabricator.wikimedia.org/T315604) (owner: Ryan Kemper)
[20:24:14] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:26:26] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48534 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:27:42] <wikibugs>	 (CR) Andrew Bogott: OpenStack nova.conf: set reclaim_instance_interval to half an hour (1 comment) [puppet] - https://gerrit.wikimedia.org/r/798772 (owner: Andrew Bogott)
[20:34:44] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:50:07] <wikibugs>	 (PS1) Bking: bullseye: apt component update [puppet] - https://gerrit.wikimedia.org/r/824791 (https://phabricator.wikimedia.org/T315604)
[21:04:29] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36851/" [puppet] - https://gerrit.wikimedia.org/r/824782 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[21:05:32] <wikibugs>	 (CR) Dzahn: [C: -1] "https://gerrit.wikimedia.org/r/824782" [puppet] - https://gerrit.wikimedia.org/r/823636 (owner: Dzahn)
[21:05:55] <wikibugs>	 (PS4) Dzahn: Revert "Revert "site: add phabricator role to phab2002"" [puppet] - https://gerrit.wikimedia.org/r/823636
[21:06:39] <wikibugs>	 (CR) Dzahn: [C: +2] "noop on prod hosts" [puppet] - https://gerrit.wikimedia.org/r/824782 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[21:07:40] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/36852/" [puppet] - https://gerrit.wikimedia.org/r/823636 (owner: Dzahn)
[21:08:41] <wikibugs>	 (PS3) Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/824787
[21:09:30] <wikibugs>	 (CR) CI reject: [V: -1] cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/824787 (owner: Ebernhardson)
[21:14:27] <wikibugs>	 (PS4) Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/824787
[21:15:27] <wikibugs>	 (CR) CI reject: [V: -1] cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/824787 (owner: Ebernhardson)
[21:16:36] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (Jclark-ctr) Confirmed: Service Request 149361152 was successfully submitted.
[21:16:41] <wikibugs>	 (PS5) Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/824787
[21:25:08] <icinga-wm>	 RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:25:53] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (Papaul)
[21:26:42] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (Papaul) Open→Resolved Complete
[21:29:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:34:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:43:56] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[21:46:18] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 19 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[22:02:04] <wikibugs>	 (PS1) BCornwall: WIP: No idea what I'm doing [puppet] - https://gerrit.wikimedia.org/r/824793
[22:05:20] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 (BCornwall) @BBlack I've uploaded a patch that vaguely resembles what this ticket wants, but I have a few questions:  1. Am I even in the right galaxy with t...
[22:06:42] <wikibugs>	 (CR) Dzahn: [C: +2] "finally..it seems: https://puppet-compiler.wmflabs.org/pcc-worker1002/36852/phab2002.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/823636 (owner: Dzahn)
[22:11:39] <wikibugs>	 (CR) Dzahn: [C: +2] "unbelievable.. yet another problem and you can't even find it in compiler. "Found 1 dependency cycle"" [puppet] - https://gerrit.wikimedia.org/r/823636 (owner: Dzahn)
[22:12:07] <wikibugs>	 (CR) Dzahn: [C: +2] "Error: Found 1 dependency cycle:" [puppet] - https://gerrit.wikimedia.org/r/823636 (owner: Dzahn)
[22:13:58] <mutante>	 Error: Found 1 dependency cycle:
[22:14:00] <mutante>	 (Exec[Refresh sysusers] => User[scap] => Exec[bootstrap-scap-target] => Class[Scap] => Scap::Target[phabricator/deployment] => Package[phabricator/deployment] => Class[Phabricator::Phd] => Systemd::Sysuser[phd] => File[/etc/sysusers.d/phd.conf] => Exec[Refresh sysusers])
[22:14:05] <mutante>	 "great"
[22:23:46] <wikibugs>	 (CR) Dzahn: R:systemd::sysuser: drop managehome parameter as it dosn;t work (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: Jbond)
[22:24:22] <wikibugs>	 (CR) Dzahn: "see inline comment. I am getting a circular dependency issue when I have 2 system users created in one role on a new host:" [puppet] - https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: Jbond)
[22:31:04] <wikibugs>	 (PS1) Dzahn: phabricator: don't use systemd::sysuser on phab2002 for now [puppet] - https://gerrit.wikimedia.org/r/824796 (https://phabricator.wikimedia.org/T313360)
[22:34:55] <wikibugs>	 SRE, SRE-OnFire, Observability-Alerting: vopsbot's home directory doesn't get created - https://phabricator.wikimedia.org/T315568 (Dzahn) I am applying a role on a  new node and it creates 2 separate system users with systemd::sysuser (scap and phd) now.   In the compiler everything seemed fine but t...
[22:42:55] <wikibugs>	 (CR) Dzahn: [C: +2] "we still want this and UID 920 but we also don't want to leave puppet broken and we do want to know if the rest of the role works or there" [puppet] - https://gerrit.wikimedia.org/r/824796 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[22:48:48] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:51:05] <wikibugs>	 (CR) Dzahn: [C: +2] "after https://gerrit.wikimedia.org/r/c/operations/puppet/+/824796 it mostly works but still an issue to be solved with the sshd config. it" [puppet] - https://gerrit.wikimedia.org/r/823636 (owner: Dzahn)
[22:53:06] <icinga-wm>	 PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: ssh.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:54:31] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: ssh.service daniel_zahn new host, bug, debugging https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:55:57] <mutante>	 ^ very annoyed but it's not in prod yet and i'll debug it
[22:56:21] <mutante>	 it's because phab hosts have more than one sshd .. ..
[22:56:27] <mutante>	 or had
[22:56:56] <mutante>	 just saying because "ssh.service" looks kind of like the worst that could cause the systemd state alert
[22:59:58] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:02] <icinga-wm>	 PROBLEM - SSH on phab2002 is CRITICAL: connect to address 10.192.32.54 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:02:20] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:02:51] <mutante>	 !log phab2002 - disable puppet, fix sshd_config, restart sshd
[23:02:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:24] <icinga-wm>	 RECOVERY - SSH on phab2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:04:12] <icinga-wm>	 RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:09:26] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:16:32] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:43] <wikibugs>	 (PS1) Dzahn: phabricator: fix sshd listen address for phab codfw [puppet] - https://gerrit.wikimedia.org/r/824797 (https://phabricator.wikimedia.org/T280597)
[23:21:00] <icinga-wm>	 RECOVERY - Check systemd state on mw2397 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:23:26] <wikibugs>	 (PS1) Dzahn: phabricator: move vcs and LVS settings from common to phab2001 [puppet] - https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597)
[23:24:36] <wikibugs>	 (CR) Dzahn: "also fixes the TODO to de-duplicate this stuff.. better.. just remove it all" [puppet] - https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[23:29:45] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36853/" [puppet] - https://gerrit.wikimedia.org/r/824797 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[23:32:31] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "noop on phab1001 (prod), 2001. fixed / re-enabled puppet and sshd on 2002" [puppet] - https://gerrit.wikimedia.org/r/824797 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[23:33:48] <mutante>	 !log phab2002 - re-enabled puppet, sshd config ListenAddress fixed by puppet gerrit:824797 - now has phabricator prod role but without LVS/git-ssh - no more error in puppet run - T280597
[23:33:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:52] <stashbot>	 T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597
[23:35:37] <mutante>	 !log phab2002 - service phd: stopped  phabricator_logmail: disabled,   phabricator dumps: disabled,  systemd::sysuser: not used (all via Hiera switches)  - T280597
[23:35:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:37] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on phab2002.codfw.wmnet with reason: new host in setup
[23:37:52] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on phab2002.codfw.wmnet with reason: new host in setup
[23:41:00] <wikibugs>	 (CR) Dzahn: [C: -1] "this exposes there is still work to do to remove remnants of VCS in a clean way:" [puppet] - https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[23:43:36] <wikibugs>	 (CR) Dzahn: [C: -1] "it's because in phabricator::main profile we say the IPs have a default value of "undef" but ALSO the type is an IP and that's not optiona" [puppet] - https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[23:47:02] <wikibugs>	 (PS1) Dzahn: phabricator: make IP addresses for vcs optional parameters [puppet] - https://gerrit.wikimedia.org/r/824800 (https://phabricator.wikimedia.org/T280597)
[23:49:54] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:49:54] <wikibugs>	 (CR) Dzahn: [C: +2] "noop https://puppet-compiler.wmflabs.org/pcc-worker1002/36855/" [puppet] - https://gerrit.wikimedia.org/r/824800 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[23:52:31] <mutante>	 jouncebot: poke jenkins
[23:54:05] <wikibugs>	 (CR) Dzahn: [C: +2] "noop in prod" [puppet] - https://gerrit.wikimedia.org/r/824800 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[23:56:08] <wikibugs>	 (CR) Dzahn: [C: +2] "works after https://gerrit.wikimedia.org/r/c/operations/puppet/+/824800" [puppet] - https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[23:56:45] <wikibugs>	 (CR) Dzahn: [C: +2] "removes phab2001-vcs.codfw.wmnet., git-ssh.codfw.wikimedia.org IPv4 and IPv6 IP parameters from new phab host 2002" [puppet] - https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[23:58:32] <wikibugs>	 (CR) Dzahn: [C: +2] "noop on all" [puppet] - https://gerrit.wikimedia.org/r/824798 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)