[00:02:17] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:07] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:08:53] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:11:39] <wikibugs>	 SRE, serviceops, Datacenter-Switchover: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (Legoktm) a:Legoktm
[00:12:47] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:15:55] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:20:07] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:22:03] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:25:29] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:43:33] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:05] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:47:01] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:47:51] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:52:51] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:56:46] <wikibugs>	 (PS1) Legoktm: rsync::quickdatacopy: Allow having multiple destination hosts [puppet] - https://gerrit.wikimedia.org/r/715636
[00:56:48] <wikibugs>	 (PS1) Legoktm: rsync::quickdatacopy: Allow specifying a custom interval for auto_sync [puppet] - https://gerrit.wikimedia.org/r/715637
[00:56:50] <wikibugs>	 (PS1) Legoktm: [WIP] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857)
[00:58:04] <wikibugs>	 (CR) jerkins-bot: [V: -1] rsync::quickdatacopy: Allow specifying a custom interval for auto_sync [puppet] - https://gerrit.wikimedia.org/r/715637 (owner: Legoktm)
[01:00:10] <wikibugs>	 (PS2) Legoktm: rsync::quickdatacopy: Allow specifying a custom interval for auto_sync [puppet] - https://gerrit.wikimedia.org/r/715637
[01:00:12] <wikibugs>	 (PS2) Legoktm: [WIP] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857)
[01:00:38] <legoktm>	 rsync::quickdatacopy is indented with 6 spaces, and it's totally throwing my editor off
[01:01:57] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:06] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30910/console" [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[01:18:08] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30911/console" [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[01:18:57] <wikibugs>	 (CR) Legoktm: "The one thing I'm not sure of is where I'm supposed to set the rsync::server::wrap_with_stunnel hiera." [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[01:25:06] <wikibugs>	 (CR) Legoktm: [C: -1] "This doesn't work how I want for deployment::rsync because in that module we have the IPs of the hosts, not the actual fqdns. We could eit" [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[01:25:25] <wikibugs>	 (CR) Legoktm: [C: -1] "Fails PCC because of the comment I just left on Change-Id: I3964a58b736892f5f7d978606d7b80cb5b3e0ddf" [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[01:25:55] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:13] <wikibugs>	 SRE, serviceops, Datacenter-Switchover, Patch-For-Review: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (Legoktm) I'm guessing no one has done this until now because deployment::rsync was using hand-rolled rsync + timer rather than quickdatacopy. I gave it...
[01:38:22] <wikibugs>	 SRE, serviceops, Datacenter-Switchover: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (Legoktm) {T289857} has some notes on how to enable stunnel for this. However the #mw-on-k8s image building process also performs an rsync against the releases host, so it might also n...
[01:59:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:00:04] <jouncebot>	 Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T0200)
[02:01:25] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:47] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:06:48] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.37.0-wmf.21 [core] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/715643
[02:06:50] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.37.0-wmf.21 [core] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/715643 (owner: TrainBranchBot)
[02:08:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:08:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:09:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:33] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:00] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.37.0-wmf.21 [core] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/715643 (owner: TrainBranchBot)
[02:31:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:33:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:15] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:25] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:17] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:11] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:12:31] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:15:37] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:53] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:16:17] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:25:41] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:05] <wikibugs>	 (PS1) Marostegui: Revert "db2110: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/715523
[04:26:41] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:51:31] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:53:05] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:02:21] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:02:41] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:02:47] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:23] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:17:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:26:31] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:51] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:51] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:48:45] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:49:05] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:53:19] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2110: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/715523 (owner: Marostegui)
[05:55:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 10%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17109 and previous config saved to /var/cache/conftool/dbconfig/20210831-055546-root.json
[05:55:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:51] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[06:01:53] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:06:27] <marostegui>	 !log Rename flaggedrevs_stats2 and flaggedrevs_stats on dewiki codfw T289050
[06:06:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:31] <stashbot>	 T289050: MyISAM flaggedrevs_stats tables on several sections - https://phabricator.wikimedia.org/T289050
[06:07:27] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:07:47] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:10:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 25%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17110 and previous config saved to /var/cache/conftool/dbconfig/20210831-061049-root.json
[06:10:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:54] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[06:19:27] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:25:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 50%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17111 and previous config saved to /var/cache/conftool/dbconfig/20210831-062553-root.json
[06:25:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:59] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[06:29:15] <wikibugs>	 (CR) Volans: [C: +2] quotereviewer: add support for portal quotes [software] - https://gerrit.wikimedia.org/r/715025 (https://phabricator.wikimedia.org/T288354) (owner: Volans)
[06:29:46] <wikibugs>	 (Merged) jenkins-bot: quotereviewer: add support for portal quotes [software] - https://gerrit.wikimedia.org/r/715025 (https://phabricator.wikimedia.org/T288354) (owner: Volans)
[06:29:51] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:30:39] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 610 ge 480 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[06:34:47] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:05] <icinga-wm>	 PROBLEM - Host cp5003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:38:33] <icinga-wm>	 PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100%
[06:38:37] <icinga-wm>	 PROBLEM - Host cp5014 is DOWN: PING CRITICAL - Packet loss = 100%
[06:38:51] <icinga-wm>	 PROBLEM - Host cp5011 is DOWN: PING CRITICAL - Packet loss = 100%
[06:39:09] <icinga-wm>	 PROBLEM - Host bast5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:40:15] <icinga-wm>	 PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100%
[06:40:15] <icinga-wm>	 PROBLEM - Host cp5009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:40:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 75%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17112 and previous config saved to /var/cache/conftool/dbconfig/20210831-064056-root.json
[06:41:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:02] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[06:41:47] <icinga-wm>	 PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp5015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/HTTPS
[06:42:03] <icinga-wm>	 PROBLEM - Host cp5016.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:42:07] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:42:45] <icinga-wm>	 RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp5015 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 511519 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2021-11-16 23:59:59 +0000 (expires in 77 days) https://wikitech.wikimedia.org/wiki/HTTPS
[06:43:03] <icinga-wm>	 PROBLEM - Host lvs5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:45:38] <elukey>	 rack down in eqsin?
[06:45:46] <elukey>	 https://netbox.wikimedia.org/dcim/racks/78/
[06:46:33] <elukey>	 weird I can ssh to cp5003 and ping other cp nodes
[06:47:01] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:47] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:57] <majavah>	 elukey: maybe some network link down causing only partial unavailability?
[06:48:30] <elukey>	 I am checking what's wrong
[06:48:33] <icinga-wm>	 PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:49:16] <elukey>	 from cp5011 I can ping icinga.wikimedia.org only via v6
[06:50:18] <elukey>	 XioNoX, topranks - around?
[06:50:19] <wikibugs>	 (PS2) Majavah: toolforge: remove portgrabber [puppet] - https://gerrit.wikimedia.org/r/714187
[06:50:43] <XioNoX>	 yo
[06:50:47] <wikibugs>	 (CR) Majavah: toolforge: remove portgrabber (2 comments) [puppet] - https://gerrit.wikimedia.org/r/714187 (owner: Majavah)
[06:50:50] <wikibugs>	 (CR) jerkins-bot: [V: -1] toolforge: remove portgrabber [puppet] - https://gerrit.wikimedia.org/r/714187 (owner: Majavah)
[06:51:27] <wikibugs>	 (PS3) Majavah: toolforge: remove portgrabber [puppet] - https://gerrit.wikimedia.org/r/714187
[06:51:46] <elukey>	 XioNoX: hello! I am not sure what's happening, but it seems that icinga fails to reach (via ipv4 afaics) a rack in eqsin
[06:52:07] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:52:35] <icinga-wm>	 PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:52:55] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:53:06] <XioNoX>	 looking
[06:53:28] <elukey>	 XioNoX: better - I wasn't able to ping from one of the affected cp nodes to icinga.wikimedia.org via v4, but I just tried from alert2001 and both work
[06:53:34] <elukey>	 (v4 and v6)
[06:53:43] <elukey>	 I don't see Varnish traffic issues and I can ssh to nodes
[06:54:03] <XioNoX>	 looks like the eqsin-codfw link flapped
[06:55:51] <elukey>	 but I thought we had the alternative path via ulsfo, I expected some recovery
[06:56:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 100%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17113 and previous config saved to /var/cache/conftool/dbconfig/20210831-065600-root.json
[06:56:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:06] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[06:56:07] <XioNoX>	 yeah we do
[06:56:29] <elukey>	 trying to force a recheck in icinga
[06:57:37] <XioNoX>	 elukey: IPv4 still doesn't fully go through
[06:58:03] <XioNoX>	 I'm going to drain the telia link
[06:58:08] <elukey>	 ack thanks
[06:58:34] <elukey>	 I just noticed from cp5003 that I can reach alert2001 but not 1001 via v4
[06:58:58] <elukey>	 and  traceroute says cr1-codfw
[06:59:16] <elukey>	 v6 goes through ulsfo, ok it makes sense
[07:00:51] <XioNoX>	 yeah it's silently dropping traffic but not dropping BFD/OSPF...
[07:00:55] <XioNoX>	 already happened in the past
[07:00:58] <elukey>	 lovely
[07:01:29] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:01:45] <XioNoX>	 !log drain eqsin-codfw link
[07:01:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:57] <icinga-wm>	 RECOVERY - Host doh5001 is UP: PING OK - Packet loss = 0%, RTA = 232.86 ms
[07:01:57] <icinga-wm>	 RECOVERY - Host cp5003 is UP: PING OK - Packet loss = 0%, RTA = 232.39 ms
[07:01:57] <icinga-wm>	 RECOVERY - Host cp5011 is UP: PING WARNING - Packet loss = 80%, RTA = 319.29 ms
[07:02:01] <icinga-wm>	 RECOVERY - Host cp5014 is UP: PING OK - Packet loss = 0%, RTA = 232.48 ms
[07:02:03] <icinga-wm>	 RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 302.19 ms
[07:02:33] <elukey>	 goooood
[07:02:41] <wikibugs>	 SRE, Analytics, Analytics-Kanban, Prod-Kubernetes, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (JMeybohm) @Ottomata that looks unrelated to your change (but related to yours @Jelto ). We will take a look!
[07:03:04] <elukey>	 thanks XioNoX and majavah 
[07:03:12] <XioNoX>	 elukey: thank you
[07:03:16] <XioNoX>	 I'll follow up with Telia
[07:03:27] <icinga-wm>	 RECOVERY - Host bast5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 245.95 ms
[07:03:27] <icinga-wm>	 RECOVERY - Host cp5009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 246.40 ms
[07:03:27] <icinga-wm>	 RECOVERY - Host cp5016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 251.24 ms
[07:03:27] <icinga-wm>	 RECOVERY - Host lvs5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 249.37 ms
[07:04:56] <wikibugs>	 (CR) Majavah: [C: +2] "I'm going to merge this but not do a release until we either start building a grid on a newer Debian release or need to make one for other" [software/tools-manifest] - https://gerrit.wikimedia.org/r/713663 (https://phabricator.wikimedia.org/T278748) (owner: Majavah)
[07:05:26] <wikibugs>	 (Merged) jenkins-bot: Do not compare OS versions [software/tools-manifest] - https://gerrit.wikimedia.org/r/713663 (https://phabricator.wikimedia.org/T278748) (owner: Majavah)
[07:06:15] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:07:17] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:56] <wikibugs>	 (PS1) DCausse: query service: Fix loading of DCATAP file [puppet] - https://gerrit.wikimedia.org/r/715696 (https://phabricator.wikimedia.org/T289517)
[07:12:19] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:31] <wikibugs>	 (PS1) JMeybohm: kube_env: Error out of user has no read permission to kubeconfig [puppet] - https://gerrit.wikimedia.org/r/715698
[07:12:46] <jinxer-wm>	 (Traffic on tunnel link) firing: Traffic on tunnel link   - https://alerts.wikimedia.org
[07:15:51] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:17:46] <jinxer-wm>	 (Traffic on tunnel link) firing: (2) Traffic on tunnel link   - https://alerts.wikimedia.org
[07:22:43] <ema>	 elukey, XioNoX: thanks! I see a 5xx blip in eqsin between 6:38 and 6:45, nothing else 
[07:23:19] <ema>	 https://w.wiki/3zMC
[07:26:37] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:59] <wikibugs>	 (CR) Ema: [V: +2 C: +2] Add Varnish SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036) (owner: Ema)
[07:32:46] <jinxer-wm>	 (Traffic on tunnel link) firing: (2) Traffic on tunnel link   - https://alerts.wikimedia.org
[07:32:59] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:33:45] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:46] <jinxer-wm>	 (Traffic on tunnel link) resolved: Traffic on tunnel link   - https://alerts.wikimedia.org
[07:39:22] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +2 C: +1] wcqs: add wcqs.discovery.wmnet dummy key [labs/private] - https://gerrit.wikimedia.org/r/715570 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper)
[07:40:19] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] wcqs: create tls cert [puppet] - https://gerrit.wikimedia.org/r/715569 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper)
[07:44:27] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:44:43] <marostegui>	 !log Optimize ruwiki.flaggedtemplates T290057
[07:44:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:48] <stashbot>	 T290057: Optimize flaggedtemplates tables in production. - https://phabricator.wikimedia.org/T290057
[07:45:13] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:47:32] <wikibugs>	 SRE, Traffic, Patch-For-Review, SRE Observability (FY2021/2022-Q1): Use Grizzly for Varnish SLO Grafana dashboard - https://phabricator.wikimedia.org/T289036 (ema) @herron: I've merged the patch, forced a puppet run on grafana1002.eqiad.wmnet, and followed the instructions at https://wikitech.wik...
[07:48:18] <wikibugs>	 (PS1) Majavah: P::toolforge::apt_pinning: bullseye support [puppet] - https://gerrit.wikimedia.org/r/715700
[07:52:04] <wikibugs>	 (PS1) Majavah: toolforge: drop legacy webservice endpoints on bullseye [puppet] - https://gerrit.wikimedia.org/r/715701
[07:56:44] <wikibugs>	 (CR) Kosta Harlan: bullseye-sssd: Add openssh-client (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/715215 (https://phabricator.wikimedia.org/T258841) (owner: Kosta Harlan)
[08:01:14] <wikibugs>	 SRE, DNS, Traffic: More DNS entries for WikiLearn servers - https://phabricator.wikimedia.org/T290025 (fgiunchedi) p:Triage→Medium
[08:02:47] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] admin: Add bgwiki (Bethany) to the list of privileged ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892) (owner: 10Jcrespo)
[08:05:09] <marostegui>	 !log Optimize plwiktionary.flaggedtemplates T290057
[08:05:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:14] <stashbot>	 T290057: Optimize flaggedtemplates tables in production. - https://phabricator.wikimedia.org/T290057
[08:06:39] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to <WMF> for <Bethany> - https://phabricator.wikimedia.org/T289892 (10fgiunchedi) 05Open→03Resolved This is complete, thank you all!
[08:09:13] <jynus>	 ^ godog, as I said on the patch, that is missing the actual group change
[08:10:11] <godog>	 jynus: doh, I missed it
[08:11:25] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to <WMF> for <Bethany> - https://phabricator.wikimedia.org/T289892 (10fgiunchedi) 05Resolved→03Open It wasn't resolved, still pending a group change
[08:11:39] <godog>	 jynus: so ok to add to wmf correct?
[08:11:57] <jynus>	 yep, only was waiting on her updating her email on wikitech
[08:12:00] <jynus>	 which was done
[08:13:13] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:14:01] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:14:46] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Emails on wlm-announce seem not to have arrived (due to banlist) - https://phabricator.wikimedia.org/T289928 (10Aklapper)
[08:15:52] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to <WMF> for <Bethany> - https://phabricator.wikimedia.org/T289892 (10fgiunchedi) 05Open→03Resolved We're all done, please verify access @Bethany !
[08:18:13] <marostegui>	 !log Optimize cewiki.flaggedtemplates T290057
[08:18:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:19] <stashbot>	 T290057: Optimize flaggedtemplates tables in production. - https://phabricator.wikimedia.org/T290057
[08:19:17] <wikibugs>	 (03Abandoned) 10Abijeet Patro: Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/714987 (owner: 10L10n-bot)
[08:22:10] <wikibugs>	 (03CR) 10Abijeet Patro: "Sorry, in favor of: I2a0adb06199c1b3d818a8fce7d80769f0c503948" [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/714987 (owner: 10L10n-bot)
[08:25:22] <wikibugs>	 (03CR) 10Abijeet Patro: [V: 03+2] "Looks good." [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/715477 (owner: 10L10n-bot)
[08:25:43] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:31:15] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Make customized Mailman3 templates translatable - https://phabricator.wikimedia.org/T282018 (10abi_)
[08:32:12] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net, 10Language-Team (Language-2021-July-September): Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10abi_) 05Open→03Resolved We have the necessary permissions now. Exports are working properly. Resolving this ta...
[08:38:09] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:38:49] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 130, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:39:40] <marostegui>	 !log Optimize plwiki.flaggedtemplates T290057
[08:39:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:45] <stashbot>	 T290057: Optimize flaggedtemplates tables in production. - https://phabricator.wikimedia.org/T290057
[08:55:09] <wikibugs>	 (03PS1) 10Majavah: P::toolforge::redis_sentinel: Block REPLICAOF too [puppet] - 10https://gerrit.wikimedia.org/r/715703
[08:59:27] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "Good catch" [puppet] - 10https://gerrit.wikimedia.org/r/715703 (owner: 10Majavah)
[09:01:59] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:04:57] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:05:37] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:05:41] <wikibugs>	 (03CR) 10Michael Große: "T235292 has been adjusted to also include P360" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583407 (https://phabricator.wikimedia.org/T235292) (owner: 10Lucas Werkmeister (WMDE))
[09:26:41] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:33:38] <wikibugs>	 10SRE, 10DNS, 10Traffic: More DNS entries for WikiLearn servers - https://phabricator.wikimedia.org/T290025 (10Vgutierrez) I guess you also need a proper CAA record to authorize AWS CA to issue certs for learn.wiki
[09:44:59] <wikibugs>	 (03PS1) 10Vgutierrez: learn.wiki: Add LB and CA validation CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/715706 (https://phabricator.wikimedia.org/T290025)
[09:50:05] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: More DNS entries for WikiLearn servers - https://phabricator.wikimedia.org/T290025 (10Vgutierrez) From https://docs.aws.amazon.com/acm/latest/userguide/setup-caa.html it looks like any of amazon.com, amazontrust.com, awstrust.com or amazonaws.com would do it as...
[10:02:55] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:05:54] <wikibugs>	 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10TheDJ) Just a thank you to Tim and Lego for working on this for all that time. I know its been quite a bit o...
[10:10:57] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] "nice, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/715552 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[10:11:50] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1008.eqiad.wmnet
[10:11:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:18] <wikibugs>	 (03CR) 10Ema: [C: 03+1] learn.wiki: Add LB and CA validation CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/715706 (https://phabricator.wikimedia.org/T290025) (owner: 10Vgutierrez)
[10:12:23] <wikibugs>	 (03PS2) 10Ladsgroup: Set permission of creating short url to everyone everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921)
[10:12:55] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] learn.wiki: Add LB and CA validation CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/715706 (https://phabricator.wikimedia.org/T290025) (owner: 10Vgutierrez)
[10:14:51] <marostegui>	 !log Optimize huwiki.flaggedtemplates T290057
[10:14:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:56] <stashbot>	 T290057: Optimize flaggedtemplates tables in production. - https://phabricator.wikimedia.org/T290057
[10:16:53] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: More DNS entries for WikiLearn servers - https://phabricator.wikimedia.org/T290025 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez ` vgutierrez@carrot:~$ host -t CAA learn.wiki learn.wiki has CAA record 0 issue "letsencrypt.org" learn.wiki has CAA record 0...
[10:16:59] <wikibugs>	 (03PS3) 10Ladsgroup: Set permission of creating short url to everyone everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921)
[10:17:03] <wikibugs>	 (03CR) 10Ladsgroup: Set permission of creating short url to everyone everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup)
[10:17:10] <wikibugs>	 (03PS1) 10Hnowlan: aqs_next: use same druid datasource as aqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/715712
[10:17:37] <wikibugs>	 (03PS3) 10Jbond: confd/confd-lint-wrap.py: Port for Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658414 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[10:18:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] confd/confd-lint-wrap.py: Port for Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658414 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[10:18:41] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "Thanks a lot @hnowlan :)" [puppet] - 10https://gerrit.wikimedia.org/r/715712 (owner: 10Hnowlan)
[10:23:12] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1008.eqiad.wmnet
[10:23:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:27] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1010.eqiad.wmnet
[10:23:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:41] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] aqs_next: use same druid datasource as aqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/715712 (owner: 10Hnowlan)
[10:25:47] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:26:33] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:28:27] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:32:39] <wikibugs>	 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (10jcrespo) FYI I have now started backups of commonswiki with only 4 read threads on eqiad. So far I've seen no impact on latency, and not even on the total amount of reads/s (it is very serial, s...
[10:38:44] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on maps1010.eqiad.wmnet with reason: Resyncing from master
[10:38:46] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on maps1010.eqiad.wmnet with reason: Resyncing from master
[10:38:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:42] <wikibugs>	 (03CR) 10MVernon: "Hi," [debs/python-eventlet] (debian/bullseye) - 10https://gerrit.wikimedia.org/r/715199 (https://phabricator.wikimedia.org/T283714) (owner: 10Filippo Giunchedi)
[10:56:02] <wikibugs>	 (03CR) 10MVernon: Fix dnspython 2 compat (031 comment) [debs/python-eventlet] (debian/bullseye) - 10https://gerrit.wikimedia.org/r/715199 (https://phabricator.wikimedia.org/T283714) (owner: 10Filippo Giunchedi)
[11:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T1100)
[11:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:24] <MatmaRex>	 hello
[11:00:29] <urbanecm>	 I can deploy today
[11:00:32] <urbanecm>	 Hello MatmaRex 
[11:00:56] <wikibugs>	 (03PS3) 10Urbanecm: Offer the DiscussionTools reply tool as opt-out setting at 21 Wikipedias ("phase 2") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715574 (https://phabricator.wikimedia.org/T288483) (owner: 10Bartosz Dziewoński)
[11:01:16] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Offer the DiscussionTools reply tool as opt-out setting at 21 Wikipedias ("phase 2") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715574 (https://phabricator.wikimedia.org/T288483) (owner: 10Bartosz Dziewoński)
[11:02:01] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:02:02] <wikibugs>	 (03Merged) 10jenkins-bot: Offer the DiscussionTools reply tool as opt-out setting at 21 Wikipedias ("phase 2") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715574 (https://phabricator.wikimedia.org/T288483) (owner: 10Bartosz Dziewoński)
[11:03:50] <urbanecm>	 MatmaRex: your patch is at mwdebug2001, can you have a look?
[11:04:15] <MatmaRex>	 yep
[11:04:56] <MatmaRex>	 seems good
[11:05:03] <MatmaRex>	 tested at kowiki
[11:05:12] <urbanecm>	 syncing
[11:06:47] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: eb482e3fa88a87166b990fd9b87d0ccbbf971290: Offer the DiscussionTools reply tool as opt-out setting at 21 phase 2 Wikipedias (T288483) (duration: 00m 57s)
[11:06:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: 2021-08-26 Primary inbound port utilisation over 80% page for mr1-esams.wikimedia.org - https://phabricator.wikimedia.org/T289820 (10ayounsi) I removed the management routers from the wrong alert, that's why we got paged again. It's now fixed so it won't page wh...
[11:06:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:52] <stashbot>	 T288483: Deploy config to make Reply Tool available as opt-out at phase 2 wikis - https://phabricator.wikimedia.org/T288483
[11:06:55] <urbanecm>	 MatmaRex: here you go!
[11:06:58] <urbanecm>	 anything else?
[11:07:30] <MatmaRex>	 that's all, thanks
[11:07:36] <urbanecm>	 any time :)
[11:07:57] <urbanecm>	 !log EU B&C window done
[11:08:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:20] <urbanecm>	 or actually...
[11:08:33] <wikibugs>	 (03PS1) 10Urbanecm: updateMenteeData: Send timing to statsd [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715524 (https://phabricator.wikimedia.org/T278971)
[11:08:39] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] updateMenteeData: Send timing to statsd [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715524 (https://phabricator.wikimedia.org/T278971) (owner: 10Urbanecm)
[11:08:57] <urbanecm>	 let's get this out too
[11:09:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:09:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:11:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] lldp fact: add new parent key to lldp [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond)
[11:21:06] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) Backup of commonswiki started, around 70K files backed up (slowly) so far:   ` root@db1176.eqiad.wmnet[mediabackups]>...
[11:26:57] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:28:19] <wikibugs>	 (03Merged) 10jenkins-bot: updateMenteeData: Send timing to statsd [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715524 (https://phabricator.wikimedia.org/T278971) (owner: 10Urbanecm)
[11:28:24] <urbanecm>	 \o
[11:31:42] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/GrowthExperiments/maintenance/updateMenteeData.php: 53a1856128edb4ec3a5ea8840fb6755a1703f7ac: updateMenteeData: Send timing to statsd (T278971) (duration: 00m 57s)
[11:31:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:48] <stashbot>	 T278971: Mentor dashboard: M1 mentee overview module  - https://phabricator.wikimedia.org/T278971
[11:31:48] * urbanecm done for real
[11:32:24] <wikibugs>	 (03PS1) 10Hnowlan: maps1009: remove temporary overrides [puppet] - 10https://gerrit.wikimedia.org/r/715721
[11:33:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:33:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:35:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:19] <wikibugs>	 10SRE, 10Commons, 10Datasets-Archiving, 10Datasets-General-or-Unknown, and 2 others: Back up of Commons files - https://phabricator.wikimedia.org/T160229 (10jcrespo) a:03jcrespo It's happening: https://www.youtube.com/watch?v=imbGdfzckrI See T262668#7321326 for details.
[11:38:38] <jynus>	 there was now a latency spike on swift, but it is recent
[11:39:20] <wikibugs>	 (03PS1) 10Urbanecm: mediawiki/maintenance/growthexperiments.pp: Add --statsd to updateMenteeData.php [puppet] - 10https://gerrit.wikimedia.org/r/715723 (https://phabricator.wikimedia.org/T278971)
[11:40:08] <jynus>	 was something deployed at around 11:26?
[11:40:58] <urbanecm>	 jynus: https://sal.toolforge.org/log/30T6m3sB8Fs0LHO5r1aT, but that has zero chance to do anything to swift
[11:41:13] <jynus>	 that's what I would have thought
[11:41:30] <jynus>	 maybe a normal cache thing or something? or something else
[11:41:43] <urbanecm>	 or a requests spike?
[11:42:05] <jynus>	 that is actually quite constant
[11:42:13] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] maps: add wikidata polygon table and script fixes [puppet] - 10https://gerrit.wikimedia.org/r/715216 (owner: 10MSantos)
[11:42:34] <jynus>	 but it could be traffic related, yes
[11:42:45] <jynus>	 I will check cache graphs
[11:43:38] <urbanecm>	 or your commonswiki backup maybe jynus?
[11:43:59] <jynus>	 that is what I wanted to know, but I've been running it for over an hour
[11:44:11] <jynus>	 and this is only from :26
[11:44:16] <urbanecm>	 i see
[11:44:26] <jynus>	 there is some unavailability on ulsfo and upload
[11:44:31] <jynus>	 starting at that time
[11:44:43] <jynus>	 but that could be just a consequence of increased latency
[11:46:26] <jynus>	 there is an increase in network io
[11:46:44] <jynus>	 2 spikes
[11:48:14] <jynus>	 https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?viewPanel=17&orgId=1&from=1630388886051&to=1630410426051&var-DC=eqiad&var-prometheus=eqiad%20prometheus%2Fops
[11:50:15] <jynus>	 things seem back to normal
[11:51:45] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 563 ge 480 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[12:01:50] <godog>	 I'll take a look at the indexing errors
[12:02:01] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:47] <godog>	 looks like it was a spike
[12:08:11] <wikibugs>	 (03PS1) 10Jbond: admin: droprequire [puppet] - 10https://gerrit.wikimedia.org/r/715724 (https://phabricator.wikimedia.org/T263578)
[12:09:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: droprequire [puppet] - 10https://gerrit.wikimedia.org/r/715724 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond)
[12:10:23] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10dr0ptp4kt) @jcrespo we have not allocated a wikimedia.org email address. Is that required? If so, I'll ask ITS to provision one. I'm out the rest of the day, heads up.
[12:15:28] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (10Dzahn) All files sent to releases are meant to be available to the world though. Does it still matter to encrypt traffic internally for something like this?
[12:25:31] <wikibugs>	 (03PS1) 10Dzahn: load mod_alias to be able to use Redirect Directive [container/miscweb] - 10https://gerrit.wikimedia.org/r/715727 (https://phabricator.wikimedia.org/T281538)
[12:26:27] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:29:27] <wikibugs>	 (03PS1) 10Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578)
[12:30:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond)
[12:32:51] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:39:22] <wikibugs>	 (03PS1) 10Dzahn: admin: create a group to run the wmf-auto-reimage cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/715729
[12:40:01] <wikibugs>	 (03PS2) 10Dzahn: admin: create a group to run the wmf-auto-reimage cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/715729
[12:42:34] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue
[12:42:35] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue
[12:42:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:36] <wikibugs>	 (03CR) 10Dzahn: "Try Stdlib::Host instead of ::Fqdn. That would cover both IPs and hosts and I have been told by other reviewers to use that instead in oth" [puppet] - 10https://gerrit.wikimedia.org/r/715636 (owner: 10Legoktm)
[12:45:00] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "FYI there is https://phabricator.wikimedia.org/T289779 with a slightly more generic approach that doesn't apply to this specific use case." [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[12:45:44] <wikibugs>	 (03PS2) 10Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578)
[12:46:12] <wikibugs>	 (03PS3) 10Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578)
[12:48:54] <wikibugs>	 (03CR) 10MSantos: [C: 04-1] "I was with the impression we were going to re-enable for codfw, eqiad needs to have those set until the new mapping is applied with a new " [puppet] - 10https://gerrit.wikimedia.org/r/715721 (owner: 10Hnowlan)
[12:50:05] <wikibugs>	 (03CR) 10Dzahn: "Yes, thank you, though this is specifically meant to be a quick fix that can be removed again as soon as we have better generic solutions." [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[12:51:41] <wikibugs>	 (03PS3) 10Dzahn: admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729
[12:52:00] <jelto>	 !log run kubectl scale deployments.apps -n ci mediawiki-bruce --replicas=0 to stop ImagePulling and reduce io on kubestage1001
[12:52:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:36] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] load mod_alias to be able to use Redirect Directive [container/miscweb] - 10https://gerrit.wikimedia.org/r/715727 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn)
[12:54:35] <wikibugs>	 (03Merged) 10jenkins-bot: load mod_alias to be able to use Redirect Directive [container/miscweb] - 10https://gerrit.wikimedia.org/r/715727 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn)
[12:59:05] <urbanecm>	 !log [urbanecm@mwmaint2002 ~]$ sudo -u www-data kill 133282 # stop updateMenteeData.php at frwiki
[12:59:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:40] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm)
[13:02:43] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:04:26] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[13:04:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:42] <wikibugs>	 10SRE, 10Analytics, 10Analytics-Kanban, 10Prod-Kubernetes, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (10JMeybohm) >>! In T255871#7320889, @JMeybohm wrote: > @Ottomata that looks unrelated to your chance (but related to yours @Jelto ). We will take a l...
[13:06:23] <logmsgbot>	 !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[13:06:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:09] <wikibugs>	 (03PS1) 10Jbond: admin: create new sre-admins group to match the ldap group [puppet] - 10https://gerrit.wikimedia.org/r/715731 (https://phabricator.wikimedia.org/T289779)
[13:10:17] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30914/console" [puppet] - 10https://gerrit.wikimedia.org/r/715731 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond)
[13:11:13] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] admin: create new sre-admins group to match the ldap group [puppet] - 10https://gerrit.wikimedia.org/r/715731 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond)
[13:15:05] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30915/console" [puppet] - 10https://gerrit.wikimedia.org/r/715731 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond)
[13:19:48] <wikibugs>	 (03PS4) 10Jbond: admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[13:20:55] <wikibugs>	 (03PS1) 10Jbond: admin: add sre-admins to the always group [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779)
[13:21:35] <wikibugs>	 (03PS5) 10Jbond: admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[13:21:52] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30916/console" [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond)
[13:21:59] <wikibugs>	 (03PS6) 10Jbond: admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[13:22:07] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (10fgiunchedi) IMHO yes, we should encrypt traffic unless we have reasons not to (e.g. system is going to be retired, too hard/complex to implement vs advantages, etc)
[13:22:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[13:22:55] <wikibugs>	 (03PS7) 10Jbond: admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[13:23:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:23:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[13:24:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:25:06] <wikibugs>	 (03PS8) 10Jbond: admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[13:26:17] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30918/console" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[13:27:17] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "I have added the sre-admins group and updated this CR to use that instead.  LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[13:28:08] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1010.eqiad.wmnet
[13:28:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:05] <wikibugs>	 (03CR) 10Dzahn: "heh, thank you :)" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[13:32:46] <wikibugs>	 (03PS1) 10Filippo Giunchedi: icinga: add dancy,thcipriani,hashar to icinga authorized service/host [puppet] - 10https://gerrit.wikimedia.org/r/715735 (https://phabricator.wikimedia.org/T289746)
[13:34:50] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:35:52] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "this is option b) from https://phabricator.wikimedia.org/T289746#7311563 . lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/715735 (https://phabricator.wikimedia.org/T289746) (owner: 10Filippo Giunchedi)
[13:36:16] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] profile::maps::osm_replica: Allow replicas to be connected to by tegola [puppet] - 10https://gerrit.wikimedia.org/r/710013 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan)
[13:37:43] <urbanecm>	 !log Start `mwscript extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php --wiki=nlwiki --verbose` in a tmux session at mwmaint2002
[13:37:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "New users just need to be aware of the caveat. It's possible to login both with and without capitalization (we just slapped apache auth in" [puppet] - 10https://gerrit.wikimedia.org/r/715735 (https://phabricator.wikimedia.org/T289746) (owner: 10Filippo Giunchedi)
[13:39:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, please add a PCC run too" [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm)
[13:40:19] <wikibugs>	 10SRE, 10Analytics, 10Analytics-Kanban, 10Prod-Kubernetes, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (10Ottomata) Great!  Proceeding...
[13:40:24] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:41:28] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[13:41:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:15] <wikibugs>	 (03CR) 10Jbond: "see inlines for nits" [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm)
[13:43:29] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+1] admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[13:45:48] <wikibugs>	 (03CR) 10Dzahn: "We have a process how to add new members to existing groups but not really for how to create new admin groups. Let's get a +1 from Wolfgan" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[13:47:06] <jbond>	 !log disable puppet fleet wide to perform puppetdb maintenance T263578
[13:47:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:12] <stashbot>	 T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578
[13:53:16] <icinga-wm>	 PROBLEM - Host puppetdb1002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:54:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30919/console" [puppet] - 10https://gerrit.wikimedia.org/r/715597 (https://phabricator.wikimedia.org/T288806) (owner: 10Zabe)
[13:54:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] "LGTM, thank you for taking care of this. I'll deploy it next week when I'm off clinic duty" [puppet] - 10https://gerrit.wikimedia.org/r/715597 (https://phabricator.wikimedia.org/T288806) (owner: 10Zabe)
[13:55:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_puppetdb site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:58:22] <icinga-wm>	 RECOVERY - Host puppetdb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms
[13:59:33] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] icinga: add dancy,thcipriani,hashar to icinga authorized service/host [puppet] - 10https://gerrit.wikimedia.org/r/715735 (https://phabricator.wikimedia.org/T289746) (owner: 10Filippo Giunchedi)
[14:00:32] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[14:00:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: add dancy,thcipriani,hashar to icinga authorized service/host [puppet] - 10https://gerrit.wikimedia.org/r/715735 (https://phabricator.wikimedia.org/T289746) (owner: 10Filippo Giunchedi)
[14:01:46] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[14:01:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:04] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on puppetdb1002.eqiad.wmnet with reason: puppetdb maintance - T289779
[14:02:06] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on puppetdb1002.eqiad.wmnet with reason: puppetdb maintance - T289779
[14:02:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:08] <stashbot>	 T289779: Creat a new ldap group for sre users without root access - https://phabricator.wikimedia.org/T289779
[14:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:21] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on puppetdb2002.codfw.wmnet with reason: puppetdb maintance - T289779
[14:02:23] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on puppetdb2002.codfw.wmnet with reason: puppetdb maintance - T289779
[14:02:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:34] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[14:02:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:20] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[14:03:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:23] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10jcrespo) I was asked by SRE Infrastructure Foundations to ask you this, as a production alert has gone off because of this.
[14:03:57] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[14:03:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:16] <wikibugs>	 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10mepps) @Niharika Based on my read, it also looks like the 10 day delay would only be when there were holidays too. What's the next step...
[14:04:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:05:17] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[14:05:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:42] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[14:05:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for the review!" [debs/python-eventlet] (debian/bullseye) - 10https://gerrit.wikimedia.org/r/715199 (https://phabricator.wikimedia.org/T283714) (owner: 10Filippo Giunchedi)
[14:07:51] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[14:07:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_puppetdb site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:09:16] <wikibugs>	 (03PS5) 10Ottomata: admin README - convert to markdown and clarify system user/group docs [puppet] - 10https://gerrit.wikimedia.org/r/708777
[14:09:21] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[14:09:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:36] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudcephosd1014.mgmt reported down by icinga - https://phabricator.wikimedia.org/T289755 (10Cmjohnson) 05Open→03Resolved replaced the cable
[14:09:59] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[14:10:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:35] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] admin README - convert to markdown and clarify system user/group docs [puppet] - 10https://gerrit.wikimedia.org/r/708777 (owner: 10Ottomata)
[14:11:18] <icinga-wm>	 RECOVERY - Host cloudcephosd1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.41 ms
[14:12:19] <wikibugs>	 (03PS3) 10Ottomata: service_auto_restart - match full line when ensuring absent [puppet] - 10https://gerrit.wikimedia.org/r/697605
[14:15:45] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) I have added a new 100GB disk so that the system has enough space to preform the vacume.  this has meant doing the following * add new ga...
[14:16:14] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] service_auto_restart - match full line when ensuring absent [puppet] - 10https://gerrit.wikimedia.org/r/697605 (owner: 10Ottomata)
[14:16:25] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "Tested in deployment-prep,  nothing bad happened..." [puppet] - 10https://gerrit.wikimedia.org/r/697605 (owner: 10Ottomata)
[14:16:48] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10jcrespo)
[14:19:08] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10Ottomata) I think I can still approve these for Analytics access.  Approved!
[14:19:28] <ottomata>	 !log merged change to service_auto_restart.pp that changes the way service names are matched to be more explicit.  tested in deployment prep and nothing bad happened.  Logging in case something bad does happen in prod.  https://gerrit.wikimedia.org/r/c/operations/puppet/+/697605
[14:19:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:47] <wikibugs>	 (03CR) 10Jcrespo: "Hey, @Ottomata, I wanted to update this based on our own manual (specially, as you were on vacations), but if the director delegates this " [puppet] - 10https://gerrit.wikimedia.org/r/715259 (owner: 10Jcrespo)
[14:22:39] <wikibugs>	 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Ottomata) I just tried to run puppet on an-coord1001 but got:  ` Notice: Skipping run of Puppet configuration client; administratively disabled (Rea...
[14:23:36] <wikibugs>	 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10RhinosF1) Puppet is under maintenance
[14:23:50] <wikibugs>	 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Ottomata) Oh, sorry @jbond is doing some maintenance and referenced the wrong phab ticket.  Ignore ^
[14:25:26] <wikibugs>	 (03CR) 10Ottomata: "Thanks!  I'll ask Olja what she thinks.  I'm likely to be a faster reviewer on phab than she is.  Could we put both of our names there?" [puppet] - 10https://gerrit.wikimedia.org/r/715259 (owner: 10Jcrespo)
[14:29:06] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:29:48] <hashar>	 !log Restarting CI Jenkins for plugins upgrade
[14:29:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:59] <jbond>	 !log enable puppet fleet wide after performing puppetdb maintenance T263578
[14:30:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:05] <stashbot>	 T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578
[14:30:45] <jbond>	 ottomata: fyi ^^^ puppet should be enabled again
[14:31:32] <ottomata>	 ty
[14:31:36] <icinga-wm>	 PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:53:18] <wikibugs>	 (03PS1) 10Vgutierrez: haproxy: Use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005)
[14:53:26] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 04-1] "Overall looks ok (by my eyes are not very experienced), some nits." [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis)
[14:53:50] <wikibugs>	 (03CR) 10Ahmon Dancy: icinga: add dancy,thcipriani,hashar to icinga authorized service/host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715735 (https://phabricator.wikimedia.org/T289746) (owner: 10Filippo Giunchedi)
[14:54:52] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 04-1] toolhub: Add helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis)
[14:55:18] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 04-1] toolhub: Add helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis)
[14:56:04] <wikibugs>	 (03PS2) 10Vgutierrez: haproxy: Use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005)
[14:56:25] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) https://www.2ndquadrant.com/en/blog/postgresql-vacuum-and-analyze-best-practice-tips/ has some good advice on autovacum settings, this is...
[14:56:31] <wikibugs>	 10SRE, 10Traffic, 10SRE Observability (FY2021/2022-Q1): Use Grizzly for Varnish SLO Grafana dashboard - https://phabricator.wikimedia.org/T289036 (10herron) Thanks @ema!  This is helpful feedback  >>! In T289036#7320951, @ema wrote: > The diff step, `grr diff dashboardname`, is unclear to me. What is dashboa...
[15:05:06] <wikibugs>	 (03PS3) 10Vgutierrez: haproxy: Use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005)
[15:05:33] <wikibugs>	 10SRE, 10Analytics, 10Analytics-Kanban, 10Prod-Kubernetes, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (10Ottomata)
[15:06:25] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Remove ocg remnant [labs/private] - 10https://gerrit.wikimedia.org/r/715744
[15:06:31] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: (WIP): Unify kubernetes users to automate user creation [labs/private] - 10https://gerrit.wikimedia.org/r/715745
[15:07:00] <wikibugs>	 10SRE, 10Gerrit, 10GitLab, 10Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Optimistically resolving! Feel free to reopen
[15:07:34] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30927/console" [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[15:07:39] <wikibugs>	 (03PS1) 10Elukey: kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919)
[15:09:15] <wikibugs>	 (03PS2) 10Elukey: kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919)
[15:09:57] <wikibugs>	 (03PS3) 10Elukey: kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919)
[15:10:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey)
[15:11:15] <wikibugs>	 (03PS4) 10Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578)
[15:13:39] <wikibugs>	 (03PS4) 10Elukey: kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919)
[15:14:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable IDs & Accounting Discrepancies - https://phabricator.wikimedia.org/T285719 (10Cmjohnson) 05Open→03Resolved Corrected all the duplicate cable ID's in eqiad.
[15:15:35] <wikibugs>	 10SRE, 10ops-eqiad: eqiad: add VC-links IDs to Netbox - https://phabricator.wikimedia.org/T268750 (10Cmjohnson) row A in eqiad has been updated
[15:17:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10Cmjohnson) I am not sure what needs to be done with this task.  There really isn't anything actionable other than to replace the scs with something else.
[15:17:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625 (10Cmjohnson)
[15:18:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: document all scs connections - https://phabricator.wikimedia.org/T175876 (10Cmjohnson) 05Open→03Resolved thanks @ayounsi there is another task for duplicate labels. That is all fixed.
[15:18:15] <wikibugs>	 (03PS5) 10Elukey: kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919)
[15:20:42] <wikibugs>	 (03PS1) 10Jbond: Gemfile: add sync as a dependency [puppet] - 10https://gerrit.wikimedia.org/r/715751
[15:23:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Gemfile: add sync as a dependency [puppet] - 10https://gerrit.wikimedia.org/r/715751 (owner: 10Jbond)
[15:24:51] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/714564 (owner: 10PipelineBot)
[15:26:13] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] zuul: migrate cron of zuul_repack to systemd timer (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[15:27:46] <wikibugs>	 (03Merged) 10jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/714564 (owner: 10PipelineBot)
[15:28:53] <wikibugs>	 (03PS2) 10Michael DiPietro: update quarry systemd and branch [puppet] - 10https://gerrit.wikimedia.org/r/714640
[15:29:20] <wikibugs>	 (03PS4) 10Vgutierrez: haproxy: Use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005)
[15:30:05] <wikibugs>	 (03PS5) 10Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578)
[15:31:31] <wikibugs>	 (03PS6) 10Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578)
[15:32:24] <icinga-wm>	 RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:36:35] <wikibugs>	 (03PS7) 10Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578)
[15:37:33] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30931/console" [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond)
[15:38:29] <wikibugs>	 (03PS10) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673)
[15:42:06] <wikibugs>	 (03CR) 10Zabe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[15:43:42] <wikibugs>	 (03CR) 10Zabe: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/890/contint1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[15:45:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:45:48] <wikibugs>	 (03PS3) 10Cwhite: profile: adapt alertmanager-webhook-logger to ECS [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356)
[15:46:07] <wikibugs>	 (03CR) 10Cwhite: profile: adapt alertmanager-webhook-logger to ECS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite)
[15:46:27] <wikibugs>	 (03CR) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[15:47:32] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:50:10] <wikibugs>	 (03PS1) 10Hnowlan: maps: disable sync on maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/715754
[15:52:08] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [restbase/deploy@09156c2]: fix core Title redirect loop
[15:52:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:38] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) >>! In T263578#7264013, @Volans wrote: > - The `Admin::Hashuser` and `Admin::Hashgroup` seems to have tons of relations that I don't thin...
[15:53:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10Cmjohnson)
[15:53:55] <wikibugs>	 (03CR) 10Effie Mouzeli: "I suggest we first split this patch into 2, chart updates and helmfile.d updates, and we can review it again." [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis)
[15:54:09] <wikibugs>	 10SRE, 10ops-eqiad: Rack/power audit in eqiad c8/d5 - https://phabricator.wikimedia.org/T280977 (10Cmjohnson) 05Open→03Resolved  I am not sure if any of this is needed still but here is the info requeted.   There are currently 2 available network ports and 135power ports available in C8  1 available networ...
[15:54:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Cmjohnson)
[15:54:25] <wikibugs>	 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10sgrabarczuk)
[15:55:41] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] toolhub: Set pod requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/715604 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis)
[16:00:05] <jouncebot>	 jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T1600).
[16:00:28] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "Please keep in mind that the staging cluster generally has limited resources 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715531 (owner: 10DCausse)
[16:00:37] <wikibugs>	 (03Abandoned) 10Kosta Harlan: bullseye-sssd: Add openssh-client [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/715215 (https://phabricator.wikimedia.org/T258841) (owner: 10Kosta Harlan)
[16:04:57] <wikibugs>	 (03PS3) 10Dduvall: aptrepo: Add gitlab-runner repo mirror [puppet] - 10https://gerrit.wikimedia.org/r/715134 (https://phabricator.wikimedia.org/T287504)
[16:05:16] <wikibugs>	 (03CR) 10Michael DiPietro: "https://puppet-compiler.wmflabs.org/compiler1001/30934/" [puppet] - 10https://gerrit.wikimedia.org/r/714640 (owner: 10Michael DiPietro)
[16:05:58] <wikibugs>	 (03CR) 10Dduvall: [C: 03+1] "Looking for a merge if anyone has time. This is blocking my testing of the gitlab-runner profile." [puppet] - 10https://gerrit.wikimedia.org/r/715134 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall)
[16:07:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] aptrepo: Add gitlab-runner repo mirror [puppet] - 10https://gerrit.wikimedia.org/r/715134 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall)
[16:07:28] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] aptrepo: Add gitlab-runner repo mirror [puppet] - 10https://gerrit.wikimedia.org/r/715134 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall)
[16:07:56] <dduvall>	 jbond, effie: simultaneous thanks! ^ :)
[16:08:01] <jbond>	 effie: i think i just beat you :P
[16:08:06] <wikibugs>	 (03PS1) 10Volans: pylint: remove unnecessary disable comments [cookbooks] - 10https://gerrit.wikimedia.org/r/715756
[16:08:10] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [restbase/deploy@09156c2]: fix core Title redirect loop (duration: 16m 02s)
[16:08:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:36] <effie>	 jbond: I was trying to understand how I +2'ed something and it got merged
[16:08:43] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: Adjust memory limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/715531 (owner: 10DCausse)
[16:08:46] <effie>	 I almost had a heart attack :p
[16:09:29] <jbond>	 dduvall: fyi if looking for a merge asking in #wikimedia-sre will normally find someone
[16:09:32] <jbond>	 :)
[16:09:47] <jbond>	 also fyi puppet also run on the apt servers
[16:10:00] <dduvall>	 right on. i'm always trying to find better more polite ways to hound people for merges :)
[16:10:57] <jbond>	 dduvall: and feel free to ping me if you still have no luck and it's in the EU timezone.  failing everything else there is https://wikitech.wikimedia.org/wiki/Puppet_request_window
[16:11:03] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) I made a mistake by an order of magnitude, we have backed up approximately 2.5TB or half a million of files in less th...
[16:11:03] <jbond>	 and yes i beat effie :D
[16:11:18] <effie>	 haha
[16:11:21] <dduvall>	 haha
[16:11:27] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: Adjust memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/715531 (owner: 10DCausse)
[16:11:51] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "Comments only, self-merging" [cookbooks] - 10https://gerrit.wikimedia.org/r/715756 (owner: 10Volans)
[16:12:08] <wikibugs>	 (03PS13) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504)
[16:13:37] <wikibugs>	 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10sgrabarczuk) @Legoktm I'd like to check again because I may need to make a tweak in [[https://meta.wikimedia.org/wiki/Tech/Server_switch|t...
[16:14:28] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[16:14:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:55] <wikibugs>	 (03Merged) 10jenkins-bot: pylint: remove unnecessary disable comments [cookbooks] - 10https://gerrit.wikimedia.org/r/715756 (owner: 10Volans)
[16:17:08] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10lmata) Hi @Papaul is it possible to ask for Bullseye with this ticket? thanks!
[16:18:55] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[16:18:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:27] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10herron)
[16:23:08] <dduvall>	 jbond: hmm, i don't see a `gitlab-runner` component yet under https://apt.wikimedia.org/wikimedia/dists/buster-wikimedia/thirdparty/
[16:25:23] <jbond>	 dduvall: one sec i will need to run $something to do the initial sync
[16:25:38] <dduvall>	 ah, ok
[16:29:51] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] wcqs: add wcqs.discovery.wmnet dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/715570 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper)
[16:30:04] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] wcqs: create tls cert [puppet] - 10https://gerrit.wikimedia.org/r/715569 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper)
[16:34:47] <jbond>	 dduvall: also missed this ^^ (which i should have spotted in review)
[16:34:51] <wikibugs>	 (03PS1) 10Jbond: aptrepo: Add gitlab-runner repo mirror [puppet] - 10https://gerrit.wikimedia.org/r/715761 (https://phabricator.wikimedia.org/T287504)
[16:34:57] <jbond>	 ^^ even :)
[16:35:34] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] aptrepo: Add gitlab-runner repo mirror [puppet] - 10https://gerrit.wikimedia.org/r/715761 (https://phabricator.wikimedia.org/T287504) (owner: 10Jbond)
[16:36:48] <jbond>	 ryankemper: fyi also merging b/files/ssl/wcqs.discovery.wmnet.crt
[16:37:48] <dduvall>	 jbond: oooh, ok. thanks for the follow-up patch
[16:37:49] <ryankemper>	 jbond: much appreciated
[16:37:51] * ryankemper got distracted
[16:37:59] <jbond>	 :) no problem 
[16:39:33] <jbond>	 dduvall: https://apt.wikimedia.org/wikimedia/dists/buster-wikimedia/thirdparty/gitlab-runner/ is there now
[16:39:54] <dduvall>	 jbond: \o/ and `apt-cache showpkg gitlab-runner` shows it
[16:39:58] <dduvall>	 thanks!
[16:40:04] <jbond>	 great and no probs
[16:45:10] <wikibugs>	 (03PS1) 10Urbanecm: Enable link recommendation frontent in dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715763 (https://phabricator.wikimedia.org/T288420)
[16:49:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:51:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:53:39] <wikibugs>	 (03PS14) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504)
[17:00:05] <jouncebot>	 chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T1700).
[17:03:19] <wikibugs>	 (03PS1) 10Jbond: realm.pp: update to use structured facts [puppet] - 10https://gerrit.wikimedia.org/r/715766
[17:03:36] <wikibugs>	 (03PS15) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504)
[17:04:06] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30935/console" [puppet] - 10https://gerrit.wikimedia.org/r/715766 (owner: 10Jbond)
[17:04:19] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, to be tested in cloud too to be sure" [puppet] - 10https://gerrit.wikimedia.org/r/715766 (owner: 10Jbond)
[17:05:09] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30936/console" [puppet] - 10https://gerrit.wikimedia.org/r/715766 (owner: 10Jbond)
[17:05:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "noop in cloud and prod" [puppet] - 10https://gerrit.wikimedia.org/r/715766 (owner: 10Jbond)
[17:06:17] <wikibugs>	 (03PS16) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504)
[17:10:13] <wikibugs>	 (03CR) 10Wolfgang Kandek: [C: 03+1] "Approved, excellent for Arnold's progress in onboarding." [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[17:10:53] <wikibugs>	 (03PS17) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504)
[17:17:10] <wikibugs>	 (03PS18) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504)
[17:21:52] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "Intent LGTM -- implementation looks good too but I don't know this code well. :)" [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond)
[17:24:06] <wikibugs>	 (03PS19) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504)
[17:34:01] <wikibugs>	 (03PS20) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504)
[17:36:24] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:36:30] <icinga-wm>	 RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:38:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:44:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:55:38] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:55:46] <icinga-wm>	 PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:00:05] <jouncebot>	 Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T1800)
[18:01:28] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:01:34] <icinga-wm>	 RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:03:04] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10odimitrijevic) Approved! Apologies for the delay.
[18:03:32] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10odimitrijevic) Approved!
[18:04:07] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (10odimitrijevic) Approved.
[18:05:46] <XioNoX>	 !log re-pool eqsin-codfw link
[18:05:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:09:16] <wikibugs>	 (03CR) 10ODimitrijevic: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715259 (owner: 10Jcrespo)
[18:09:48] <wikibugs>	 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Legoktm) >>! In T287546#7322191, @sgrabarczuk wrote: > @Legoktm I'd like to check again because I may need to make a tweak in [[https://me...
[18:10:15] <wikibugs>	 (03CR) 10ODimitrijevic: [C: 03+1] "Btw, agree to have both of us as approvers. This was not on my radar and now that it is I will be paying attention to the timely approvals" [puppet] - 10https://gerrit.wikimedia.org/r/715259 (owner: 10Jcrespo)
[18:15:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:18:03] <wikibugs>	 (03PS1) 10Ladsgroup: Absent wikidata alerts [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080)
[18:19:10] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:19:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Absent wikidata alerts [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) (owner: 10Ladsgroup)
[18:21:06] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:24:43] <twentyafterfour>	 !log ran `scap prep 1.37.0-wmf.21` and `scap apply-patches --train 1.37.0-wmf.21` refs T281162
[18:24:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:48] <stashbot>	 T281162: 1.37.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T281162
[18:25:54] <wikibugs>	 (03PS2) 10Ladsgroup: Absent wikidata alerts [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080)
[18:27:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Absent wikidata alerts [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) (owner: 10Ladsgroup)
[18:28:41] <wikibugs>	 (03PS3) 10Ladsgroup: Absent wikidata alerts [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080)
[18:28:55] <wikibugs>	 (03PS1) 1020after4: testwikis wikis to 1.37.0-wmf.21  refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715773
[18:28:57] <wikibugs>	 (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.37.0-wmf.21  refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715773 (owner: 1020after4)
[18:30:02] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.21  refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715773 (owner: 1020after4)
[18:30:05] <logmsgbot>	 !log twentyafterfour@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.21  refs T281161
[18:30:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:10] <stashbot>	 T281161: 1.37.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T281161
[18:34:30] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] cloud osmdb: set num_threads in the sync job [puppet] - 10https://gerrit.wikimedia.org/r/715623 (https://phabricator.wikimedia.org/T285668) (owner: 10Bstorm)
[18:34:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:35:18] <wikibugs>	 (03PS2) 10Bstorm: cloud osmdb: don't use proxy for cloud [puppet] - 10https://gerrit.wikimedia.org/r/715624 (https://phabricator.wikimedia.org/T285668)
[18:35:20] <icinga-wm>	 PROBLEM - Host cp5011 is DOWN: PING CRITICAL - Packet loss = 100%
[18:35:26] <icinga-wm>	 PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100%
[18:35:34] <icinga-wm>	 PROBLEM - Host cp5003 is DOWN: PING CRITICAL - Packet loss = 100%
[18:35:34] <icinga-wm>	 PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100%
[18:35:52] <dancy>	 hmm
[18:36:07] <twentyafterfour>	 wth
[18:36:20] <icinga-wm>	 PROBLEM - Host cp5014 is DOWN: PING CRITICAL - Packet loss = 100%
[18:36:29] <dancy>	 I don't know what those hosts do but that looks bad. 
[18:36:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:36:49] <twentyafterfour>	 I'm going to guess this isn't related to train ... given that I haven't deployed to anything yet 
[18:36:59] <dancy>	 nod.
[18:37:11] <twentyafterfour>	 it's just syncing masters right now 
[18:37:36] <icinga-wm>	 RECOVERY - Host cp5014 is UP: PING WARNING - Packet loss = 71%, RTA = 222.91 ms
[18:37:46] <dancy>	 how nice
[18:37:48] <icinga-wm>	 RECOVERY - Host cp5011 is UP: PING WARNING - Packet loss = 50%, RTA = 291.12 ms
[18:37:48] <icinga-wm>	 RECOVERY - Host cp5006 is UP: PING WARNING - Packet loss = 66%, RTA = 292.37 ms
[18:37:48] <icinga-wm>	 RECOVERY - Host doh5001 is UP: PING WARNING - Packet loss = 75%, RTA = 223.54 ms
[18:37:52] <icinga-wm>	 RECOVERY - Host cp5003 is UP: PING OK - Packet loss = 0%, RTA = 236.68 ms
[18:37:55] <dancy>	 nothing to see here folks
[18:38:32] <wikibugs>	 (03PS1) 10Ssingh: durum: add test information to the results [puppet] - 10https://gerrit.wikimedia.org/r/715776
[18:39:55] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30937/console" [puppet] - 10https://gerrit.wikimedia.org/r/715776 (owner: 10Ssingh)
[18:40:46] <icinga-wm>	 PROBLEM - Host cp5014 is DOWN: PING CRITICAL - Packet loss = 100%
[18:40:58] <icinga-wm>	 PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100%
[18:40:58] <icinga-wm>	 PROBLEM - Host cp5003 is DOWN: PING CRITICAL - Packet loss = 100%
[18:41:07] <wikibugs>	 (03PS6) 10Legoktm: backup: Simplify Mailman backups [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[18:41:16] <icinga-wm>	 RECOVERY - Host cp5014 is UP: PING WARNING - Packet loss = 75%, RTA = 222.74 ms
[18:41:24] <icinga-wm>	 PROBLEM - Host cp5011 is DOWN: PING CRITICAL - Packet loss = 100%
[18:41:24] <icinga-wm>	 PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100%
[18:41:32] <icinga-wm>	 RECOVERY - Host cp5006 is UP: PING WARNING - Packet loss = 90%, RTA = 293.36 ms
[18:41:32] <icinga-wm>	 RECOVERY - Host cp5003 is UP: PING WARNING - Packet loss = 71%, RTA = 236.60 ms
[18:41:34] <icinga-wm>	 RECOVERY - Host cp5011 is UP: PING OK - Packet loss = 0%, RTA = 291.14 ms
[18:41:34] <icinga-wm>	 RECOVERY - Host doh5001 is UP: PING OK - Packet loss = 0%, RTA = 222.97 ms
[18:41:52] <wikibugs>	 (03PS7) 10Legoktm: backup: Simplify Mailman backups [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[18:42:37] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: add test information to the results [puppet] - 10https://gerrit.wikimedia.org/r/715776 (owner: 10Ssingh)
[18:44:16] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30939/console" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[18:44:18] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/715777
[18:47:11] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1] "Do we need to ensure => absent first, or can it just be removed?" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[18:47:47] <wikibugs>	 (03PS3) 10Legoktm: mailman: Drop lists3 role [puppet] - 10https://gerrit.wikimedia.org/r/698306 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[18:54:21] <wikibugs>	 (03CR) 10Legoktm: rsync::quickdatacopy: Allow specifying a custom interval for auto_sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm)
[18:54:50] <wikibugs>	 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10sbassett) >>! In T288844#7321649, @mepps wrote: > It sounds like @sbassett is moving forward with looking into this.  Er, whoops, I'm a...
[18:56:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:56:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:59:22] <dancy>	 Tons of production errors right now.
[19:00:00] * dancy checks the source host
[19:00:04] <jouncebot>	 twentyafterfour and dancy: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T1900).
[19:00:51] <dancy>	 mw2296.   thwikisource.
[19:01:06] <urbanecm>	 only that wiki/host dancy ?
[19:01:26] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:02:36] <urbanecm>	 answering myself: looks so
[19:03:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:03:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:05:58] <logmsgbot>	 !log twentyafterfour@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.21  refs T281161 (duration: 35m 53s)
[19:06:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:03] <stashbot>	 T281161: 1.37.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T281161
[19:07:21] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] update quarry systemd and branch [puppet] - 10https://gerrit.wikimedia.org/r/714640 (owner: 10Michael DiPietro)
[19:07:29] <twentyafterfour>	 that's odd. 
[19:07:39] <urbanecm>	 now moved to mw2318
[19:07:57] <urbanecm>	 and mw2251
[19:08:19] <urbanecm>	 those three hosts only
[19:08:55] <twentyafterfour>	 50k errors is crazy for thwikisource 
[19:09:41] <twentyafterfour>	 and it's on wmf.20 not 21 
[19:09:51] <urbanecm>	 and that wiki normally only has 6k views per day
[19:10:38] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:10:38] <urbanecm>	 twentyafterfour: were there any warnings in scap about any of the three hosts this happened on?
[19:10:57] <twentyafterfour>	 no nothing 
[19:11:17] <urbanecm>	 hmm...
[19:11:57] <twentyafterfour>	 and the deployment was just for wmf.21 (though there could have been unsynced change that inadvertently got synced with the train?)
[19:12:09] <twentyafterfour>	 since the testwiki deployment does sync-world 
[19:12:36] <urbanecm>	 that alone wouldn't explain a) why it happens only on three servers b) why it happens on such low-traffic wiki only
[19:12:54] <twentyafterfour>	 yeah that part I don't know
[19:13:23] <twentyafterfour>	 /srv/mediawiki/php-1.37.0-wmf.20/extensions/Scribunto/includes/common/ApiScribuntoConsole.php(102): Scribunto_LuaEngine->runConsole(array)
[19:13:41] <urbanecm>	 someone sending a lot of crazy input into console?
[19:13:49] <twentyafterfour>	 yeah ... 
[19:13:52] <urbanecm>	 let me check
[19:14:53] <twentyafterfour>	 Scribunto_LuaSandboxInterpreter->callFunction(LuaSandboxFunction, LuaSandboxFunction, LuaSandboxFunction)
[19:15:23] <twentyafterfour>	 it's not happening anymore
[19:15:43] <urbanecm>	 i'm still personally curious why it happened at all though :)
[19:15:59] <twentyafterfour>	 50,004  errors 
[19:16:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:16:11] <twentyafterfour>	 I think someone created an infinite recursion or something 
[19:16:27] <twentyafterfour>	 or just a loop over 50k items 
[19:17:01] <urbanecm>	 yeah
[19:17:37] <twentyafterfour>	 I'm not even sure that this warning should be showing up in the production errors logstash dashboard...
[19:17:52] <urbanecm>	 well it's a PHP warning, so...yes :)
[19:18:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:18:01] <legoktm[m]>	 What's the actual exception/warning?
[19:18:08] <twentyafterfour>	 PHP Warning: mb_substr() expects parameter 2 to be integer, float given
[19:18:13] <urbanecm>	 [c404b613-22e3-443e-b4ec-24a4082e2137] /w/api.php   PHP Warning: mb_substr() expects parameter 2 to be integer, float given
[19:18:14] <twentyafterfour>	 	
[19:18:17] <twentyafterfour>	 from /srv/mediawiki/php-1.37.0-wmf.20/extensions/Scribunto/includes/engines/LuaCommon/UstringLibrary.php(319)
[19:18:20] <urbanecm>	 50k times at thwikisource
[19:18:28] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1088.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:18:42] <legoktm[m]>	 Probably a bug in Scribunto's parameter validation then 
[19:18:53] <urbanecm>	 likely
[19:19:24] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] cloud osmdb: don't use proxy for cloud [puppet] - 10https://gerrit.wikimedia.org/r/715624 (https://phabricator.wikimedia.org/T285668) (owner: 10Bstorm)
[19:20:25] <twentyafterfour>	 https://gerrit.wikimedia.org/g/mediawiki/extensions/Scribunto/+/a8ef8791cdd7e19a47243e27e9236d7777a01717/includes/engines/LuaCommon/UstringLibrary.php#319
[19:20:47] <twentyafterfour>	 at line 304 it checks for 'number' not 'int' 
[19:20:50] <brennen>	 !log gitlab1001: brief downtime for testing reconfiguration of cas3.session_duration
[19:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:09] <urbanecm>	 twentyafterfour: i think that's correct. https://gerrit.wikimedia.org/g/mediawiki/extensions/Scribunto/+/a8ef8791cdd7e19a47243e27e9236d7777a01717/includes/engines/LuaCommon/LibraryBase.php#141 compares that with lua type, not php type. https://gerrit.wikimedia.org/g/mediawiki/extensions/Scribunto/+/a8ef8791cdd7e19a47243e27e9236d7777a01717/includes/engines/LuaCommon/LibraryBase.php#106 maps int to number.
[19:22:38] <twentyafterfour>	 hmmm
[19:23:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:24:06] <wikibugs>	 10SRE, 10Gerrit, 10GitLab, 10Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (10brennen) Confirmed working for a couple of us, thanks again.
[19:27:23] <twentyafterfour>	 I'd say just casting to int wouldn't be the worst idea, when the parameter is_numeric 
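Editorial sketch of the idea discussed above, not the Scribunto code: a Lua-style type check for 'number' accepts fractional values as well as whole ones, so a value that passes the check can still be rejected by a function that insists on an integer. The Python analogue below casts to int after the check, as suggested in the channel. All names here (`lua_type`, `check_number`, `to_int_index`, `substring_from`) are hypothetical and exist only for illustration.

```python
def lua_type(value):
    """Hypothetical helper: name the Lua-style type of a Python value."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "other"


def check_number(value):
    """Lua-style check: accepts any number, whole or fractional."""
    if lua_type(value) != "number":
        raise TypeError(f"number expected, got {lua_type(value)}")
    return value


def to_int_index(value):
    """Cast a checked number to int before it is used as an index."""
    return int(check_number(value))


def substring_from(text, start):
    """Slice from a 0-based start index that may arrive as a float."""
    return text[to_int_index(start):]
```

In this sketch `substring_from("abcdef", 2.0)` works even though the index is a float, whereas slicing with the raw float would raise a TypeError. Non-finite values (infinity, NaN) are not handled here and would need a separate check before the cast.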
[19:30:50] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:34:52] <wikibugs>	 (03PS1) 10Herron: thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142)
[19:35:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron)
[19:37:14] <wikibugs>	 (03PS2) 10Herron: thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142)
[19:39:47] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10Papaul) @lmata will do
[19:42:48] <wikibugs>	 (03PS2) 10Herron: thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615)
[19:44:55] <wikibugs>	 (03PS3) 10Herron: thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615)
[19:45:10] <icinga-wm>	 PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-htriedman-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:47:00] <icinga-wm>	 RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:48:07] <wikibugs>	 (03CR) 10Herron: thanos: add recording rules for etcd error slo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron)
[19:49:55] <wikibugs>	 (03PS4) 10Herron: thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615)
[19:51:57] <wikibugs>	 (03CR) 10Herron: thanos: add recording rules for etcd error slo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron)
[19:56:03] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] P::toolforge::redis_sentinel: Block REPLICAOF too [puppet] - 10https://gerrit.wikimedia.org/r/715703 (owner: 10Majavah)
[19:57:01] <twentyafterfour>	 I guess it's probably good to deploy to group 0?  I don't see anything terrible happening
[19:57:30] <dancy>	 looks ok
[19:57:46] <wikibugs>	 (03PS1) 1020after4: group0 wikis to 1.37.0-wmf.21  refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715785
[19:57:48] <wikibugs>	 (03CR) 1020after4: [C: 03+2] group0 wikis to 1.37.0-wmf.21  refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715785 (owner: 1020after4)
[19:58:51] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.21  refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715785 (owner: 1020after4)
[20:00:30] <logmsgbot>	 !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.21  refs T281161
[20:00:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:36] <stashbot>	 T281161: 1.37.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T281161
[20:11:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:11:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:54] <wikibugs>	 (03CR) 10Legoktm: rsync::quickdatacopy: Allow specifying a custom interval for auto_sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm)
[20:16:01] <wikibugs>	 (03PS2) 10Legoktm: rsync::quickdatacopy: Allow having multiple destination hosts [puppet] - 10https://gerrit.wikimedia.org/r/715636
[20:16:03] <wikibugs>	 (03PS3) 10Legoktm: rsync::quickdatacopy: Allow specifying a custom interval for auto_sync [puppet] - 10https://gerrit.wikimedia.org/r/715637
[20:16:05] <wikibugs>	 (03PS3) 10Legoktm: [WIP] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857)
[20:17:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Jclark-ctr) cloudcephosd1021 C8  u31. port 0/1  cableid   11034/11032     cloudsw2-c8-eqiad cloudcephosd1022  C8 u32. port 2/3   cableid    11033/11031    cloudsw...
[20:18:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Jclark-ctr)
[20:18:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson
[20:19:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:19:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:00] <wikibugs>	 (03PS4) 10Legoktm: [WIP] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857)
[20:38:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:40:02] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:47:03] <wikibugs>	 (03CR) 10Legoktm: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/30943/" [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: 10Legoktm)
[20:48:14] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30944/console" [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm)
[20:51:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:59:02] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:00:26] <wikibugs>	 (03CR) 10Dduvall: [C: 03+1] "Cherry picked on gitlab-runners-puppetmaster-01.gitlab-runners.eqiad1.wikimedia.cloud and successfully tested on runner-1002." [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall)
[21:07:06] <wikibugs>	 (CR) BryanDavis: toolhub: Add mcrouter sidecar for memcached access (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) (owner: BryanDavis)
[21:19:36] <wikibugs>	 (03PS4) 10BryanDavis: toolhub: Set pod requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/715604 (https://phabricator.wikimedia.org/T280881)
[21:19:38] <wikibugs>	 (03PS3) 10BryanDavis: toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881)
[21:19:40] <wikibugs>	 (03PS6) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881)
[21:20:46] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 0.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:21:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] toolhub: Add helmfile.d [deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: BryanDavis)
[21:29:29] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/715777 (owner: 10PipelineBot)
[21:30:15] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] "PS4 is trivial rebasing changes of PS3 which got a +1 from Effie." [deployment-charts] - 10https://gerrit.wikimedia.org/r/715604 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis)
[21:32:49] <wikibugs>	 (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/715777 (owner: 10PipelineBot)
[21:33:05] <wikibugs>	 (03Merged) 10jenkins-bot: toolhub: Set pod requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/715604 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis)
[21:34:26] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[21:34:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:51] <logmsgbot>	 !log dduvall@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[21:38:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:39] <logmsgbot>	 !log dduvall@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[21:41:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm)
[21:48:43] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "One comment inline, otherwise this looks good!  Thanks!" [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/715457 (https://phabricator.wikimedia.org/T210137) (owner: 10Filippo Giunchedi)
[21:52:24] <wikibugs>	 (CR) Cwhite: [C: +1] thanos: add thanos::recording_rule (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: Herron)
[21:55:22] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "Excellent commit message.  It clearly outlined the problem and at what stage of resolution this is." [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond)
[22:04:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:06:36] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:11:44] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] mailman: Drop lists3 role [puppet] - 10https://gerrit.wikimedia.org/r/698306 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[22:18:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:20:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:29:48] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Fix wgDiscussionTools_sourcemodetoolbar settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715804
[22:35:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:39:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:42:34] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@a0f039b]: Regular analytics weekly train v0.1.17 [analytics/refinery@a0f039b]
[22:42:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:45:37] <wikibugs>	 (03PS4) 10BryanDavis: toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881)
[22:45:39] <wikibugs>	 (03PS7) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881)
[22:47:06] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:47:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] toolhub: Add helmfile.d [deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: BryanDavis)
[22:52:50] <icinga-wm>	 PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[22:53:02] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[22:53:24] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[22:54:18] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[22:58:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T2300).
[23:00:05] <jouncebot>	 dpifke and MatmaRex: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:06] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:00:14] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@a0f039b]: Regular analytics weekly train v0.1.17 [analytics/refinery@a0f039b] (duration: 17m 39s)
[23:00:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:32] <MatmaRex>	 hiii
[23:00:36] <icinga-wm>	 RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[23:00:37] <urbanecm>	 Hi MatmaRex 
[23:00:38] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:00:43] <urbanecm>	 And hi dpifke 
[23:00:43] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@a0f039b] (thin): Regular analytics weekly train THIN v0.1.17 [analytics/refinery@a0f039b]
[23:00:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:48] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[23:00:50] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@a0f039b] (thin): Regular analytics weekly train THIN v0.1.17 [analytics/refinery@a0f039b] (duration: 00m 07s)
[23:00:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:10] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:01:12] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@a0f039b] (hadoop-test): Regular analytics weekly train TEST v0.1.17 [analytics/refinery@a0f039b]
[23:01:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:21] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Fix wgDiscussionTools_sourcemodetoolbar settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715804 (owner: 10Bartosz Dziewoński)
[23:02:07] <wikibugs>	 (03Merged) 10jenkins-bot: Fix wgDiscussionTools_sourcemodetoolbar settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715804 (owner: 10Bartosz Dziewoński)
[23:03:13] <urbanecm>	 MatmaRex: available at mwdebug2001, please review
[23:03:43] <MatmaRex>	 looking
[23:04:57] <MatmaRex>	 yeah, seems as expected
[23:05:07] <urbanecm>	 great, syncing
[23:05:29] <MatmaRex>	 i got distracted by the fact that ko.wikipedia apparently has non-monospace font in the editor
[23:05:58] <urbanecm>	 :)
[23:06:12] <wikibugs>	 (03PS2) 10Urbanecm: Enable link recommendation frontent in dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715763 (https://phabricator.wikimedia.org/T288420)
[23:06:16] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable link recommendation frontent in dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715763 (https://phabricator.wikimedia.org/T288420) (owner: 10Urbanecm)
[23:07:10] <wikibugs>	 (03Merged) 10jenkins-bot: Enable link recommendation frontent in dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715763 (https://phabricator.wikimedia.org/T288420) (owner: 10Urbanecm)
[23:07:17] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8997ae5d0b998839853aed2b246f5c88fe9d83eb: Fix wgDiscussionTools_sourcemodetoolbar settings (duration: 01m 22s)
[23:07:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:24] <urbanecm>	 MatmaRex: should be live. Enjoy!
[23:07:34] <MatmaRex>	 thanks
[23:08:00] <urbanecm>	 any time.
[23:08:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:08:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:59] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1437d99c1884c0695f02b81b724ec82a2bd3362e: Enable link recommendation frontent in dewiki and nlwiki (T288420, T285254) (duration: 01m 06s)
[23:09:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:03] <stashbot>	 T285254: Deploy Growth features on Dutch Wikipedia - https://phabricator.wikimedia.org/T285254
[23:09:03] <stashbot>	 T288420: Deploy Growth features on German Wikipedia - https://phabricator.wikimedia.org/T288420
[23:09:08] <urbanecm>	 dpifke: hi, do you want to self-deploy?
[23:09:12] <urbanecm>	 (if so, go ahead)
[23:14:30] <dpifke>	 urbanecm: Yes, doing now.  (Sorry, got pulled away for a bit.)
[23:14:55] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@a0f039b] (hadoop-test): Regular analytics weekly train TEST v0.1.17 [analytics/refinery@a0f039b] (duration: 13m 42s)
[23:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:15:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:14] <urbanecm>	 dpifke: np. So, I'm disconnecting from prod and leaving you to do your stuff :-)
[23:15:32] <wikibugs>	 (03CR) 10Dave Pifke: [C: 03+2] profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711580 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke)
[23:15:49] <mforns>	 !log failed deployment of refinery (v0.1.17) to an-test-coord1001.eqiad.wmnet (scap error)
[23:15:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:19] <wikibugs>	 (03Merged) 10jenkins-bot: profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711580 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke)
[23:17:44] <dpifke>	 Going to test on mwdebug2001 first.
[23:22:40] <dpifke>	 Looks OK, pushing further.
[23:23:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:23:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:45] <logmsgbot>	 !log dpifke@deploy1002 scap failed: average error rate on 3/6 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/83629bcb5560d11e61d3085c89dd9ed6 for details)
[23:25:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:11] <dpifke>	 Looking in Logstash...
[23:28:04] <wikibugs>	 (CR) BryanDavis: toolhub: Add helmfile.d (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: BryanDavis)
[23:29:13] <dpifke>	 Not sure why that looked good on mwdebug, it's broken.  Reverting.
[23:30:01] <wikibugs>	 (03PS1) 10Dave Pifke: Revert "profiler: use seperate pipeline inside k8s pods" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715807
[23:30:23] <wikibugs>	 (03CR) 10Dave Pifke: [C: 03+2] Revert "profiler: use seperate pipeline inside k8s pods" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715807 (owner: 10Dave Pifke)
[23:31:07] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "profiler: use seperate pipeline inside k8s pods" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715807 (owner: 10Dave Pifke)
[23:31:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:02] <logmsgbot>	 !log dpifke@deploy1002 Synchronized wmf-config/profiler.php: Revert excimer-k8s pipelines T288165 (duration: 01m 14s)
[23:33:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:06] <stashbot>	 T288165: Create separate ArcLamp pipeline for k8s-mwdebug - https://phabricator.wikimedia.org/T288165
[23:33:48] <dpifke>	 OK, I'm done for today.   Will debug the patch and try again tomorrow.
[23:35:13] <wikibugs>	 (CR) BryanDavis: toolhub: Add helmfile.d (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: BryanDavis)
[23:37:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:38:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:41:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:41:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log