[00:02:17] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:07] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:08:53] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:11:39] SRE, serviceops, Datacenter-Switchover: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (Legoktm) a:Legoktm
[00:12:47] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:15:55] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:20:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:22:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:25:29] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:03] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:43:33] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:05] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:47:01] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:47:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:52:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:56:46] (PS1) Legoktm: rsync::quickdatacopy: Allow having multiple destination hosts [puppet] - https://gerrit.wikimedia.org/r/715636
[00:56:48] (PS1) Legoktm: rsync::quickdatacopy: Allow specifying a custom interval for auto_sync [puppet] - https://gerrit.wikimedia.org/r/715637
[00:56:50] (PS1) Legoktm: [WIP] deployment: Use rsync::quickdatacopy, enable encryption [puppet] -
https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857)
[00:58:04] (CR) jerkins-bot: [V: -1] rsync::quickdatacopy: Allow specifying a custom interval for auto_sync [puppet] - https://gerrit.wikimedia.org/r/715637 (owner: Legoktm)
[01:00:10] (PS2) Legoktm: rsync::quickdatacopy: Allow specifying a custom interval for auto_sync [puppet] - https://gerrit.wikimedia.org/r/715637
[01:00:12] (PS2) Legoktm: [WIP] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857)
[01:00:38] rsync::quickdatacopy is indented with 6 spaces, and it's totally throwing my editor off
[01:01:57] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:06] (CR) Legoktm: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30910/console" [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[01:18:08] (CR) Legoktm: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30911/console" [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[01:18:57] (CR) Legoktm: "The one thing I'm not sure of is where I'm supposed to set the rsync::server::wrap_with_stunnel hiera." [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[01:25:06] (CR) Legoktm: [C: -1] "This doesn't work how I want for deployment::rsync because in that module we have the IPs of the hosts, not the actual fqdns.
We could eit" [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[01:25:25] (CR) Legoktm: [C: -1] "Fails PCC because of the comment I just left on Change-Id: I3964a58b736892f5f7d978606d7b80cb5b3e0ddf" [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[01:25:55] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:13] SRE, serviceops, Datacenter-Switchover, Patch-For-Review: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (Legoktm) I'm guessing no one has done this until now because deployment::rsync was using hand-rolled rsync + timer rather than quickdatacopy. I gave it...
[01:38:22] SRE, serviceops, Datacenter-Switchover: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (Legoktm) {T289857} has some notes on how to enable stunnel for this. However the #mw-on-k8s image building process also performs an rsync against the releases host, so it might also n...
[01:59:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:00:04] Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T0200)
[02:01:25] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:06:48] (PS1) TrainBranchBot: Branch commit for wmf/1.37.0-wmf.21 [core] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/715643
[02:06:50] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.37.0-wmf.21 [core] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/715643 (owner: TrainBranchBot)
[02:08:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:33] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:00] (Merged) jenkins-bot: Branch commit for wmf/1.37.0-wmf.21 [core] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/715643 (owner: TrainBranchBot)
[02:31:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:15] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:25] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:17] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:11] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:12:31] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:15:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:53] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:16:17] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:25:41] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:05] (PS1)
Marostegui: Revert "db2110: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/715523
[04:26:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:51:31] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:53:05] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:02:21] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:02:41] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:02:47] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:17:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:26:31] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:51] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:51] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:48:45] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:49:05] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:53:19] (CR) Marostegui: [C: +2] Revert "db2110: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/715523 (owner: Marostegui)
[05:55:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 10%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17109 and previous config saved to /var/cache/conftool/dbconfig/20210831-055546-root.json
[05:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:51] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[06:01:53] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:06:27] !log Rename flaggedrevs_stats2 and flaggedrevs_stats on dewiki codfw T289050
[06:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:31] T289050: MyISAM flaggedrevs_stats tables on several sections - https://phabricator.wikimedia.org/T289050
[06:07:27] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:07:47] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:10:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 25%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17110 and previous config saved to /var/cache/conftool/dbconfig/20210831-061049-root.json
[06:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:54] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[06:19:27] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:25:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 50%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17111 and previous config saved to /var/cache/conftool/dbconfig/20210831-062553-root.json
[06:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:59] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[06:29:15] (CR) Volans: [C:
+2] quotereviewer: add support for portal quotes [software] - https://gerrit.wikimedia.org/r/715025 (https://phabricator.wikimedia.org/T288354) (owner: Volans)
[06:29:46] (Merged) jenkins-bot: quotereviewer: add support for portal quotes [software] - https://gerrit.wikimedia.org/r/715025 (https://phabricator.wikimedia.org/T288354) (owner: Volans)
[06:29:51] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:30:39] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 610 ge 480 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[06:34:47] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:05] PROBLEM - Host cp5003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:38:33] PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100%
[06:38:37] PROBLEM - Host cp5014 is DOWN: PING CRITICAL - Packet loss = 100%
[06:38:51] PROBLEM - Host cp5011 is DOWN: PING CRITICAL - Packet loss = 100%
[06:39:09] PROBLEM - Host bast5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:40:15] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100%
[06:40:15] PROBLEM - Host cp5009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:40:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 75%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17112 and previous config saved to /var/cache/conftool/dbconfig/20210831-064056-root.json
[06:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:02]
T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[06:41:47] PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp5015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/HTTPS
[06:42:03] PROBLEM - Host cp5016.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:42:07] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:42:45] RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp5015 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 511519 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2021-11-16 23:59:59 +0000 (expires in 77 days) https://wikitech.wikimedia.org/wiki/HTTPS
[06:43:03] PROBLEM - Host lvs5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:45:38] rack down in eqsin?
[06:45:46] https://netbox.wikimedia.org/dcim/racks/78/
[06:46:33] weird I can ssh to cp5003 and ping other cp nodes
[06:47:01] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:47] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:57] elukey: maybe some network link down causing only partial unavailability?
[06:48:30] I am checking what's wrong
[06:48:33] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:49:16] from cp5011 I can ping icinga.wikimedia.org only via v6
[06:50:18] XioNoX, topranks - around?
[06:50:19] (PS2) Majavah: toolforge: remove portgrabber [puppet] - https://gerrit.wikimedia.org/r/714187
[06:50:43] yo
[06:50:47] (CR) Majavah: toolforge: remove portgrabber (2 comments) [puppet] - https://gerrit.wikimedia.org/r/714187 (owner: Majavah)
[06:50:50] (CR) jerkins-bot: [V: -1] toolforge: remove portgrabber [puppet] - https://gerrit.wikimedia.org/r/714187 (owner: Majavah)
[06:51:27] (PS3) Majavah: toolforge: remove portgrabber [puppet] - https://gerrit.wikimedia.org/r/714187
[06:51:46] XioNoX: hello! I am not sure what's happening, but it seems that icinga fails to reach (via ipv4 afaics) a rack in eqsin
[06:52:07] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:52:35] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:52:55] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:53:06] looking
[06:53:28] XioNoX: better - I wasn't able to ping from one of the affected cp nodes to icinga.wikimedia.org via v4, but I just tried from alert2001 and both work
[06:53:34] (v4 and v6)
[06:53:43] I don't see Varnish traffic issues and I can ssh to nodes
[06:54:03] looks like the eqsin-codfw link flapped
[06:55:51] but I thought we had the alternative path via ulsfo, I expected some recovery
[06:56:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 100%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17113 and previous config saved to /var/cache/conftool/dbconfig/20210831-065600-root.json
[06:56:04] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:06] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[06:56:07] yeah we do
[06:56:29] trying to force a recheck in icinga
[06:57:37] elukey: IPv4 still doesn't fully go through
[06:58:03] I'm going to drain the telia link
[06:58:08] ack thanks
[06:58:34] I just noticed from cp5003 that I can reach alert2001 but not 1001 via v4
[06:58:58] and traceroute says cr1-codfw
[06:59:16] v6 goes through ulsfo, ok it makes sense
[07:00:51] yeah it's silently dropping traffic but not dropping BFD/OSPF...
[07:00:55] already happened in the past
[07:00:58] lovely
[07:01:29] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:01:45] !log drain eqsin-codfw link
[07:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:57] RECOVERY - Host doh5001 is UP: PING OK - Packet loss = 0%, RTA = 232.86 ms
[07:01:57] RECOVERY - Host cp5003 is UP: PING OK - Packet loss = 0%, RTA = 232.39 ms
[07:01:57] RECOVERY - Host cp5011 is UP: PING WARNING - Packet loss = 80%, RTA = 319.29 ms
[07:02:01] RECOVERY - Host cp5014 is UP: PING OK - Packet loss = 0%, RTA = 232.48 ms
[07:02:03] RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 302.19 ms
[07:02:33] goooood
[07:02:41] SRE, Analytics, Analytics-Kanban, Prod-Kubernetes, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (JMeybohm) @Ottomata that looks unrelated to your change (but related to yours @Jelto ). We will take a look!
[07:03:04] thanks XioNoX and majavah
[07:03:12] elukey: thank you
[07:03:16] I'll follow up with Telia
[07:03:27] RECOVERY - Host bast5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 245.95 ms
[07:03:27] RECOVERY - Host cp5009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 246.40 ms
[07:03:27] RECOVERY - Host cp5016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 251.24 ms
[07:03:27] RECOVERY - Host lvs5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 249.37 ms
[07:04:56] (CR) Majavah: [C: +2] "I'm going to merge this but not do a release until we either start building a grid on a newer Debian release or need to make one for other" [software/tools-manifest] - https://gerrit.wikimedia.org/r/713663 (https://phabricator.wikimedia.org/T278748) (owner: Majavah)
[07:05:26] (Merged) jenkins-bot: Do not compare OS versions [software/tools-manifest] - https://gerrit.wikimedia.org/r/713663 (https://phabricator.wikimedia.org/T278748) (owner: Majavah)
[07:06:15] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:07:17] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:56] (PS1) DCausse: query service: Fix loading of DCATAP file [puppet] - https://gerrit.wikimedia.org/r/715696 (https://phabricator.wikimedia.org/T289517)
[07:12:19] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:31] (PS1) JMeybohm: kube_env: Error out of user has no read permission to kubeconfig [puppet] -
https://gerrit.wikimedia.org/r/715698
[07:12:46] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org
[07:15:51] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:17:46] (Traffic on tunnel link) firing: (2) Traffic on tunnel link - https://alerts.wikimedia.org
[07:22:43] elukey, XioNoX: thanks! I see a 5xx blip in eqsin between 6:38 and 6:45, nothing else
[07:23:19] https://w.wiki/3zMC
[07:26:37] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:59] (CR) Ema: [V: +2 C: +2] Add Varnish SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036) (owner: Ema)
[07:32:46] (Traffic on tunnel link) firing: (2) Traffic on tunnel link - https://alerts.wikimedia.org
[07:32:59] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:33:45] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:46] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org
[07:39:22] (CR) Alexandros Kosiaris: [V: +2 C: +1] wcqs: add wcqs.discovery.wmnet dummy key [labs/private] - https://gerrit.wikimedia.org/r/715570 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper)
[07:40:19] (CR) Alexandros Kosiaris: [C: +1] wcqs: create tls cert [puppet] - https://gerrit.wikimedia.org/r/715569
(https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper)
[07:44:27] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:44:43] !log Optimize ruwiki.flaggedtemplates T290057
[07:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:48] T290057: Optimize flaggedtemplates tables in production. - https://phabricator.wikimedia.org/T290057
[07:45:13] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:47:32] SRE, Traffic, Patch-For-Review, SRE Observability (FY2021/2022-Q1): Use Grizzly for Varnish SLO Grafana dashboard - https://phabricator.wikimedia.org/T289036 (ema) @herron: I've merged the patch, forced a puppet run on grafana1002.eqiad.wmnet, and followed the instructions at https://wikitech.wik...
[07:48:18] (PS1) Majavah: P::toolforge::apt_pinning: bullseye support [puppet] - https://gerrit.wikimedia.org/r/715700
[07:52:04] (PS1) Majavah: toolforge: drop legacy webservice endpoints on bullseye [puppet] - https://gerrit.wikimedia.org/r/715701
[07:56:44] (CR) Kosta Harlan: bullseye-sssd: Add openssh-client (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/715215 (https://phabricator.wikimedia.org/T258841) (owner: Kosta Harlan)
[08:01:14] SRE, DNS, Traffic: More DNS entries for WikiLearn servers - https://phabricator.wikimedia.org/T290025 (fgiunchedi) p:Triage→Medium
[08:02:47] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:46] (CR) Filippo Giunchedi: [C: +2] admin: Add bgwiki (Bethany) to the list of privileged ldap only users [puppet] - https://gerrit.wikimedia.org/r/715443 (https://phabricator.wikimedia.org/T289892) (owner: Jcrespo)
[08:05:09] !log Optimize plwiktionary.flaggedtemplates T290057
[08:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:14] T290057: Optimize flaggedtemplates tables in production. - https://phabricator.wikimedia.org/T290057
[08:06:39] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T289892 (fgiunchedi) Open→Resolved This is complete, thank you all!
[08:09:13] ^ godog, as I said on the patch, that is missing the actual group change
[08:10:11] jynus: doh, I missed it
[08:11:25] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T289892 (fgiunchedi) Resolved→Open It wasn't resolved, still pending a group change
[08:11:39] jynus: so ok to add to wmf correct?
[08:11:57] yep, only was waiting on her updating her email on wikitech
[08:12:00] which was done
[08:13:13] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:14:01] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:14:46] SRE, Wikimedia-Mailing-lists: Emails on wlm-announce seem not to have arrived (due to banlist) - https://phabricator.wikimedia.org/T289928 (Aklapper)
[08:15:52] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T289892 (fgiunchedi) Open→Resolved We're all done, please verify access @Bethany !
[08:18:13] !log Optimize cewiki.flaggedtemplates T290057
[08:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:19] T290057: Optimize flaggedtemplates tables in production. - https://phabricator.wikimedia.org/T290057
[08:19:17] (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/714987 (owner: L10n-bot)
[08:22:10] (CR) Abijeet Patro: "Sorry, in favor of: I2a0adb06199c1b3d818a8fce7d80769f0c503948" [software/mailman-templates] - https://gerrit.wikimedia.org/r/714987 (owner: L10n-bot)
[08:25:22] (CR) Abijeet Patro: [V: +2] "Looks good."
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/715477 (owner: 10L10n-bot) [08:25:43] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:15] 10SRE, 10Wikimedia-Mailing-lists: Make customized Mailman3 templates translatable - https://phabricator.wikimedia.org/T282018 (10abi_) [08:32:12] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net, 10Language-Team (Language-2021-July-September): Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10abi_) 05Open→03Resolved We have the necessary permissions now. Exports are working properly. Resolving this ta... [08:38:09] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:38:49] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 130, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:39:40] !log Optimize plwiki.flaggedtemplates T290057 [08:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:45] T290057: Optimize flaggedtemplates tables in production. 
- https://phabricator.wikimedia.org/T290057 [08:55:09] (03PS1) 10Majavah: P::toolforge::redis_sentinel: Block REPLICAOF too [puppet] - 10https://gerrit.wikimedia.org/r/715703 [08:59:27] (03CR) 10David Caro: [C: 03+1] "Good catch" [puppet] - 10https://gerrit.wikimedia.org/r/715703 (owner: 10Majavah) [09:01:59] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:57] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:05:37] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:05:41] (03CR) 10Michael Große: "T235292 has been adjusted to also include P360" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583407 (https://phabricator.wikimedia.org/T235292) (owner: 10Lucas Werkmeister (WMDE)) [09:26:41] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:38] 10SRE, 10DNS, 10Traffic: More DNS entries for WikiLearn servers - https://phabricator.wikimedia.org/T290025 (10Vgutierrez) I guess you also need a proper CAA record to authorize AWS CA to issue certs for learn.wiki [09:44:59] (03PS1) 10Vgutierrez: learn.wiki: Add LB and CA validation CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/715706 (https://phabricator.wikimedia.org/T290025) [09:50:05] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: More DNS entries for WikiLearn servers - https://phabricator.wikimedia.org/T290025 (10Vgutierrez) From https://docs.aws.amazon.com/acm/latest/userguide/setup-caa.html it looks like any 
of amazon.com, amazontrust.com, awstrust.com or amazonaws.com would do it as... [10:02:55] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:54] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10TheDJ) Just a thank you to Tim and Lego for working on this for all that time. I know its been quite a bit o... [10:10:57] (03CR) 10Hnowlan: [C: 03+2] "nice, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/715552 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:11:50] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1008.eqiad.wmnet [10:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:18] (03CR) 10Ema: [C: 03+1] learn.wiki: Add LB and CA validation CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/715706 (https://phabricator.wikimedia.org/T290025) (owner: 10Vgutierrez) [10:12:23] (03PS2) 10Ladsgroup: Set permission of creating short url to everyone everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921) [10:12:55] (03CR) 10Vgutierrez: [C: 03+2] learn.wiki: Add LB and CA validation CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/715706 (https://phabricator.wikimedia.org/T290025) (owner: 10Vgutierrez) [10:14:51] !log Optimize huwiki.flaggedtemplates T290057 [10:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:56] T290057: Optimize flaggedtemplates tables in production. 
- https://phabricator.wikimedia.org/T290057 [10:16:53] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: More DNS entries for WikiLearn servers - https://phabricator.wikimedia.org/T290025 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez ` vgutierrez@carrot:~$ host -t CAA learn.wiki learn.wiki has CAA record 0 issue "letsencrypt.org" learn.wiki has CAA record 0... [10:16:59] (03PS3) 10Ladsgroup: Set permission of creating short url to everyone everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921) [10:17:03] (03CR) 10Ladsgroup: Set permission of creating short url to everyone everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [10:17:10] (03PS1) 10Hnowlan: aqs_next: use same druid datasource as aqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/715712 [10:17:37] (03PS3) 10Jbond: confd/confd-lint-wrap.py: Port for Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658414 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [10:18:34] (03CR) 10Jbond: [C: 03+2] confd/confd-lint-wrap.py: Port for Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658414 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [10:18:41] (03CR) 10Joal: [C: 03+1] "Thanka a lot @hnowlan :)" [puppet] - 10https://gerrit.wikimedia.org/r/715712 (owner: 10Hnowlan) [10:23:12] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1008.eqiad.wmnet [10:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:27] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1010.eqiad.wmnet [10:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:41] (03CR) 10Hnowlan: [C: 03+2] aqs_next: use same druid datasource as aqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/715712 (owner: 10Hnowlan) [10:25:47] 
PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:28:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:32:39] 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (10jcrespo) FYI I have now started backups of commonswiki with only 4 read threads on eqiad. So far I've seen no impact on latency, and not even on the total amount of reads/s (it is very serial, s... [10:38:44] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on maps1010.eqiad.wmnet with reason: Resyncing from master [10:38:46] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on maps1010.eqiad.wmnet with reason: Resyncing from master [10:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:42] (03CR) 10MVernon: "Hi," [debs/python-eventlet] (debian/bullseye) - 10https://gerrit.wikimedia.org/r/715199 (https://phabricator.wikimedia.org/T283714) (owner: 10Filippo Giunchedi) [10:56:02] (03CR) 10MVernon: Fix dnspython 2 compat (031 comment) [debs/python-eventlet] (debian/bullseye) - 10https://gerrit.wikimedia.org/r/715199 (https://phabricator.wikimedia.org/T283714) (owner: 10Filippo Giunchedi) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! 
European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T1100) [11:00:05] MatmaRex: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:24] hello [11:00:29] I can deploy today [11:00:32] Hello MatmaRex [11:00:56] (03PS3) 10Urbanecm: Offer the DiscussionTools reply tool as opt-out setting at 21 Wikipedias ("phase 2") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715574 (https://phabricator.wikimedia.org/T288483) (owner: 10Bartosz Dziewoński) [11:01:16] (03CR) 10Urbanecm: [C: 03+2] Offer the DiscussionTools reply tool as opt-out setting at 21 Wikipedias ("phase 2") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715574 (https://phabricator.wikimedia.org/T288483) (owner: 10Bartosz Dziewoński) [11:02:01] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:02] (03Merged) 10jenkins-bot: Offer the DiscussionTools reply tool as opt-out setting at 21 Wikipedias ("phase 2") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715574 (https://phabricator.wikimedia.org/T288483) (owner: 10Bartosz Dziewoński) [11:03:50] MatmaRex: your patch is at mwdebug2001, can you have a look? 
[11:04:15] yep [11:04:56] seems good [11:05:03] tested at kowiki [11:05:12] syncing [11:06:47] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: eb482e3fa88a87166b990fd9b87d0ccbbf971290: Offer the DiscussionTools reply tool as opt-out setting at 21 phase 2 Wikipedias (T288483) (duration: 00m 57s) [11:06:48] 10SRE, 10Infrastructure-Foundations, 10netops: 2021-08-26 Primary inbound port utilisation over 80% page for mr1-esams.wikimedia.org - https://phabricator.wikimedia.org/T289820 (10ayounsi) I removed the management routers from the wrong alert, that's why we got paged again. It's now fixed so it won't page wh... [11:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:52] T288483: Deploy config to make Reply Tool available as opt-out at phase 2 wikis - https://phabricator.wikimedia.org/T288483 [11:06:55] MatmaRex: here you go! [11:06:58] anything else? [11:07:30] that's all, thanks [11:07:36] any time :) [11:07:57] !log EU B&C window done [11:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:20] or actually... [11:08:33] (03PS1) 10Urbanecm: updateMenteeData: Send timing to statsd [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715524 (https://phabricator.wikimedia.org/T278971) [11:08:39] (03CR) 10Urbanecm: [C: 03+2] updateMenteeData: Send timing to statsd [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715524 (https://phabricator.wikimedia.org/T278971) (owner: 10Urbanecm) [11:08:57] let's get this out too [11:09:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:54] (03CR) 10Jbond: [C: 03+2] lldp fact: add new parent key to lldp [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [11:21:06] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) Backup of commonswiki started, around 70K files backed up (slowly) so far: ` root@db1176.eqiad.wmnet[mediabackups]>... [11:26:57] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:19] (03Merged) 10jenkins-bot: updateMenteeData: Send timing to statsd [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715524 (https://phabricator.wikimedia.org/T278971) (owner: 10Urbanecm) [11:28:24] \o [11:31:42] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/GrowthExperiments/maintenance/updateMenteeData.php: 53a1856128edb4ec3a5ea8840fb6755a1703f7ac: updateMenteeData: Send timing to statsd (T278971) (duration: 00m 57s) [11:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:48] T278971: Mentor dashboard: M1 mentee overview module - https://phabricator.wikimedia.org/T278971 [11:31:48] * urbanecm done for real [11:32:24] (03PS1) 10Hnowlan: maps1009: remove temporary overrides [puppet] - 10https://gerrit.wikimedia.org/r/715721 [11:33:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:19] 10SRE, 10Commons, 10Datasets-Archiving, 10Datasets-General-or-Unknown, and 2 others: Back up of Commons files - https://phabricator.wikimedia.org/T160229 (10jcrespo) a:03jcrespo It's happening: https://www.youtube.com/watch?v=imbGdfzckrI See T262668#7321326 for details. [11:38:38] there was a latency spike on swift just now, but it is recent [11:39:20] (03PS1) 10Urbanecm: mediawiki/maintenance/growthexperiments.pp: Add --statsd to updateMenteeData.php [puppet] - 10https://gerrit.wikimedia.org/r/715723 (https://phabricator.wikimedia.org/T278971) [11:40:08] was something deployed at around 11:26? [11:40:58] jynus: https://sal.toolforge.org/log/30T6m3sB8Fs0LHO5r1aT, but that has zero chance to do anything to swift [11:41:13] that's what I would have thought [11:41:30] maybe a normal cache thing or something? or something else [11:41:43] or a requests spike? [11:42:05] that is actually quite constant [11:42:13] (03CR) 10Hnowlan: [C: 03+2] maps: add wikidata polygon table and script fixes [puppet] - 10https://gerrit.wikimedia.org/r/715216 (owner: 10MSantos) [11:42:34] but it could be traffic related, yes [11:42:45] I will check cache graphs [11:43:38] or your commonswiki backup maybe jynus?
[11:43:59] that is what I wanted to know, but I've been running it for over an hour [11:44:11] and this is only from :26 [11:44:16] i see [11:44:26] there is some unavailability on ulsfo and upload [11:44:31] starting at that time [11:44:43] but that could be just a consequence of increased latency [11:46:26] there is an increase in network io [11:46:44] 2 spikes [11:48:14] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?viewPanel=17&orgId=1&from=1630388886051&to=1630410426051&var-DC=eqiad&var-prometheus=eqiad%20prometheus%2Fops [11:50:15] things seem back to normal [11:51:45] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 563 ge 480 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [12:01:50] I'll take a look at the indexing errors [12:02:01] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:47] looks like it was a spike [12:08:11] (03PS1) 10Jbond: admin: droprequire [puppet] - 10https://gerrit.wikimedia.org/r/715724 (https://phabricator.wikimedia.org/T263578) [12:09:01] (03CR) 10Jbond: [C: 03+2] admin: droprequire [puppet] - 10https://gerrit.wikimedia.org/r/715724 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [12:10:23] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10dr0ptp4kt) @jcrespo we have not allocated a wikimedia.org email address. Is that required? If so, I'll ask ITS to provision one. I'm out the rest of the day, heads up. [12:15:28] 10SRE, 10serviceops, 10Datacenter-Switchover: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (10Dzahn) All files sent to releases are meant to be available to the world though.
Does it still matter to encrypt traffic internally for something like this? [12:25:31] (03PS1) 10Dzahn: load mod_alias to be able to use Redirect Directive [container/miscweb] - 10https://gerrit.wikimedia.org/r/715727 (https://phabricator.wikimedia.org/T281538) [12:26:27] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:27] (03PS1) 10Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) [12:30:07] (03CR) 10jerkins-bot: [V: 04-1] admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [12:32:51] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:39:22] (03PS1) 10Dzahn: admin: create a group to run the wmf-auto-reimage cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/715729 [12:40:01] (03PS2) 10Dzahn: admin: create a group to run the wmf-auto-reimage cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/715729 [12:42:34] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [12:42:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [12:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:36] (03CR) 10Dzahn: "Try Stdlib::Host instead of ::Fqdn. 
That would cover both IPs and hosts and I have been told by other reviewers to use that instead in oth" [puppet] - 10https://gerrit.wikimedia.org/r/715636 (owner: 10Legoktm) [12:45:00] (03CR) 10Volans: [C: 04-1] "FYI there is https://phabricator.wikimedia.org/T289779 with a slightly more generic approach that doesn't apply to this specific use case." [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [12:45:44] (03PS2) 10Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) [12:46:12] (03PS3) 10Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) [12:48:54] (03CR) 10MSantos: [C: 04-1] "I was with the impression we were going to re-enable for codfw, eqiad needs to have those set until the new mapping is applied with a new " [puppet] - 10https://gerrit.wikimedia.org/r/715721 (owner: 10Hnowlan) [12:50:05] (03CR) 10Dzahn: "Yes, thank you, though this is specifically meant to be a quick fix that can be removed again as soon as we have better generic solutions." 
[puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [12:51:41] (03PS3) 10Dzahn: admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 [12:52:00] !log run kubectl scale deployments.apps -n ci mediawiki-bruce --replicas=0 to stop ImagePulling and reduce io on kubestage1001 [12:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:36] (03CR) 10Dzahn: [C: 03+2] load mod_alias to be able to use Redirect Directive [container/miscweb] - 10https://gerrit.wikimedia.org/r/715727 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [12:54:35] (03Merged) 10jenkins-bot: load mod_alias to be able to use Redirect Directive [container/miscweb] - 10https://gerrit.wikimedia.org/r/715727 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [12:59:05] !log [urbanecm@mwmaint2002 ~]$ sudo -u www-data kill 133282 # stop updateMenteeData.php at frwiki [12:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:40] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm) [13:02:43] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:26] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:42] 10SRE, 10Analytics, 10Analytics-Kanban, 10Prod-Kubernetes, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (10JMeybohm) >>! In T255871#7320889, @JMeybohm wrote: > @Ottomata that looks unrelated to your chance (but related to yours @Jelto ). We will take a l... 
[13:06:23] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [13:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:09] (03PS1) 10Jbond: admin: create new sre-admins group to match the ldap group [puppet] - 10https://gerrit.wikimedia.org/r/715731 (https://phabricator.wikimedia.org/T289779) [13:10:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30914/console" [puppet] - 10https://gerrit.wikimedia.org/r/715731 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond) [13:11:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] admin: create new sre-admins group to match the ldap group [puppet] - 10https://gerrit.wikimedia.org/r/715731 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond) [13:15:05] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30915/console" [puppet] - 10https://gerrit.wikimedia.org/r/715731 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond) [13:19:48] (03PS4) 10Jbond: admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [13:20:55] (03PS1) 10Jbond: admin: add sre-admins to the always group [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779) [13:21:35] (03PS5) 10Jbond: admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [13:21:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30916/console" [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond) [13:21:59] (03PS6) 10Jbond: admin: create a group to run the wmf-auto-reimage commands [puppet] - 
10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [13:22:07] 10SRE, 10serviceops, 10Datacenter-Switchover: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (10fgiunchedi) IMHO yes, we should encrypt traffic unless we have reasons not to (e.g. system is going to be retired, too hard/complex to implement vs advantages, etc) [13:22:34] (03CR) 10jerkins-bot: [V: 04-1] admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [13:22:55] (03PS7) 10Jbond: admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [13:23:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:23:30] (03CR) 10jerkins-bot: [V: 04-1] admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [13:24:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:25:06] (03PS8) 10Jbond: admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [13:26:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30918/console" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [13:27:17] (03CR) 10Jbond: [V: 03+1] "I have added the sre-admins group and updated this CR to use that instead. 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [13:28:08] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1010.eqiad.wmnet [13:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:05] (03CR) 10Dzahn: "heh, thank you :)" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [13:32:46] (03PS1) 10Filippo Giunchedi: icinga: add dancy,thcipriani,hashar to icinga authorized service/host [puppet] - 10https://gerrit.wikimedia.org/r/715735 (https://phabricator.wikimedia.org/T289746) [13:34:50] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:52] (03CR) 10Dzahn: [C: 03+1] "this is option b) from https://phabricator.wikimedia.org/T289746#7311563 . lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/715735 (https://phabricator.wikimedia.org/T289746) (owner: 10Filippo Giunchedi) [13:36:16] (03CR) 10Effie Mouzeli: [C: 03+1] profile::maps::osm_replica: Allow replicas to be connected to by tegola [puppet] - 10https://gerrit.wikimedia.org/r/710013 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [13:37:43] !log Start `mwscript extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php --wiki=nlwiki --verbose` in a tmux session at mwmaint2002 [13:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:48] (03CR) 10Dzahn: [C: 03+1] "New users just need to be aware of the caveat. 
It's possible to login both with and without capitalization (we just slapped apache auth in" [puppet] - 10https://gerrit.wikimedia.org/r/715735 (https://phabricator.wikimedia.org/T289746) (owner: 10Filippo Giunchedi) [13:39:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, please add a PCC run too" [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm) [13:40:19] 10SRE, 10Analytics, 10Analytics-Kanban, 10Prod-Kubernetes, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (10Ottomata) Great! Proceeding... [13:40:24] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:28] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:15] (03CR) 10Jbond: "see inlines for nits" [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm) [13:43:29] (03CR) 10Jbond: [V: 03+1 C: 03+1] admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [13:45:48] (03CR) 10Dzahn: "We have a process how to add new members to existing groups but not really for how to create new admin groups. 
Let's get a +1 from Wolfgan" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [13:47:06] !log disable puppet fleet wide to perform puppetdb maintenance T263578 [13:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:12] T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 [13:53:16] PROBLEM - Host puppetdb1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:54:17] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30919/console" [puppet] - 10https://gerrit.wikimedia.org/r/715597 (https://phabricator.wikimedia.org/T288806) (owner: 10Zabe) [13:54:57] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] "LGTM, thank you for taking care of this. I'll deploy it next week when I'm off clinic duty" [puppet] - 10https://gerrit.wikimedia.org/r/715597 (https://phabricator.wikimedia.org/T288806) (owner: 10Zabe) [13:55:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_puppetdb site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:58:22] RECOVERY - Host puppetdb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [13:59:33] (03CR) 10Jcrespo: [C: 03+1] icinga: add dancy,thcipriani,hashar to icinga authorized service/host [puppet] - 10https://gerrit.wikimedia.org/r/715735 (https://phabricator.wikimedia.org/T289746) (owner: 10Filippo Giunchedi) [14:00:32] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[14:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:11] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: add dancy,thcipriani,hashar to icinga authorized service/host [puppet] - 10https://gerrit.wikimedia.org/r/715735 (https://phabricator.wikimedia.org/T289746) (owner: 10Filippo Giunchedi) [14:01:46] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:04] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on puppetdb1002.eqiad.wmnet with reason: puppetdb maintance - T289779 [14:02:06] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on puppetdb1002.eqiad.wmnet with reason: puppetdb maintance - T289779 [14:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:08] T289779: Creat a new ldap group for sre users without root access - https://phabricator.wikimedia.org/T289779 [14:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:21] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on puppetdb2002.codfw.wmnet with reason: puppetdb maintance - T289779 [14:02:23] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on puppetdb2002.codfw.wmnet with reason: puppetdb maintance - T289779 [14:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:34] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[14:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:20] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:23] SRE, LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (jcrespo) I was asked by SRE Infrastructure Foundations to ask you this, as a production alert has gone off because of this. [14:03:57] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [14:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:16] SRE, Anti-Harassment, IP Info, serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (mepps) @Niharika Based on my read, it also looks like the 10 day delay would only be when there were holidays too. What's the next step... [14:04:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:05:17] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [14:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:42] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [14:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:01] (CR) Filippo Giunchedi: "Thank you for the review!" 
[debs/python-eventlet] (debian/bullseye) - https://gerrit.wikimedia.org/r/715199 (https://phabricator.wikimedia.org/T283714) (owner: Filippo Giunchedi) [14:07:51] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_puppetdb site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:09:16] (PS5) Ottomata: admin README - convert to markdown and clarify system user/group docs [puppet] - https://gerrit.wikimedia.org/r/708777 [14:09:21] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:36] SRE, ops-eqiad, cloud-services-team (Kanban): cloudcephosd1014.mgmt reported down by icinga - https://phabricator.wikimedia.org/T289755 (Cmjohnson) Open→Resolved replaced the cable [14:09:59] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[14:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:35] (CR) Ottomata: [C: +2] admin README - convert to markdown and clarify system user/group docs [puppet] - https://gerrit.wikimedia.org/r/708777 (owner: Ottomata) [14:11:18] RECOVERY - Host cloudcephosd1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.41 ms [14:12:19] (PS3) Ottomata: service_auto_restart - match full line when ensuring absent [puppet] - https://gerrit.wikimedia.org/r/697605 [14:15:45] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond) I have added a new 100GB disk so that the system has enough space to preform the vacume. this has meant doing the following * add new ga... [14:16:14] (CR) Ottomata: [C: +2] service_auto_restart - match full line when ensuring absent [puppet] - https://gerrit.wikimedia.org/r/697605 (owner: Ottomata) [14:16:25] (CR) Ottomata: [C: +2] "Tested in deployment-prep, nothing bad happened..." [puppet] - https://gerrit.wikimedia.org/r/697605 (owner: Ottomata) [14:16:48] SRE, SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (jcrespo) [14:19:08] SRE, SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (Ottomata) I think I can still approve these for Analytics access. Approved! [14:19:28] !log merged change to service_auto_restart.pp that changes the way service names are matched to be more explicit. tested in deployment prep and nothing bad happened. Logging in case something bad does happen in prod. 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/697605 [14:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:47] (CR) Jcrespo: "Hey, @Ottomata, I wanted to update this based on our own manual (specially, as you were on vacations), but if the director delegates this " [puppet] - https://gerrit.wikimedia.org/r/715259 (owner: Jcrespo) [14:22:39] SRE, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Ottomata) I just tried to run puppet on an-coord1001 but got: ` Notice: Skipping run of Puppet configuration client; administratively disabled (Rea... [14:23:36] SRE, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (RhinosF1) Puppet is under maintenance [14:23:50] SRE, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Ottomata) Oh, sorry @jbond is doing some maintenance and referenced the wrong phab ticket. Ignore ^ [14:25:26] (CR) Ottomata: "Thanks! I'll ask Olja what she thinks. I'm likely to be a faster reviewer on phab than she is. Could we put both of our names there?" [puppet] - https://gerrit.wikimedia.org/r/715259 (owner: Jcrespo) [14:29:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:29:48] !log Restarting CI Jenkins for plugins upgrade [14:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:59] !log enable puppet fleet wide to post preform puppetdb maintance T263578 [14:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:05] T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 [14:30:45] ottomata: fyi ^^^ puppet should be enabled again [14:31:32] ty [14:31:36] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:53:18] (PS1) Vgutierrez: haproxy: Use systemd::service [puppet] - https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) [14:53:26] (CR) Effie Mouzeli: [C: -1] "Overall looks ok (by my eyes are not very experienced), some nits." 
[deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: BryanDavis) [14:53:50] (CR) Ahmon Dancy: icinga: add dancy,thcipriani,hashar to icinga authorized service/host (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715735 (https://phabricator.wikimedia.org/T289746) (owner: Filippo Giunchedi) [14:54:52] (CR) Effie Mouzeli: [C: -1] toolhub: Add helmfile.d (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: BryanDavis) [14:55:18] (CR) Effie Mouzeli: [C: -1] toolhub: Add helmfile.d (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: BryanDavis) [14:56:04] (PS2) Vgutierrez: haproxy: Use systemd::service [puppet] - https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) [14:56:25] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond) https://www.2ndquadrant.com/en/blog/postgresql-vacuum-and-analyze-best-practice-tips/ has some good advice on autovacum settings, this is... [14:56:31] SRE, Traffic, SRE Observability (FY2021/2022-Q1): Use Grizzly for Varnish SLO Grafana dashboard - https://phabricator.wikimedia.org/T289036 (herron) Thanks @ema! This is helpful feedback >>! In T289036#7320951, @ema wrote: > The diff step, `grr diff dashboardname`, is unclear to me. What is dashboa... 
[15:05:06] (PS3) Vgutierrez: haproxy: Use systemd::service [puppet] - https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) [15:05:33] SRE, Analytics, Analytics-Kanban, Prod-Kubernetes, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata) [15:06:25] (PS1) Alexandros Kosiaris: Remove ocg remnant [labs/private] - https://gerrit.wikimedia.org/r/715744 [15:06:31] (PS1) Alexandros Kosiaris: (WIP): Unify kubernetes users to automate user creation [labs/private] - https://gerrit.wikimedia.org/r/715745 [15:07:00] SRE, Gerrit, GitLab, Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (fgiunchedi) Open→Resolved a:fgiunchedi Optimistically resolving! Feel free to reopen [15:07:34] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30927/console" [puppet] - https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez) [15:07:39] (PS1) Elukey: kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919) [15:09:15] (PS2) Elukey: kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919) [15:09:57] (PS3) Elukey: kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919) [15:10:30] (CR) jerkins-bot: [V: -1] kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919) (owner: Elukey) [15:11:15] (PS4) 
Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) [15:13:39] (PS4) Elukey: kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919) [15:14:58] SRE, ops-eqiad, DC-Ops: Netbox Duplicate Cable IDs & Accounting Discrepancies - https://phabricator.wikimedia.org/T285719 (Cmjohnson) Open→Resolved Corrected all the duplicate cable ID's in eqiad. [15:15:35] SRE, ops-eqiad: eqiad: add VC-links IDs to Netbox - https://phabricator.wikimedia.org/T268750 (Cmjohnson) row A in eqiad has been updated [15:17:26] SRE, ops-eqiad, DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (Cmjohnson) I am not sure what needs to be done with this task. There really isn't anything actionable other than to replace the scs with something else. [15:17:42] SRE, ops-eqiad, DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625 (Cmjohnson) [15:18:10] SRE, ops-eqiad, DC-Ops: document all scs connections - https://phabricator.wikimedia.org/T175876 (Cmjohnson) Open→Resolved thanks @ayounsi there is another task for duplicate labels. That is all fixed. 
[15:18:15] (PS5) Elukey: kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919) [15:20:42] (PS1) Jbond: Gemfile: add sync as a dependency [puppet] - https://gerrit.wikimedia.org/r/715751 [15:23:52] (CR) Jbond: [C: +2] Gemfile: add sync as a dependency [puppet] - https://gerrit.wikimedia.org/r/715751 (owner: Jbond) [15:24:51] (CR) Hnowlan: [C: +2] mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714564 (owner: PipelineBot) [15:26:13] (CR) Hashar: [C: -1] zuul: migrate cron of zuul_repack to systemd timer (2 comments) [puppet] - https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [15:27:46] (Merged) jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/714564 (owner: PipelineBot) [15:28:53] (PS2) Michael DiPietro: update quarry systemd and branch [puppet] - https://gerrit.wikimedia.org/r/714640 [15:29:20] (PS4) Vgutierrez: haproxy: Use systemd::service [puppet] - https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) [15:30:05] (PS5) Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) [15:31:31] (PS6) Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) [15:32:24] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:36:35] (PS7) Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) [15:37:33] (CR) Jbond: [V: +1] "PCC 
SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30931/console" [puppet] - https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) (owner: Jbond) [15:38:29] (PS10) Zabe: zuul: migrate cron of zuul_repack to systemd timer [puppet] - https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) [15:42:06] (CR) Zabe: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [15:43:42] (CR) Zabe: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/890/contint1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [15:45:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:45:48] (PS3) Cwhite: profile: adapt alertmanager-webhook-logger to ECS [puppet] - https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) [15:46:07] (CR) Cwhite: profile: adapt alertmanager-webhook-logger to ECS (2 comments) [puppet] - https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: Cwhite) [15:46:27] (CR) Zabe: zuul: migrate cron of zuul_repack to systemd timer (2 comments) [puppet] - https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [15:47:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:50:10] (PS1) Hnowlan: maps: disable sync on maps2009 [puppet] - https://gerrit.wikimedia.org/r/715754 [15:52:08] !log hnowlan@deploy1002 Started deploy [restbase/deploy@09156c2]: fix core Title redirect loop [15:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:38] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond) >>! In T263578#7264013, @Volans wrote: > - The `Admin::Hashuser` and `Admin::Hashgroup` seems to have tons of relations that I don't thin... [15:53:48] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (Cmjohnson) [15:53:55] (CR) Effie Mouzeli: "I suggest we first split this patch into 2, chart updates and helmfile.d updates, and we can review it again." [deployment-charts] - https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) (owner: BryanDavis) [15:54:09] SRE, ops-eqiad: Rack/power audit in eqiad c8/d5 - https://phabricator.wikimedia.org/T280977 (Cmjohnson) Open→Resolved I am not sure if any of this is needed still but here is the info requeted. There are currently 2 available network ports and 135power ports available in C8 1 available networ... 
[15:54:23] SRE, ops-eqiad, DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (Cmjohnson) [15:54:25] SRE, CommRel-Specialists-Support (Jul-Sep-2021), Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (sgrabarczuk) [15:55:41] (CR) Effie Mouzeli: [C: +1] toolhub: Set pod requests/limits [deployment-charts] - https://gerrit.wikimedia.org/r/715604 (https://phabricator.wikimedia.org/T280881) (owner: BryanDavis) [16:00:05] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T1600). [16:00:28] (CR) Effie Mouzeli: [C: +1] "Please keep in mind that the staging cluster generally has limited resources 😊" [deployment-charts] - https://gerrit.wikimedia.org/r/715531 (owner: DCausse) [16:00:37] (Abandoned) Kosta Harlan: bullseye-sssd: Add openssh-client [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/715215 (https://phabricator.wikimedia.org/T258841) (owner: Kosta Harlan) [16:04:57] (PS3) Dduvall: aptrepo: Add gitlab-runner repo mirror [puppet] - https://gerrit.wikimedia.org/r/715134 (https://phabricator.wikimedia.org/T287504) [16:05:16] (CR) Michael DiPietro: "https://puppet-compiler.wmflabs.org/compiler1001/30934/" [puppet] - https://gerrit.wikimedia.org/r/714640 (owner: Michael DiPietro) [16:05:58] (CR) Dduvall: [C: +1] "Looking for a merge if anyone has time. This is blocking my testing of the gitlab-runner profile." 
[puppet] - https://gerrit.wikimedia.org/r/715134 (https://phabricator.wikimedia.org/T287504) (owner: Dduvall) [16:07:15] (CR) Jbond: [C: +2] aptrepo: Add gitlab-runner repo mirror [puppet] - https://gerrit.wikimedia.org/r/715134 (https://phabricator.wikimedia.org/T287504) (owner: Dduvall) [16:07:28] (CR) Effie Mouzeli: [C: +2] aptrepo: Add gitlab-runner repo mirror [puppet] - https://gerrit.wikimedia.org/r/715134 (https://phabricator.wikimedia.org/T287504) (owner: Dduvall) [16:07:56] jbond, effie: simultaneous thanks! ^ :) [16:08:01] effie: i think i just beat you :P [16:08:06] (PS1) Volans: pylint: remove unnecessary disable comments [cookbooks] - https://gerrit.wikimedia.org/r/715756 [16:08:10] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@09156c2]: fix core Title redirect loop (duration: 16m 02s) [16:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:36] jbond: I was trying to understand how I +2'ed something and it got merged [16:08:43] (CR) DCausse: [C: +2] rdf-streaming-updater: Adjust memory limits (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/715531 (owner: DCausse) [16:08:46] I almost had a heart attack :p [16:09:29] dduvall: fyi if looking for a merge asking in #wikimedia-sre will normally find someone [16:09:32] :) [16:09:47] also fyi puppet also run on the apt servers [16:10:00] right on. i'm always trying to find better more polite ways to hound people for merges :) [16:10:57] dduvall: and feel free to ping me if you still have no luck and it's in the EU timezone. 
failing everything else there is https://wikitech.wikimedia.org/wiki/Puppet_request_window [16:11:03] SRE, SRE-swift-storage, Data-Persistence-Backup, media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (jcrespo) I made a mistake by an order of magnitude, we have backed up approximately 2.5TB or half a million of files in less th... [16:11:03] and yes i beat effie :D [16:11:18] haha [16:11:21] haha [16:11:27] (Merged) jenkins-bot: rdf-streaming-updater: Adjust memory limits [deployment-charts] - https://gerrit.wikimedia.org/r/715531 (owner: DCausse) [16:11:51] (CR) Volans: [C: +2] "Comments only, self-merging" [cookbooks] - https://gerrit.wikimedia.org/r/715756 (owner: Volans) [16:12:08] (PS13) Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [16:13:37] SRE, CommRel-Specialists-Support (Jul-Sep-2021), Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (sgrabarczuk) @Legoktm I'd like to check again because I may need to make a tweak in [[https://meta.wikimedia.org/wiki/Tech/Server_switch|t... [16:14:28] !log dcausse@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [16:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:55] (Merged) jenkins-bot: pylint: remove unnecessary disable comments [cookbooks] - https://gerrit.wikimedia.org/r/715756 (owner: Volans) [16:17:08] SRE, ops-codfw, DC-Ops, observability, SRE Observability (FY2021/2022-Q1): (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (lmata) Hi @Papaul is it possible to ask for Bullseye with this ticket? thanks! 
[16:18:55] !log dcausse@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [16:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:27] SRE, ops-codfw, DC-Ops, observability, SRE Observability (FY2021/2022-Q1): (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (herron) [16:23:08] jbond: hmm, i don't see a `gitlab-runner` component yet under https://apt.wikimedia.org/wikimedia/dists/buster-wikimedia/thirdparty/ [16:25:23] dduvall: one sec i will need to run $something to do the initial sync [16:25:38] ah, ok [16:29:51] (CR) Ryan Kemper: [C: +2] wcqs: add wcqs.discovery.wmnet dummy key [labs/private] - https://gerrit.wikimedia.org/r/715570 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper) [16:30:04] (CR) Ryan Kemper: [C: +2] wcqs: create tls cert [puppet] - https://gerrit.wikimedia.org/r/715569 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper) [16:34:47] dduvall: also missed this ^^ (which i should have spotted in review) [16:34:51] (PS1) Jbond: aptrepo: Add gitlab-runner repo mirror [puppet] - https://gerrit.wikimedia.org/r/715761 (https://phabricator.wikimedia.org/T287504) [16:34:57] ^^ even :) [16:35:34] (CR) Jbond: [V: +2 C: +2] aptrepo: Add gitlab-runner repo mirror [puppet] - https://gerrit.wikimedia.org/r/715761 (https://phabricator.wikimedia.org/T287504) (owner: Jbond) [16:36:48] ryankemper: fyi also merging b/files/ssl/wcqs.discovery.wmnet.crt [16:37:48] jbond: oooh, ok. thanks for the follow-up patch [16:37:49] jbond: much appreciated [16:37:51] * ryankemper got distracted [16:37:59] :) no problem [16:39:33] dduvall: https://apt.wikimedia.org/wikimedia/dists/buster-wikimedia/thirdparty/gitlab-runner/ is there now [16:39:54] jbond: \o/ and `apt-cache showpkg gitlab-runner` shows it [16:39:58] thanks! 
[16:40:04] great and no probs [16:45:10] (PS1) Urbanecm: Enable link recommendation frontent in dewiki and nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/715763 (https://phabricator.wikimedia.org/T288420) [16:49:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:51:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:53:39] (PS14) Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [17:00:05] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T1700). 
[17:03:19] (PS1) Jbond: realm.pp: update to use structured facts [puppet] - https://gerrit.wikimedia.org/r/715766 [17:03:36] (PS15) Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [17:04:06] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30935/console" [puppet] - https://gerrit.wikimedia.org/r/715766 (owner: Jbond) [17:04:19] (CR) Volans: [C: +1] "LGTM, to be tested in cloud too to be sure" [puppet] - https://gerrit.wikimedia.org/r/715766 (owner: Jbond) [17:05:09] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30936/console" [puppet] - https://gerrit.wikimedia.org/r/715766 (owner: Jbond) [17:05:22] (CR) Jbond: [V: +1 C: +2] "noop in cloud and prod" [puppet] - https://gerrit.wikimedia.org/r/715766 (owner: Jbond) [17:06:17] (PS16) Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [17:10:13] (CR) Wolfgang Kandek: [C: +1] "Approved, excellent for Arnold's progress in onboarding." [puppet] - https://gerrit.wikimedia.org/r/715729 (owner: Dzahn) [17:10:53] (PS17) Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [17:17:10] (PS18) Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [17:21:52] (CR) RLazarus: [C: +1] "Intent LGTM -- implementation looks good too but I don't know this code well. 
:)" [puppet] - https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779) (owner: Jbond) [17:24:06] (PS19) Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [17:34:01] (PS20) Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [17:36:24] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:36:30] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:38:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:44:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:55:38] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:55:46] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T1800) [18:01:28] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:01:34] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:03:04] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (odimitrijevic) Approved! Apologies for the delay. [18:03:32] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (odimitrijevic) Approved! [18:04:07] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (odimitrijevic) Approved. 
[18:05:46] !log re-pool eqsin-codfw link [18:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:09:16] (CR) ODimitrijevic: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/715259 (owner: Jcrespo) [18:09:48] SRE, CommRel-Specialists-Support (Jul-Sep-2021), Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (Legoktm) >>! In T287546#7322191, @sgrabarczuk wrote: > @Legoktm I'd like to check again because I may need to make a tweak in [[https://me... [18:10:15] (CR) ODimitrijevic: [C: +1] "Btw, agree to have both of us as approvers. This was not on my radar and now that it is I will be paying attention to the timely approvals" [puppet] - https://gerrit.wikimedia.org/r/715259 (owner: Jcrespo) [18:15:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:18:03] (PS1) Ladsgroup: Absent wikidata alerts [puppet] - https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) [18:19:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:19:16] (CR) jerkins-bot: [V: -1] Absent wikidata alerts [puppet] - https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) (owner: Ladsgroup) [18:21:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:24:43] !log ran `scap prep 1.37.0-wmf.21` and `scap apply-patches --train 1.37.0-wmf.21` refs T281162 [18:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:48] T281162: 1.37.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T281162 [18:25:54] (PS2) Ladsgroup: Absent wikidata alerts [puppet] - https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) [18:27:11] (CR) jerkins-bot: [V: -1] Absent wikidata alerts [puppet] - https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) (owner: Ladsgroup) [18:28:41] (PS3) Ladsgroup: Absent wikidata alerts [puppet] - https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) [18:28:55] (PS1) 20after4: testwikis wikis to 1.37.0-wmf.21 refs T281161 [mediawiki-config] - https://gerrit.wikimedia.org/r/715773 [18:28:57] (CR) 20after4: [C: +2] testwikis wikis to 1.37.0-wmf.21 refs T281161 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/715773 (owner: 1020after4) [18:30:02] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.21 refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715773 (owner: 1020after4) [18:30:05] !log twentyafterfour@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.21 refs T281161 [18:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:10] T281161: 1.37.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T281161 [18:34:30] (03CR) 10Bstorm: [C: 03+2] cloud osmdb: set num_threads in the sync job [puppet] - 10https://gerrit.wikimedia.org/r/715623 (https://phabricator.wikimedia.org/T285668) (owner: 10Bstorm) [18:34:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:35:18] (03PS2) 10Bstorm: cloud osmdb: don't use proxy for cloud [puppet] - 10https://gerrit.wikimedia.org/r/715624 (https://phabricator.wikimedia.org/T285668) [18:35:20] PROBLEM - Host cp5011 is DOWN: PING CRITICAL - Packet loss = 100% [18:35:26] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100% [18:35:34] PROBLEM - Host cp5003 is DOWN: PING CRITICAL - Packet loss = 100% [18:35:34] PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100% [18:35:52] hmm [18:36:07] wth [18:36:20] PROBLEM - Host cp5014 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:29] I don't know what those hosts do but that looks bad. [18:36:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:36:49] I'm going to guess this isn't related to train ... given that I haven't deployed to anything yet [18:36:59] nod. 
[18:37:11] it's just syncing masters right now [18:37:36] RECOVERY - Host cp5014 is UP: PING WARNING - Packet loss = 71%, RTA = 222.91 ms [18:37:46] how nice [18:37:48] RECOVERY - Host cp5011 is UP: PING WARNING - Packet loss = 50%, RTA = 291.12 ms [18:37:48] RECOVERY - Host cp5006 is UP: PING WARNING - Packet loss = 66%, RTA = 292.37 ms [18:37:48] RECOVERY - Host doh5001 is UP: PING WARNING - Packet loss = 75%, RTA = 223.54 ms [18:37:52] RECOVERY - Host cp5003 is UP: PING OK - Packet loss = 0%, RTA = 236.68 ms [18:37:55] nothing to see here folks [18:38:32] (03PS1) 10Ssingh: durum: add test information to the results [puppet] - 10https://gerrit.wikimedia.org/r/715776 [18:39:55] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30937/console" [puppet] - 10https://gerrit.wikimedia.org/r/715776 (owner: 10Ssingh) [18:40:46] PROBLEM - Host cp5014 is DOWN: PING CRITICAL - Packet loss = 100% [18:40:58] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100% [18:40:58] PROBLEM - Host cp5003 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:07] (03PS6) 10Legoktm: backup: Simplify Mailman backups [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [18:41:16] RECOVERY - Host cp5014 is UP: PING WARNING - Packet loss = 75%, RTA = 222.74 ms [18:41:24] PROBLEM - Host cp5011 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:24] PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:32] RECOVERY - Host cp5006 is UP: PING WARNING - Packet loss = 90%, RTA = 293.36 ms [18:41:32] RECOVERY - Host cp5003 is UP: PING WARNING - Packet loss = 71%, RTA = 236.60 ms [18:41:34] RECOVERY - Host cp5011 is UP: PING OK - Packet loss = 0%, RTA = 291.14 ms [18:41:34] RECOVERY - Host doh5001 is UP: PING OK - Packet loss = 0%, RTA = 222.97 ms [18:41:52] (03PS7) 10Legoktm: backup: Simplify Mailman backups [puppet] - 
10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [18:42:37] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: add test information to the results [puppet] - 10https://gerrit.wikimedia.org/r/715776 (owner: 10Ssingh) [18:44:16] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30939/console" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [18:44:18] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/715777 [18:47:11] (03CR) 10Legoktm: [V: 03+1] "Do we need to ensure => absent first, or can it just be removed?" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [18:47:47] (03PS3) 10Legoktm: mailman: Drop lists3 role [puppet] - 10https://gerrit.wikimedia.org/r/698306 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [18:54:21] (03CR) 10Legoktm: rsync::quickdatacopy: Allow specifying a custom interval for auto_sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm) [18:54:50] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10sbassett) >>! In T288844#7321649, @mepps wrote: > It sounds like @sbassett is moving forward with looking into this. Er, whoops, I'm a... [18:56:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:59:22] Tons of production errors right now. [19:00:00] * dancy checks the source host [19:00:04] twentyafterfour and dancy: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T1900). [19:00:51] mw2296. thwikisource. [19:01:06] only that wiki/host dancy ? [19:01:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:02:36] answering myself: looks so [19:03:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:05:58] !log twentyafterfour@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.21 refs T281161 (duration: 35m 53s) [19:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:03] T281161: 1.37.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T281161 [19:07:21] (03CR) 10Andrew Bogott: [C: 03+1] update quarry systemd and branch [puppet] - 10https://gerrit.wikimedia.org/r/714640 (owner: 10Michael DiPietro) [19:07:29] that's odd. 
[19:07:39] now moved to mw2318 [19:07:57] and mw2251 [19:08:19] those three hosts only [19:08:55] 50k errors is crazy for thwikisource [19:09:41] and it's on wmf.20 not 21 [19:09:51] and that wiki normally only has 6k views per day [19:10:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:10:38] twentyafterfour: were there any warnings in scap about any of the three hosts this happened on? [19:10:57] no nothing [19:11:17] hmm... [19:11:57] and the deployment was just for wmf.21 (though there could have been unsynced change that inadvertantly got synced with the train?) [19:12:09] since the testwiki deployment does sync-world [19:12:36] that alone wouldn't explain a) why it happens only on three servers b) why it happens on such low-traffic wiki only [19:12:54] yeah that part I don't know [19:13:23] /srv/mediawiki/php-1.37.0-wmf.20/extensions/Scribunto/includes/common/ApiScribuntoConsole.php(102): Scribunto_LuaEngine->runConsole(array) [19:13:41] someone sending a lot of crazy input into console? [19:13:49] yeah ... [19:13:52] let me check [19:14:53] Scribunto_LuaSandboxInterpreter->callFunction(LuaSandboxFunction, LuaSandboxFunction, LuaSandboxFunction) [19:15:23] it's not happening anymore [19:15:43] i'm still personally curious why it happened at all though :) [19:15:59] 50,004 errors [19:16:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:16:11] I think someone created an infinite recursion or something [19:16:27] or just a loop over 50k items [19:17:01] yeah [19:17:37] I'm not even sure that this warning should be showing up in the production errors logstash dashboard... 
[19:17:52] well it's a PHP warning, so...yes :) [19:18:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:18:01] What's the actual exception/warning? [19:18:08] PHP Warning: mb_substr() expects parameter 2 to be integer, float given [19:18:13] [c404b613-22e3-443e-b4ec-24a4082e2137] /w/api.php PHP Warning: mb_substr() expects parameter 2 to be integer, float given [19:18:14] [19:18:17] from /srv/mediawiki/php-1.37.0-wmf.20/extensions/Scribunto/includes/engines/LuaCommon/UstringLibrary.php(319) [19:18:20] 50k times at thwikisource [19:18:28] PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1088.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:18:42] Probably a bug in Scribunto's parameter validation then [19:18:53] likely [19:19:24] (03CR) 10Bstorm: [C: 03+2] cloud osmdb: don't use proxy for cloud [puppet] - 10https://gerrit.wikimedia.org/r/715624 (https://phabricator.wikimedia.org/T285668) (owner: 10Bstorm) [19:20:25] https://gerrit.wikimedia.org/g/mediawiki/extensions/Scribunto/+/a8ef8791cdd7e19a47243e27e9236d7777a01717/includes/engines/LuaCommon/UstringLibrary.php#319 [19:20:47] at line 304 it checks for 'number' not 'int' [19:20:50] !log gitlab1001: brief downtime for testing reconfiguration of cas3.session_duration [19:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:09] twentyafterfour: i think that's correct. https://gerrit.wikimedia.org/g/mediawiki/extensions/Scribunto/+/a8ef8791cdd7e19a47243e27e9236d7777a01717/includes/engines/LuaCommon/LibraryBase.php#141 compares that with lua type, not php type. 
https://gerrit.wikimedia.org/g/mediawiki/extensions/Scribunto/+/a8ef8791cdd7e19a47243e27e9236d7777a01717/includes/engines/LuaCommon/LibraryBase.php#106 maps int to number. [19:22:38] hmmm [19:23:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:24:06] 10SRE, 10Gerrit, 10GitLab, 10Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (10brennen) Confirmed working for a couple of us, thanks again. [19:27:23] I'd say just casting to int wouldn't be the worst idea, when the parameter is_numeric [19:30:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:34:52] (03PS1) 10Herron: thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) [19:35:36] (03CR) 10jerkins-bot: [V: 04-1] thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron) [19:37:14] (03PS2) 10Herron: thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) [19:39:47] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10Papaul) @lmata will do [19:42:48] (03PS2) 10Herron: thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) [19:44:55] (03PS3) 10Herron: thanos: add recording rules 
for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) [19:45:10] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-htriedman-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:00] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:07] (03CR) 10Herron: thanos: add recording rules for etcd error slo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [19:49:55] (03PS4) 10Herron: thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) [19:51:57] (03CR) 10Herron: thanos: add recording rules for etcd error slo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [19:56:03] (03CR) 10Bstorm: [C: 03+2] P::toolforge::redis_sentinel: Block REPLICAOF too [puppet] - 10https://gerrit.wikimedia.org/r/715703 (owner: 10Majavah) [19:57:01] I guess it's probably good to deploy to group 0? 
I don't see anything terrible happening [19:57:30] looks ok [19:57:46] (03PS1) 1020after4: group0 wikis to 1.37.0-wmf.21 refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715785 [19:57:48] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.37.0-wmf.21 refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715785 (owner: 1020after4) [19:58:51] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.21 refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715785 (owner: 1020after4) [20:00:30] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.21 refs T281161 [20:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:36] T281161: 1.37.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T281161 [20:11:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:54] (03CR) 10Legoktm: rsync::quickdatacopy: Allow specifying a custom interval for auto_sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm) [20:16:01] (03PS2) 10Legoktm: rsync::quickdatacopy: Allow having multiple destination hosts [puppet] - 10https://gerrit.wikimedia.org/r/715636 [20:16:03] (03PS3) 10Legoktm: rsync::quickdatacopy: Allow specifying a custom interval for auto_sync [puppet] - 10https://gerrit.wikimedia.org/r/715637 [20:16:05] (03PS3) 10Legoktm: [WIP] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) [20:17:53] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Jclark-ctr) cloudcephosd1021 C8 u31. port 0/1 cableid 11034/11032 cloudsw2-c8-eqiad cloudcephosd1022 C8 u32. 
port 2/3 cableid 11033/11031 cloudsw... [20:18:16] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Jclark-ctr) [20:18:37] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [20:19:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:00] (03PS4) 10Legoktm: [WIP] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) [20:38:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:40:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:47:03] (03CR) 10Legoktm: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/30943/" [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: 10Legoktm) [20:48:14] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30944/console" [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm) [20:51:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:59:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:00:26] (03CR) 10Dduvall: [C: 03+1] "Cherry picked on gitlab-runners-puppetmaster-01.gitlab-runners.eqiad1.wikimedia.cloud and successfully tested on runner-1002." 
[puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall) [21:07:06] (03CR) 10BryanDavis: toolhub: Add mcrouter sidecar for memcached access (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [21:19:36] (03PS4) 10BryanDavis: toolhub: Set pod requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/715604 (https://phabricator.wikimedia.org/T280881) [21:19:38] (03PS3) 10BryanDavis: toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) [21:19:40] (03PS6) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) [21:20:46] RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 0.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:21:59] (03CR) 10jerkins-bot: [V: 04-1] toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [21:29:29] (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/715777 (owner: 10PipelineBot) [21:30:15] (03CR) 10BryanDavis: [C: 03+2] "PS4 is trivial rebasing changes of PS3 which got a +1 from Effie." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/715604 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [21:32:49] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/715777 (owner: 10PipelineBot) [21:33:05] (03Merged) 10jenkins-bot: toolhub: Set pod requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/715604 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [21:34:26] !log dduvall@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [21:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:51] !log dduvall@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [21:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:39] !log dduvall@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [21:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:42] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm) [21:48:43] (03CR) 10Cwhite: [C: 03+1] "One comment inline, otherwise this looks good! Thanks!" [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/715457 (https://phabricator.wikimedia.org/T210137) (owner: 10Filippo Giunchedi) [21:52:24] (03CR) 10Cwhite: [C: 03+1] thanos: add thanos::recording_rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron) [21:55:22] (03CR) 10Cwhite: [C: 03+1] "Excellent commit message. It clearly outlined the problem and at what stage of resolution this is." 
[puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [22:04:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:06:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:11:44] (03CR) 10Legoktm: [C: 03+2] mailman: Drop lists3 role [puppet] - 10https://gerrit.wikimedia.org/r/698306 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [22:18:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:20:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:29:48] (03PS1) 10Bartosz Dziewoński: Fix wgDiscussionTools_sourcemodetoolbar settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715804 [22:35:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:39:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:42:34] !log mforns@deploy1002 Started deploy [analytics/refinery@a0f039b]: Regular analytics weekly train v0.1.17 [analytics/refinery@a0f039b] [22:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:45:37] (03PS4) 10BryanDavis: toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) [22:45:39] (03PS7) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) [22:47:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:47:25] (03CR) 10jerkins-bot: [V: 04-1] toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [22:52:50] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [22:53:02] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [22:53:24] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [22:54:18] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [22:58:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:00:05] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210831T2300). [23:00:05] dpifke and MatmaRex: A patch you scheduled for Evening backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:06] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [23:00:14] !log mforns@deploy1002 Finished deploy [analytics/refinery@a0f039b]: Regular analytics weekly train v0.1.17 [analytics/refinery@a0f039b] (duration: 17m 39s) [23:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:32] hiii [23:00:36] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [23:00:37] Hi MatmaRex [23:00:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:00:43] And hi dpifke [23:00:43] !log mforns@deploy1002 Started deploy [analytics/refinery@a0f039b] (thin): Regular analytics weekly train THIN v0.1.17 [analytics/refinery@a0f039b] [23:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:48] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [23:00:50] !log mforns@deploy1002 Finished deploy [analytics/refinery@a0f039b] (thin): Regular analytics weekly train THIN v0.1.17 [analytics/refinery@a0f039b] (duration: 00m 07s) [23:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:10] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [23:01:12] !log mforns@deploy1002 Started deploy [analytics/refinery@a0f039b] (hadoop-test): Regular analytics weekly train TEST v0.1.17 [analytics/refinery@a0f039b] [23:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:21] (03CR) 10Urbanecm: [C: 03+2] Fix wgDiscussionTools_sourcemodetoolbar settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715804 (owner: 10Bartosz Dziewoński) [23:02:07] (03Merged) 10jenkins-bot: Fix wgDiscussionTools_sourcemodetoolbar settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715804 (owner: 10Bartosz Dziewoński) [23:03:13] MatmaRex: available at mwdebug2001, please review [23:03:43] looking [23:04:57] yeah, seems as expected [23:05:07] great, syncing [23:05:29] i got distracted by the fact that ko.wikipedia apparently has non-monospace font in the editor [23:05:58] :) [23:06:12] (03PS2) 10Urbanecm: Enable link recommendation frontent in dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715763 (https://phabricator.wikimedia.org/T288420) [23:06:16] (03CR) 10Urbanecm: [C: 03+2] Enable link recommendation frontent in dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715763 (https://phabricator.wikimedia.org/T288420) (owner: 10Urbanecm) [23:07:10] (03Merged) 10jenkins-bot: Enable link recommendation frontent in dewiki and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715763 (https://phabricator.wikimedia.org/T288420) (owner: 10Urbanecm) [23:07:17] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8997ae5d0b998839853aed2b246f5c88fe9d83eb: Fix wgDiscussionTools_sourcemodetoolbar settings (duration: 01m 22s) [23:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:24] MatmaRex: should be live. Enjoy! [23:07:34] thanks [23:08:00] any time. 
[23:08:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:59] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1437d99c1884c0695f02b81b724ec82a2bd3362e: Enable link recommendation frontent in dewiki and nlwiki (T288420, T285254) (duration: 01m 06s) [23:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:03] T285254: Deploy Growth features on Dutch Wikipedia - https://phabricator.wikimedia.org/T285254 [23:09:03] T288420: Deploy Growth features on German Wikipedia - https://phabricator.wikimedia.org/T288420 [23:09:08] dpifke: hi, do you want to self-deploy? [23:09:12] (if so, go ahead) [23:14:30] urbanecm: Yes, doing now. (Sorry, got pulled away for a bit.) [23:14:55] !log mforns@deploy1002 Finished deploy [analytics/refinery@a0f039b] (hadoop-test): Regular analytics weekly train TEST v0.1.17 [analytics/refinery@a0f039b] (duration: 13m 42s) [23:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:14] dpifke: np. 
So, I'm disconnecting from prod and leaving you to do your stuff :-) [23:15:32] (CR) Dave Pifke: [C: +2] profiler: use seperate pipeline inside k8s pods [mediawiki-config] - https://gerrit.wikimedia.org/r/711580 (https://phabricator.wikimedia.org/T288165) (owner: Dave Pifke) [23:15:49] !log failed deployment of refinery (v0.1.17) to an-test-coord1001.eqiad.wmnet (scap error) [23:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:19] (Merged) jenkins-bot: profiler: use seperate pipeline inside k8s pods [mediawiki-config] - https://gerrit.wikimedia.org/r/711580 (https://phabricator.wikimedia.org/T288165) (owner: Dave Pifke) [23:17:44] Going to test on mwdebug2001 first. [23:22:40] Looks OK, pushing further. [23:23:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:45] !log dpifke@deploy1002 scap failed: average error rate on 3/6 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/83629bcb5560d11e61d3085c89dd9ed6 for details) [23:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:11] Looking in Logstash... [23:28:04] (CR) BryanDavis: toolhub: Add helmfile.d (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: BryanDavis) [23:29:13] Not sure why that looked good on mwdebug, it's broken. Reverting. 
[23:30:01] (PS1) Dave Pifke: Revert "profiler: use seperate pipeline inside k8s pods" [mediawiki-config] - https://gerrit.wikimedia.org/r/715807 [23:30:23] (CR) Dave Pifke: [C: +2] Revert "profiler: use seperate pipeline inside k8s pods" [mediawiki-config] - https://gerrit.wikimedia.org/r/715807 (owner: Dave Pifke) [23:31:07] (Merged) jenkins-bot: Revert "profiler: use seperate pipeline inside k8s pods" [mediawiki-config] - https://gerrit.wikimedia.org/r/715807 (owner: Dave Pifke) [23:31:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:02] !log dpifke@deploy1002 Synchronized wmf-config/profiler.php: Revert excimer-k8s pipelines T288165 (duration: 01m 14s) [23:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:06] T288165: Create separate ArcLamp pipeline for k8s-mwdebug - https://phabricator.wikimedia.org/T288165 [23:33:48] OK, I'm done for today. Will debug the patch and try again tomorrow. [23:35:13] (CR) BryanDavis: toolhub: Add helmfile.d (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: BryanDavis) [23:37:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:38:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:41:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log