[00:11:03] (PS3) Juan90264: Repair the size of the logo of Kashmiri Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/731231
[00:30:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:49] (PS4) Juan90264: Repair the size of the logo of Kashmiri Wikipedia and Kashmiri Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/731231 (https://phabricator.wikimedia.org/T293373)
[00:50:59] PROBLEM - Disk space on aqs1013 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra-b 134322 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1013&var-datasource=eqiad+prometheus/ops
[01:37:36] (PS5) Juan90264: Repair the size of the logo of Kashmiri Wikipedia and Kashmiri Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/731231 (https://phabricator.wikimedia.org/T293373)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211017T0700)
[08:39:01] PROBLEM - Check systemd state on aqs1013 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-b.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:21] PROBLEM - cassandra-b CQL 10.64.32.147:9042 on aqs1013 is CRITICAL: connect to address 10.64.32.147 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[08:39:29] PROBLEM - cassandra-b service on aqs1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:41:05] RECOVERY - Check systemd state on aqs1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:41:33] RECOVERY - cassandra-b service on aqs1013 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:25:17] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[13:12:11] SRE, Wikimedia-Mailing-lists, I18n: mailman3 encoding issues on unsubscription emails - https://phabricator.wikimedia.org/T290613 (MarcoAurelio) Sorry for the late reply. Mi client is gmail web interface and his was Yahoo. The mail with the weird encoding was however received from lists.wikimedia.org...
[13:42:35] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 352 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:46:45] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:45:45] PROBLEM - cassandra-b service on aqs1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:45:51] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by media
[15:45:51] returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:45:55] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by media
[15:45:55] returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:46:27] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by media
[15:46:27] returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:46:27] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by media
[15:46:27] returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:46:29] PROBLEM - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by media
[15:46:29] returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:46:33] PROBLEM - Check systemd state on aqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-b.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:41] PROBLEM - cassandra-b CQL 10.64.32.145:9042 on aqs1012 is CRITICAL: connect to address 10.64.32.145 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:47:23] PROBLEM - aqs endpoints health on aqs1010 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by media
[15:47:23] returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:59:53] RECOVERY - cassandra-b service on aqs1012 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:00:39] RECOVERY - Check systemd state on aqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:06:35] these are new AQS nodes, not yet serving traffic, going to downtime them
[16:08:16] done
[16:18:25] (PS1) Elukey: profile::query_service::monitor::wikidata: update streaming lag monitor [puppet] - https://gerrit.wikimedia.org/r/731282 (https://phabricator.wikimedia.org/T288231)
[16:19:45] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31727/console" [puppet] - https://gerrit.wikimedia.org/r/731282 (https://phabricator.wikimedia.org/T288231) (owner: Elukey)
[16:25:42] (CR) Elukey: [V: +1] "I wanted to merge this to fix the monitors, but the current values of the metric vs the current thresholds seem inconsistent. If we merge " [puppet] - https://gerrit.wikimedia.org/r/731282 (https://phabricator.wikimedia.org/T288231) (owner: Elukey)
[17:12:24] (CR) DCausse: profile::query_service::monitor::wikidata: update streaming lag monitor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/731282 (https://phabricator.wikimedia.org/T288231) (owner: Elukey)
[17:53:35] (PS1) Ladsgroup: lists: Split ferm and monitoring of profile::lists [puppet] - https://gerrit.wikimedia.org/r/731286 (https://phabricator.wikimedia.org/T282303)
[17:55:57] (PS2) Ladsgroup: lists: Split ferm and monitoring of profile::lists [puppet] - https://gerrit.wikimedia.org/r/731286 (https://phabricator.wikimedia.org/T282303)
[17:57:42] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/731286 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[18:20:42] (PS3) Ladsgroup: lists: Split ferm and monitoring of profile::lists [puppet] - https://gerrit.wikimedia.org/r/731286 (https://phabricator.wikimedia.org/T282303)
[18:21:45] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/731286 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[18:24:13] (PS4) Ladsgroup: lists: Split ferm and monitoring of profile::lists [puppet] - https://gerrit.wikimedia.org/r/731286 (https://phabricator.wikimedia.org/T282303)
[18:30:35] (PS2) Elukey: profile::query_service::monitor::wikidata: update streaming lag monitor [puppet] - https://gerrit.wikimedia.org/r/731282 (https://phabricator.wikimedia.org/T288231)
[18:31:42] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31728/console" [puppet] - https://gerrit.wikimedia.org/r/731282 (https://phabricator.wikimedia.org/T288231) (owner: Elukey)
[18:31:53] (CR) Elukey: [V: +1] profile::query_service::monitor::wikidata: update streaming lag monitor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/731282 (https://phabricator.wikimedia.org/T288231) (owner: Elukey)
[18:40:38] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/731286 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[18:42:22] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/1030/" [puppet] - https://gerrit.wikimedia.org/r/731286 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[18:51:21] (PS1) Ideophagous: updated arywiki namespaces as per T291737 [mediawiki-config] - https://gerrit.wikimedia.org/r/731290
[19:52:34] Hey all - was going to try to deploy a security patch for T293556 soon. I'd let it sit for tomorrow's window, but I'd rather not let it go any longer.
[20:03:53] (CR) DCausse: [C: +1] "thanks! :)" [puppet] - https://gerrit.wikimedia.org/r/731282 (https://phabricator.wikimedia.org/T288231) (owner: Elukey)
[20:08:23] ...and my patch for T293556 didn't fix the issue. I've reverted on deployment and mwdebug1002. Patch was never deployed.
[20:11:53] (CR) Ryan Kemper: [C: +2] profile::query_service::monitor::wikidata: update streaming lag monitor [puppet] - https://gerrit.wikimedia.org/r/731282 (https://phabricator.wikimedia.org/T288231) (owner: Elukey)
[20:51:33] RECOVERY - WDQS high update lag on wdqs2008 is OK: (C)3.6e+06 ge (W)1.2e+06 ge 7.205e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[21:57:05] SRE, Traffic, Beta-Cluster-reproducible, HTTPS: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (AlexisJazz)
[21:57:13] SRE, Beta-Cluster-Infrastructure, Traffic, HTTPS: The certificate for upload.beta.wmflabs.org expired on November 13, 2020. - https://phabricator.wikimedia.org/T267858 (AlexisJazz)
[22:04:34] SRE, Traffic, Beta-Cluster-reproducible, HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on October 9, 2021. - https://phabricator.wikimedia.org/T293251 (AlexisJazz)
[22:04:50] SRE, Beta-Cluster-Infrastructure, Traffic, HTTPS: Beta cluster certificates have expired - https://phabricator.wikimedia.org/T262806 (AlexisJazz)