[00:16:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:27:16] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[00:31:46] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[00:52:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:54:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:09:40] PROBLEM - cassandra-a CQL 10.64.48.65:9042 on aqs1014 is CRITICAL: connect to address 10.64.48.65 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[01:11:54] RECOVERY - cassandra-a CQL 10.64.48.65:9042 on aqs1014 is OK: TCP OK - 0.000 second response time on 10.64.48.65 port 9042 https://phabricator.wikimedia.org/T93886
[01:15:00] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{refe
[01:15:00] ent}/{file_path}/{granularity}/{start}/{end} (Get per file 
requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:17:08] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:23:10] (PS1) Zabe: Add a logo for pwnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/750805 (https://phabricator.wikimedia.org/T298438)
[01:25:14] (PS1) Zabe: Add a logo for amiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/750826 (https://phabricator.wikimedia.org/T298439)
[01:34:40] PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:56:02] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:00:30] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[03:09:02] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:11:04] (PS2) KartikMistry: Set ContentTranslationContentImportForSectionTranslation for SX [mediawiki-config] - https://gerrit.wikimedia.org/r/747794 (https://phabricator.wikimedia.org/T294642)
[03:23:46] PROBLEM - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[03:28:08] RECOVERY - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is OK: TCP OK - 0.001 second response time on 10.64.0.120 port 9042 https://phabricator.wikimedia.org/T93886
[03:37:02] RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 
2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:59:28] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:13:20] PROBLEM - cassandra-a CQL 10.64.32.128:9042 on aqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[04:21:01] !log start of running populating actor in revision table on rest of sections. It will take two months to finish (T275246)
[04:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:21:06] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246
[04:29:02] RECOVERY - cassandra-a CQL 10.64.32.128:9042 on aqs1012 is OK: TCP OK - 0.000 second response time on 10.64.32.128 port 9042 https://phabricator.wikimedia.org/T93886
[04:55:58] (PS1) Ladsgroup: Full roll out of wgMaxExecutionTimeForExpensiveQueries [mediawiki-config] - https://gerrit.wikimedia.org/r/750831 (https://phabricator.wikimedia.org/T297708)
[05:00:36] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:02:42] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:08] PROBLEM - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[05:14:14] RECOVERY - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is OK: TCP OK - 0.000 second response time on 10.64.0.120 port 9042 https://phabricator.wikimedia.org/T93886
[05:16:30] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top 
page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:17:20] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:17:28] PROBLEM - aqs endpoints health on aqs1010 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:17:30] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:17:48] PROBLEM - cassandra-b CQL 10.64.16.206:9042 on aqs1011 is CRITICAL: connect to address 10.64.16.206 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[05:17:48] PROBLEM - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:18:10] PROBLEM - cassandra-b service on aqs1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:18:18] PROBLEM - Check systemd state on aqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-b.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:18:58] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:31:42] RECOVERY - cassandra-b service on aqs1011 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:31:52] RECOVERY - Check systemd state on aqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:38] RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:34:36] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:34:48] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:35:22] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:35:42] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:35:42] RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:35:52] RECOVERY - cassandra-b CQL 10.64.16.206:9042 on aqs1011 is OK: TCP OK - 0.000 second response time on 10.64.16.206 port 9042 https://phabricator.wikimedia.org/T93886
[05:47:07] SRE, ops-codfw, DBA: Degraded RAID on db2147 - https://phabricator.wikimedia.org/T298301 (Marostegui) a:Papaul @Papaul can we get a replacement disk? Thanks!
[06:10:26] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 2 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui) I would prefer either `mainstash` or `wikishared`, but I don't have any strong opinions about any of...
[06:12:40] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:17:44] PROBLEM - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[06:19:50] RECOVERY - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is OK: TCP OK - 0.000 second response time on 10.64.0.120 port 9042 https://phabricator.wikimedia.org/T93886
[06:24:30] PROBLEM - cassandra-a CQL 10.64.32.128:9042 on aqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[06:28:48] RECOVERY - cassandra-a CQL 10.64.32.128:9042 on aqs1012 is OK: TCP OK - 0.000 second response time on 10.64.32.128 port 9042 https://phabricator.wikimedia.org/T93886
[06:53:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db[2077,2095].codfw.wmnet with reason: Maintenance
[06:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db[2077,2095].codfw.wmnet with reason: Maintenance
[06:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:32] (CR) Ladsgroup: [C: +2] "deploying" [mediawiki-config] - https://gerrit.wikimedia.org/r/750831 (https://phabricator.wikimedia.org/T297708) (owner: Ladsgroup)
[06:53:57] (CR) Ladsgroup: Full roll out of wgMaxExecutionTimeForExpensiveQueries [mediawiki-config] - https://gerrit.wikimedia.org/r/750831 
(https://phabricator.wikimedia.org/T297708) (owner: Ladsgroup)
[06:54:23] (PS2) Ladsgroup: Full roll out of wgMaxExecutionTimeForExpensiveQueries [mediawiki-config] - https://gerrit.wikimedia.org/r/750831 (https://phabricator.wikimedia.org/T297708)
[06:55:57] (CR) Ladsgroup: [C: +2] Full roll out of wgMaxExecutionTimeForExpensiveQueries [mediawiki-config] - https://gerrit.wikimedia.org/r/750831 (https://phabricator.wikimedia.org/T297708) (owner: Ladsgroup)
[06:56:45] (Merged) jenkins-bot: Full roll out of wgMaxExecutionTimeForExpensiveQueries [mediawiki-config] - https://gerrit.wikimedia.org/r/750831 (https://phabricator.wikimedia.org/T297708) (owner: Ladsgroup)
[07:00:28] !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:750831|Full roll out of wgMaxExecutionTimeForExpensiveQueries (T297708)]], Part I (duration: 00m 58s)
[07:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:32] T297708: Set max execution time for several expensive mediawiki actions - https://phabricator.wikimedia.org/T297708
[07:01:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[07:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:08] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:750831|Full roll out of wgMaxExecutionTimeForExpensiveQueries (T297708)]], Part I (duration: 01m 20s)
[07:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[07:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[07:03:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[07:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:02] SRE, DBA, Platform Engineering, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (Ladsgroup)
[07:09:42] (CR) Ladsgroup: [V: +2 C: +2] puppetmaster::gitsync: remove absented crons and logrotate::conf [puppet] - https://gerrit.wikimedia.org/r/750251 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:09:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[07:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[07:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[07:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[07:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2086.codfw.wmnet with reason: Maintenance
[07:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2086.codfw.wmnet with reason: Maintenance
[07:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db2087.codfw.wmnet with reason: Maintenance
[07:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2087.codfw.wmnet with reason: Maintenance
[07:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance
[07:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance
[07:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:14] ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T298459 (ayounsi) p:Triage→Medium
[07:43:14] PROBLEM - graphite.wikimedia.org api on graphite2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.109 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[07:45:26] RECOVERY - graphite.wikimedia.org api on graphite2003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[07:46:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[07:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[07:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:08] !log draining primary and secondary instances off ganeti2023
[07:51:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:12] PROBLEM - cassandra-a CQL 10.64.32.128:9042 on aqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[07:54:18] RECOVERY - cassandra-a CQL 10.64.32.128:9042 on aqs1012 is OK: TCP OK - 0.005 second response time on 10.64.32.128 port 9042 https://phabricator.wikimedia.org/T93886
[07:54:40] PROBLEM - Ganeti memory on ganeti2027 is CRITICAL: CRIT Memory 97% used. Largest process: qemu-system-x86 (38175) = 25.5% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[07:56:08] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:01:28] (PS1) Muehlenhoff: Remove accesss for josepita [puppet] - https://gerrit.wikimedia.org/r/751065
[08:02:59] (PS1) Marostegui: Revert "pc2014: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/750813
[08:03:38] (CR) Marostegui: [C: +2] Revert "pc2014: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/750813 (owner: Marostegui)
[08:04:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance
[08:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance
[08:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2118.codfw.wmnet with reason: Maintenance
[08:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2118.codfw.wmnet with reason: Maintenance
[08:07:24] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:42] (CR) Muehlenhoff: [C: +2] Remove accesss for josepita [puppet] - https://gerrit.wikimedia.org/r/751065 (owner: Muehlenhoff)
[08:17:00] PROBLEM - cassandra-a CQL 10.64.32.128:9042 on aqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[08:25:02] !log installing zziplib security updates
[08:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:36] RECOVERY - cassandra-a CQL 10.64.32.128:9042 on aqs1012 is OK: TCP OK - 3.046 second response time on 10.64.32.128 port 9042 https://phabricator.wikimedia.org/T93886
[08:28:56] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[08:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance
[08:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance
[08:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:05] (PS1) Jelto: deployment_server: remove obsolete value helmBinary [puppet] - https://gerrit.wikimedia.org/r/751067 (https://phabricator.wikimedia.org/T251305)
[08:47:19] SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[08:49:30] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33091/console" [puppet] - https://gerrit.wikimedia.org/r/751067 
(https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[08:49:38] !log installing libpcap security updates
[08:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove special slaves from s2 codfw T263127', diff saved to https://phabricator.wikimedia.org/P18267 and previous config saved to /var/cache/conftool/dbconfig/20220103-085428-marostegui.json
[08:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:31] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[08:57:59] (PS6) Giuseppe Lavagetto: Introduce the ClusterConfig class [mediawiki-config] - https://gerrit.wikimedia.org/r/749717
[08:58:01] (PS3) Giuseppe Lavagetto: Simplify management of the request time limit [mediawiki-config] - https://gerrit.wikimedia.org/r/749718
[08:58:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove contributions and logpager from s2 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18268 and previous config saved to /var/cache/conftool/dbconfig/20220103-085824-marostegui.json
[08:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:48] (CR) jerkins-bot: [V: -1] Introduce the ClusterConfig class [mediawiki-config] - https://gerrit.wikimedia.org/r/749717 (owner: Giuseppe Lavagetto)
[08:58:56] (CR) jerkins-bot: [V: -1] Simplify management of the request time limit [mediawiki-config] - https://gerrit.wikimedia.org/r/749718 (owner: Giuseppe Lavagetto)
[08:59:45] (CR) Giuseppe Lavagetto: Introduce the ClusterConfig class (7 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/749717 (owner: Giuseppe Lavagetto)
[09:00:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:02:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:04:29] (PS7) Giuseppe Lavagetto: Introduce the ClusterConfig class [mediawiki-config] - https://gerrit.wikimedia.org/r/749717
[09:04:31] (PS4) Giuseppe Lavagetto: Simplify management of the request time limit [mediawiki-config] - https://gerrit.wikimedia.org/r/749718
[09:05:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance
[09:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance
[09:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:42] PROBLEM - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[09:14:00] RECOVERY - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is OK: TCP OK - 0.000 second response time on 10.64.0.120 port 9042 https://phabricator.wikimedia.org/T93886
[09:15:49] SRE, ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (ayounsi) Great thanks! I updated Netbox to reflect reality (as required so automation can work), and pushed its initial config. Could you connect the mgmt port (em0) to ge-0/0/0 (to itself). Then we...
[09:16:12] SRE, ops-codfw, Traffic: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (Vgutierrez) System Event Log shows a failure on DIMM A1: ` ------------------------------------------------------------------------------- Record: 49 Date/Time: 12/24/2021 03:32:47 Sourc...
[09:24:40] !log installing djvulibre security updates on buster
[09:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:07] (CR) Giuseppe Lavagetto: [C: +2] profile::base: introduce class memory_cgroup [puppet] - https://gerrit.wikimedia.org/r/749523 (owner: Giuseppe Lavagetto)
[09:29:04] (PS1) Jelto: charts: update charts to api v2 [deployment-charts] - https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750)
[09:29:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[09:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[09:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T277354)', diff saved to https://phabricator.wikimedia.org/P18269 and previous config saved to /var/cache/conftool/dbconfig/20220103-093003-marostegui.json
[09:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:06] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[09:40:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[09:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:05] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[09:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:04] (PS6) Giuseppe Lavagetto: deployment_server: add docker engine [puppet] - https://gerrit.wikimedia.org/r/749508 (https://phabricator.wikimedia.org/T297673)
[09:59:09] !log installing ruby2.3 security updates
[09:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:37] (PS1) David Caro: alternatives::install: remove unused module [puppet] - https://gerrit.wikimedia.org/r/751072 (https://phabricator.wikimedia.org/T272559)
[10:04:59] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (dcaro)
[10:11:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[10:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[10:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T297094)', diff saved to https://phabricator.wikimedia.org/P18270 and previous config saved to /var/cache/conftool/dbconfig/20220103-101116-marostegui.json
[10:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:19] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094
[10:11:36] (PS1) David Caro: apparmor::hardlink: remove unused module [puppet] - https://gerrit.wikimedia.org/r/751073 (https://phabricator.wikimedia.org/T272559)
[10:12:26] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources 
audit, 2021 - https://phabricator.wikimedia.org/T272559 (dcaro)
[10:15:19] (PS1) Giuseppe Lavagetto: Copy hiera private data for merging Id10dbe7d244ab9b8 [labs/private] - https://gerrit.wikimedia.org/r/751074
[10:19:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host build2001.codfw.wmnet
[10:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:09] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Copy hiera private data for merging Id10dbe7d244ab9b8 [labs/private] - https://gerrit.wikimedia.org/r/751074 (owner: Giuseppe Lavagetto)
[10:20:36] !log powercycle an-worker1120 (CPU soft lockup errors in mgmt console)
[10:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:18] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33093/console" [puppet] - https://gerrit.wikimedia.org/r/749508 (https://phabricator.wikimedia.org/T297673) (owner: Giuseppe Lavagetto)
[10:21:34] SRE, Infrastructure-Foundations: Setup a new build host based on bullseye - https://phabricator.wikimedia.org/T298463 (MoritzMuehlenhoff)
[10:22:39] !log powercycle an-worker1114 (CPU soft lockup errors in mgmt console)
[10:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:00] RECOVERY - Host an-worker1120 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[10:24:39] (PS2) Jelto: charts: update charts to api v2 [deployment-charts] - https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750)
[10:24:45] (CR) Giuseppe Lavagetto: [V: +1 C: +2] deployment_server: add docker engine [puppet] - https://gerrit.wikimedia.org/r/749508 (https://phabricator.wikimedia.org/T297673) (owner: Giuseppe Lavagetto)
[10:25:06] RECOVERY - Check systemd state on an-worker1120 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:25:44] RECOVERY - SSH on an-worker1120 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:27:28] RECOVERY - Host an-worker1114 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[10:28:24] RECOVERY - puppet last run on an-worker1120 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:29:42] (PS1) David Caro: aptrepo::distribution: remove unused module [puppet] - https://gerrit.wikimedia.org/r/751078 (https://phabricator.wikimedia.org/T272559)
[10:30:42] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (dcaro)
[10:31:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T297094)', diff saved to https://phabricator.wikimedia.org/P18271 and previous config saved to /var/cache/conftool/dbconfig/20220103-103116-marostegui.json
[10:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:20] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094
[10:32:01] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single for host deploy2002.codfw.wmnet
[10:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:29] jouncebot: next
[10:32:29] In 1 hour(s) and 27 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220103T1200)
[10:32:36] ok I have time
[10:35:37] (CR) Muehlenhoff: [C: +1] "This was in fact only needed for the old jessie parsoid debs and I don't believe we'll need that functionality for anything else in the ne" [puppet] - https://gerrit.wikimedia.org/r/751078 (https://phabricator.wikimedia.org/T272559) (owner: 
10David Caro) [10:39:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchangeslinked from s2 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18272 and previous config saved to /var/cache/conftool/dbconfig/20220103-103909-marostegui.json [10:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:15] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [10:41:55] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy2002.codfw.wmnet [10:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:47] (03CR) 10David Caro: [C: 03+2] aptrepo::distribution: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751078 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:44:24] (03CR) 10Hashar: [C: 03+1] alternatives::install: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751072 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:46:01] (03PS1) 10David Caro: b:h:j:metatstore: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751080 (https://phabricator.wikimedia.org/T272559) [10:46:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P18273 and previous config saved to /var/cache/conftool/dbconfig/20220103-104621-marostegui.json [10:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:51] (03PS2) 10David Caro: b:h:j:{metatstore,server}: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751080 (https://phabricator.wikimedia.org/T272559) [10:50:06] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [10:51:37] jouncebot: next [10:51:37] In 1 hour(s) and 8 minute(s): UTC morning backport 
window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220103T1200) [10:51:54] I'm about to reboot deploy1002 in order to install docker properly [10:54:18] (03PS1) 10David Caro: b:hadoop:httpfs: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751082 (https://phabricator.wikimedia.org/T272559) [10:55:06] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [10:57:06] (03PS1) 10David Caro: bigtop:spark: remove unused modules [puppet] - 10https://gerrit.wikimedia.org/r/751083 (https://phabricator.wikimedia.org/T272559) [10:57:44] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [10:59:58] PROBLEM - aqs endpoints health on aqs1010 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:00:21] the node is depooled --^ [11:00:29] (03PS1) 10David Caro: c:kafka:broker:jmxtrans: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751085 (https://phabricator.wikimedia.org/T272559) [11:01:09] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:01:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P18274 and previous config saved to /var/cache/conftool/dbconfig/20220103-110126-marostegui.json [11:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:10] RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs 
[11:05:43] (03PS1) 10David Caro: c:kafka::mirrors: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751086 (https://phabricator.wikimedia.org/T272559) [11:06:39] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:08:06] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:09:33] jouncebot: next [11:09:33] In 0 hour(s) and 50 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220103T1200) [11:11:20] (03PS1) 10David Caro: elasticsearch:decommission: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751088 (https://phabricator.wikimedia.org/T272559) [11:11:59] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:13:41] (03PS1) 10David Caro: e:sevice:consumer: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751089 (https://phabricator.wikimedia.org/T272559) [11:14:27] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:14:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T277354)', diff saved to https://phabricator.wikimedia.org/P18275 and previous config saved to /var/cache/conftool/dbconfig/20220103-111457-marostegui.json [11:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:01] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [11:15:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 
(10dcaro) [11:16:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T297094)', diff saved to https://phabricator.wikimedia.org/P18276 and previous config saved to /var/cache/conftool/dbconfig/20220103-111631-marostegui.json [11:16:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:16:34] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094 [11:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T297094)', diff saved to https://phabricator.wikimedia.org/P18277 and previous config saved to /var/cache/conftool/dbconfig/20220103-111638-marostegui.json [11:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:42] !log oblivian@cumin2002 START - Cookbook sre.hosts.reboot-single for host deploy1002.eqiad.wmnet [11:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:23] (03PS1) 10David Caro: geoip:data:package: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751091 (https://phabricator.wikimedia.org/T272559) [11:22:04] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:26:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T297094)', diff saved to https://phabricator.wikimedia.org/P18278 and previous config saved to 
/var/cache/conftool/dbconfig/20220103-112617-marostegui.json [11:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:21] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094 [11:26:26] (03PS1) 10David Caro: html5depurate: remove unused role and modules [puppet] - 10https://gerrit.wikimedia.org/r/751093 (https://phabricator.wikimedia.org/T272559) [11:27:10] !log oblivian@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy1002.eqiad.wmnet [11:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:18] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-patches-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:29] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:27:36] PROBLEM - cassandra-b CQL 10.64.48.69:9042 on aqs1015 is CRITICAL: connect to address 10.64.48.69 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [11:27:50] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-patches-releases-primary.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:55] (03PS1) 10Muehlenhoff: Record extended account date for sannita [puppet] - 10https://gerrit.wikimedia.org/r/751094 [11:29:20] (03PS1) 10David Caro: icinga:nsca:client: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751095 (https://phabricator.wikimedia.org/T272559) [11:29:48] !log restart cassandra-b on aqs1010 and aqs1015 (instances stuck / trashing, new cluster, not serving live traffic atm) [11:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:50] RECOVERY 
- cassandra-b CQL 10.64.48.69:9042 on aqs1015 is OK: TCP OK - 0.000 second response time on 10.64.48.69 port 9042 https://phabricator.wikimedia.org/T93886 [11:29:56] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:29:58] (03PS2) 10Giuseppe Lavagetto: profile:k8s::deployment_server::mediawiki: split in subprofiles [puppet] - 10https://gerrit.wikimedia.org/r/749552 (https://phabricator.wikimedia.org/T297673) [11:30:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P18279 and previous config saved to /var/cache/conftool/dbconfig/20220103-113002-marostegui.json [11:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:06] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:30] (03CR) 10Muehlenhoff: [C: 03+2] Record extended account date for sannita [puppet] - 10https://gerrit.wikimedia.org/r/751094 (owner: 10Muehlenhoff) [11:31:50] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:44] (03PS1) 10David Caro: identd: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751098 (https://phabricator.wikimedia.org/T272559) [11:33:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/751093 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [11:34:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/751091 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [11:35:24] (03PS1) 10Majavah: kerberos: manage users with custom puppet type [puppet] - 10https://gerrit.wikimedia.org/r/751100 
(https://phabricator.wikimedia.org/T292389) [11:35:52] (03PS2) 10Majavah: kerberos: manage users with custom puppet type [puppet] - 10https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) [11:36:00] (03PS3) 10Giuseppe Lavagetto: profile:k8s::deployment_server::mediawiki: split in subprofiles [puppet] - 10https://gerrit.wikimedia.org/r/749552 (https://phabricator.wikimedia.org/T297673) [11:36:02] (03PS2) 10Giuseppe Lavagetto: kubernetes::deployment_server::mediawiki: add builder user/role [puppet] - 10https://gerrit.wikimedia.org/r/749553 (https://phabricator.wikimedia.org/T297673) [11:37:04] (03CR) 10jerkins-bot: [V: 04-1] kubernetes::deployment_server::mediawiki: add builder user/role [puppet] - 10https://gerrit.wikimedia.org/r/749553 (https://phabricator.wikimedia.org/T297673) (owner: 10Giuseppe Lavagetto) [11:37:06] (03CR) 10jerkins-bot: [V: 04-1] kerberos: manage users with custom puppet type [puppet] - 10https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [11:37:40] RECOVERY - Ganeti memory on ganeti2027 is OK: OK Memory 71% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [11:37:42] (03PS3) 10Majavah: kerberos: manage users with custom puppet type [puppet] - 10https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) [11:37:51] !log rebalance row_A ganeti group in codfw (to allow to eventually free 2023 of instances) [11:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:03] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33095/console" [puppet] - 10https://gerrit.wikimedia.org/r/749552 (https://phabricator.wikimedia.org/T297673) (owner: 10Giuseppe Lavagetto) [11:38:34] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 
(10dcaro) [11:38:52] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] profile:k8s::deployment_server::mediawiki: split in subprofiles [puppet] - 10https://gerrit.wikimedia.org/r/749552 (https://phabricator.wikimedia.org/T297673) (owner: 10Giuseppe Lavagetto) [11:40:37] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [11:41:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P18280 and previous config saved to /var/cache/conftool/dbconfig/20220103-114122-marostegui.json [11:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10MoritzMuehlenhoff) There's a few classes which are currently unused, but which should be kept nonetheless (like the unused debian::codename:: ones), maybe we... 
[11:42:53] (03CR) 10Zabe: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743683 (owner: 10Majavah) [11:43:04] (03PS1) 10David Caro: initramfs:hook: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751102 (https://phabricator.wikimedia.org/T272559) [11:43:32] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:45:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P18281 and previous config saved to /var/cache/conftool/dbconfig/20220103-114507-marostegui.json [11:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:51] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) >>! In T272559#7593881, @MoritzMuehlenhoff wrote: > There's a few classes which are currently unused, but which should be kept nonetheless (like the u... 
[11:49:05] (03PS3) 10Giuseppe Lavagetto: kubernetes::deployment_server::mediawiki: add builder user/role [puppet] - 10https://gerrit.wikimedia.org/r/749553 (https://phabricator.wikimedia.org/T297673) [11:49:36] (03PS1) 10David Caro: labs_lvm:swap: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751103 (https://phabricator.wikimedia.org/T272559) [11:50:13] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:52:41] (03PS1) 10Elukey: admin: allow all Analytics/DE members to manage cassandra on AQS [puppet] - 10https://gerrit.wikimedia.org/r/751104 [11:52:52] (03PS4) 10Majavah: kerberos: manage users with custom puppet type [puppet] - 10https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) [11:53:03] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [11:54:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove watchlist from s2 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18282 and previous config saved to /var/cache/conftool/dbconfig/20220103-115403-marostegui.json [11:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:07] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [11:54:17] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:55:12] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:56:06] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:56:27] !log marostegui@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P18283 and previous config saved to /var/cache/conftool/dbconfig/20220103-115627-marostegui.json [11:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:42] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:58:24] (03PS2) 10Elukey: admin: allow all Analytics/DE members to manage cassandra on AQS [puppet] - 10https://gerrit.wikimedia.org/r/751104 [11:58:46] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220103T1200). [12:00:05] MdsShakil, kart_, taavi, and zabe: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T277354)', diff saved to https://phabricator.wikimedia.org/P18284 and previous config saved to /var/cache/conftool/dbconfig/20220103-120011-marostegui.json [12:00:12] hey, I can deploy today [12:00:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:00:15] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [12:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:20] hey o/ [12:00:21] First of the year :) [12:00:26] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 220 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:00:44] kart_: hi, around? [12:01:00] * kart_ is here. [12:01:09] (03CR) 10Jbond: [C: 03+1] "LGTM from a spicerack PoV" [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [12:01:10] I assume you want to self-service? [12:01:12] !log installing wireshark security updates on stretch [12:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:19] taavi: Yeah. I can self-deploy. 
[12:01:30] sure, please ping me when done then [12:02:15] (03PS1) 10David Caro: labs_debrepo: remove unused modules [puppet] - 10https://gerrit.wikimedia.org/r/751105 (https://phabricator.wikimedia.org/T272559) [12:02:30] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:02:50] MdShakil around? First patch is yours. [12:03:30] (03PS3) 10KartikMistry: Set ContentTranslationContentImportForSectionTranslation for SX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747794 (https://phabricator.wikimedia.org/T294642) [12:03:42] kart_: don't think they're online [12:03:50] OK. I'll proceed with my patch then. [12:04:25] (03CR) 10RhinosF1: "this is scheduled for now. can you please show in the #wikimedia-operations Libera channel" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749244 (https://phabricator.wikimedia.org/T298187) (owner: 10MdsShakil) [12:04:37] kart_: I added a comment to patch [12:04:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:04:52] RhinosF1: Thanks! [12:04:56] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 48 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:07:00] (03CR) 10KartikMistry: [C: 03+2] "Config deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/747794 (https://phabricator.wikimedia.org/T294642) (owner: 10KartikMistry) [12:07:44] (03Merged) 10jenkins-bot: Set ContentTranslationContentImportForSectionTranslation for SX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747794 (https://phabricator.wikimedia.org/T294642) (owner: 10KartikMistry) [12:09:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:15] (03CR) 10Jbond: [C: 03+1] "LGTM minor optional nit" [software/spicerack] - 10https://gerrit.wikimedia.org/r/749852 (owner: 10Volans) [12:11:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T297094)', diff saved to https://phabricator.wikimedia.org/P18285 and previous config saved to /var/cache/conftool/dbconfig/20220103-121131-marostegui.json [12:11:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:11:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:39] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094 [12:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:13:21] (03CR) 10MdsShakil: "@RhinosF1 i don't understood your 
comments." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749244 (https://phabricator.wikimedia.org/T298187) (owner: 10MdsShakil) [12:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:13] RhinosF1: ^ [12:14:42] taavi: syncing. Few minutes.. [12:14:48] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/751073 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:14:59] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:747794|Set ContentTranslationContentImportForSectionTranslation for SX (T294642)]] (duration: 00m 59s) [12:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:02] T294642: Publishing in Section Translation is prevented with a "There is no section X" error message - https://phabricator.wikimedia.org/T294642 [12:15:12] taavi: done. Floor is yours! [12:15:14] thanks! [12:15:28] zabe: let's do your patches next [12:15:36] ok [12:15:37] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/751072 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:15:59] (03PS2) 10Majavah: Add a logo for pwnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/750805 (https://phabricator.wikimedia.org/T298438) (owner: 10Zabe) [12:16:02] (03CR) 10Majavah: [C: 03+2] Add a logo for pwnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/750805 (https://phabricator.wikimedia.org/T298438) (owner: 10Zabe) [12:16:51] (03Merged) 10jenkins-bot: Add a logo for pwnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/750805 (https://phabricator.wikimedia.org/T298438) (owner: 10Zabe) [12:16:59] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/751093 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:17:10] (03CR) 10Majavah: Create autopatroller and patroller groups on bnwiktionary (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/749244 (https://phabricator.wikimedia.org/T298187) (owner: 10MdsShakil) [12:17:28] PROBLEM - cassandra-a CQL 10.64.32.128:9042 on aqs1012 is CRITICAL: connect to address 10.64.32.128 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [12:17:29] (03CR) 10Muehlenhoff: [C: 03+1] "Seems fine to remove given that we never actually used since five years (and doublechecked with a Cumin run that no host in production cur" [puppet] - 10https://gerrit.wikimedia.org/r/751102 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:17:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:50] PROBLEM - Check systemd state on aqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:10] zabe: the first one is on mwdebug1001 [12:19:01] taavi: works the intended way [12:19:42] PROBLEM - cassandra-a service on aqs1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:20:22] !log taavi@deploy1002 Synchronized static/images/project-logos: Config: [[gerrit:750805|Add a logo for pwnwiki (T298438)]] (1/2) (duration: 00m 58s) [12:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:25] (03CR) 10Ladsgroup: "Thanks. I let Kormat take a look and after that I'll merge it and we do a gradual test runs in codfw, then in eqiad, etc." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [12:20:25] T298438: Logo update on Wikipedia Paiwan - https://phabricator.wikimedia.org/T298438 [12:21:17] (03PS3) 10David Caro: b:h:j:{metatstore,server}: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751080 (https://phabricator.wikimedia.org/T272559) [12:21:19] (03PS1) 10David Caro: jmxtrans: remove unused modules [puppet] - 10https://gerrit.wikimedia.org/r/751112 (https://phabricator.wikimedia.org/T272559) [12:21:40] !log taavi@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:750805|Add a logo for pwnwiki (T298438)]] (2/3) (duration: 00m 57s) [12:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:45] !log taavi@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:750805|Add a logo for pwnwiki (T298438)]] (3/3) (duration: 00m 57s) [12:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:56] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:23:11] (03PS2) 10Majavah: Add a logo for amiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/750826 (https://phabricator.wikimedia.org/T298439) (owner: 10Zabe) [12:23:14] (03CR) 10Jbond: [C: 03+1] identd: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751098 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:23:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/751105 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:23:55] (03CR) 10Majavah: [C: 03+2] Add a logo for amiwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/750826 (https://phabricator.wikimedia.org/T298439) (owner: 10Zabe) [12:24:26] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/751091 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:24:45] (03Merged) 10jenkins-bot: Add a logo for amiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/750826 (https://phabricator.wikimedia.org/T298439) (owner: 10Zabe) [12:25:05] zabe: next one is on mwdebug1001 [12:25:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:49] taavi: works the intended way [12:25:50] ty taavi, i was dealing with pensions [12:26:07] syncing then [12:26:47] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/751104 (owner: 10Elukey) [12:26:58] !log taavi@deploy1002 Synchronized static/images/project-logos: Config: [[gerrit:750826|Add a logo for amiwiki (T298439)]] (1/3) (duration: 00m 58s) [12:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:01] T298439: Update the logo of Wikipedia Amis - https://phabricator.wikimedia.org/T298439 [12:28:02] !log taavi@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:750826|Add a logo for amiwiki (T298439)]] (2/3) (duration: 00m 57s) [12:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:05] !log taavi@deploy1002 Synchronized wmf-config/logos.php: Config: 
[[gerrit:750826|Add a logo for amiwiki (T298439)]] (3/3) (duration: 00m 57s) [12:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:31] (03PS2) 10Majavah: Add towiki.ru to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748305 (https://phabricator.wikimedia.org/T294190) (owner: 10Zabe) [12:29:42] (03CR) 10Majavah: [C: 03+2] Add towiki.ru to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748305 (https://phabricator.wikimedia.org/T294190) (owner: 10Zabe) [12:30:25] (03Merged) 10jenkins-bot: Add towiki.ru to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748305 (https://phabricator.wikimedia.org/T294190) (owner: 10Zabe) [12:30:53] zabe: the last patch is on mwdebug1001 [12:30:56] RECOVERY - cassandra-a service on aqs1012 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:31:20] RECOVERY - Check systemd state on aqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:14] RECOVERY - cassandra-a CQL 10.64.32.128:9042 on aqs1012 is OK: TCP OK - 0.000 second response time on 10.64.32.128 port 9042 https://phabricator.wikimedia.org/T93886 [12:33:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:48] taavi: looks good to me (I don't really know how to test this in a good way without actually uploading something) [12:34:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply on pinkunicorn [12:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:42] I'll sync then (usually on upload-by-url patches we've gone with the standard "MW does not fatal") [12:35:23] (03PS2) 10Majavah: Use new class names for CentralAuth RC feed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743683 [12:35:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:45] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:748305|Add towiki.ru to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T294190)]] (duration: 00m 57s) [12:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:53] T294190: Add towiki.ru to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T294190 [12:37:08] deploying my own patch next [12:37:12] (03CR) 10Majavah: [C: 03+2] Use new class names for CentralAuth RC feed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743683 (owner: 10Majavah) [12:37:50] (03Merged) 10jenkins-bot: Use new class names for CentralAuth RC feed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743683 (owner: 10Majavah) [12:39:13] works on mwdebug1001, syncing [12:40:38] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:743683|Use new class names for CentralAuth RC feed]] (duration: 00m 57s) [12:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:57] MdsShakil is still not here [12:41:02] anyone have anything else? 
[12:41:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T297094)', diff saved to https://phabricator.wikimedia.org/P18286 and previous config saved to /var/cache/conftool/dbconfig/20220103-124117-marostegui.json [12:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:20] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094 [12:41:37] apparently not [12:41:44] !log UTC morning deploys done [12:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:21] !log installing openjdk-11 security updates on buster [12:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:13] (03CR) 10JMeybohm: [C: 04-1] "Although we don't really follow semver policy with the charts versions, I think it would be nice to be able to differentiate this change a" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750) (owner: 10Jelto) [12:51:36] (03CR) 10Muehlenhoff: [C: 03+1] "This looks fine, but changes sudo rules and thus needs IF meeting approval (next to happen in a wekk given the current no meeting week), i" [puppet] - 10https://gerrit.wikimedia.org/r/751104 (owner: 10Elukey) [12:54:27] (03CR) 10David Caro: [C: 03+1] "It seems only paws and toolforge use it, and both (tools/toolsbeta and paws) have that value set explicitly, so +1 from me" [puppet] - 10https://gerrit.wikimedia.org/r/739402 (owner: 10Majavah) [12:55:11] (03PS6) 10Jelto: gitlab_runner: use config template for registering new runners [puppet] - 10https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) [12:57:02] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:57:10] (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: Add campaign pattern for JOSA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749598 (https://phabricator.wikimedia.org/T298057) (owner: 10Gergő Tisza) [12:57:16] (03PS3) 10Kosta Harlan: GrowthExperiments: Add campaign pattern for JOSA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749598 (https://phabricator.wikimedia.org/T298057) (owner: 10Gergő Tisza) [12:57:41] (03CR) 10David Caro: [C: 03+1] kubeadm: raise default to 1.20 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739402 (owner: 10Majavah) [12:58:22] (03CR) 10David Caro: [C: 03+1] "Can you run a pcc on this? 
just to make sure it does not affect something unexpected" [puppet] - 10https://gerrit.wikimedia.org/r/739403 (owner: 10Majavah) [12:59:21] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33096/console" [puppet] - 10https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:00:09] (03CR) 10MdsShakil: "@Majavah I am in online but currently i have faceing login issue at libera.chat. Please move on, I haven't been able to fix it after tryin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749244 (https://phabricator.wikimedia.org/T298187) (owner: 10MdsShakil) [13:00:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host build2001.codfw.wmnet [13:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:41] (03CR) 10Jelto: [V: 03+1] gitlab_runner: use config template for registering new runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:02:29] (03PS2) 10Majavah: kubeadm: raise default to 1.20 [puppet] - 10https://gerrit.wikimedia.org/r/739402 [13:02:31] (03PS2) 10Majavah: aptrepo: drop k8s 1.19 repos [puppet] - 10https://gerrit.wikimedia.org/r/739403 [13:02:50] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739403 (owner: 10Majavah) [13:05:45] Hey 👋 [13:07:06] (03PS1) 10Ssingh: P:wikidough: set number of anycast-hc backup logs to 1 [puppet] - 10https://gerrit.wikimedia.org/r/751115 [13:07:52] (03CR) 10JMeybohm: [C: 04-1] services: cleanup helmfiles, update SAL logging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/737034 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:08:42] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33097/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/751115 (owner: 10Ssingh) [13:08:47] (03CR) 10MdsShakil: "Finally succeeded" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749244 (https://phabricator.wikimedia.org/T298187) (owner: 10MdsShakil) [13:09:44] (03CR) 10JMeybohm: [C: 03+1] "LGTM, but don't merge before https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/737034 obviously." [puppet] - 10https://gerrit.wikimedia.org/r/751067 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:11:11] (03PS1) 10Muehlenhoff: Add build2001 [puppet] - 10https://gerrit.wikimedia.org/r/751116 (https://phabricator.wikimedia.org/T298463) [13:12:01] (03PS1) 10Ssingh: durum: set anycast-hc logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/751117 [13:12:54] (03PS1) 10Majavah: secrets: Fix example apt1001 kerberos keytab [labs/private] - 10https://gerrit.wikimedia.org/r/751118 [13:13:17] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33098/console" [puppet] - 10https://gerrit.wikimedia.org/r/751117 (owner: 10Ssingh) [13:17:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T297094)', diff saved to https://phabricator.wikimedia.org/P18287 and previous config saved to /var/cache/conftool/dbconfig/20220103-131707-marostegui.json [13:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:11] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094 [13:20:32] MdsShakil: you've missed the window [13:20:35] jouncebot: next [13:20:35] In 3 hour(s) and 9 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220103T1630) [13:21:15] taavi: ^ [13:22:04] I'd prefer to not deploy it outside the deployment window, sorry :/ [13:23:01] I am facing login issue 😔 [13:23:20] (03CR) 10Jbond: [C: 03+1] secrets: Fix 
example apt1001 kerberos keytab [labs/private] - 10https://gerrit.wikimedia.org/r/751118 (owner: 10Majavah) [13:23:51] (03PS3) 10Jelto: charts: update charts to api v2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750) [13:27:30] (03PS4) 10Jelto: charts: update charts to api v2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750) [13:28:20] jbond: can you merge that labs/private patch too? [13:29:35] (03CR) 10JMeybohm: [C: 04-1] charts: update charts to api v2 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750) (owner: 10Jelto) [13:32:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P18288 and previous config saved to /var/cache/conftool/dbconfig/20220103-133212-marostegui.json [13:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:28] (03CR) 10Kormat: "One minor nit." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [13:38:07] (03PS1) 10Jelto: changeprop/eventgate: bump kafka-dev dependencie to 1.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/751120 (https://phabricator.wikimedia.org/T295750) [13:39:07] (03CR) 10Jelto: "I'll rebuild the Chart.lock file as soon as I4f78d3de377c32a73bad5e8f9d6f59a75491c1a0 is merged" [deployment-charts] - 10https://gerrit.wikimedia.org/r/751120 (https://phabricator.wikimedia.org/T295750) (owner: 10Jelto) [13:40:05] MdsShakil: you can't put it in the portals window [13:40:26] (03CR) 10Jelto: charts: update charts to api v2 (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750) (owner: 10Jelto) [13:42:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [13:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [13:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T277354)', diff saved to https://phabricator.wikimedia.org/P18289 and previous config saved to /var/cache/conftool/dbconfig/20220103-134227-marostegui.json [13:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:30] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:43:15] RhinosF1 next one is ok? 
4 hours from now [13:44:18] MdsShakil: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220103T1900 is the next window you can use [13:44:48] UTC evening backport window [13:47:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P18290 and previous config saved to /var/cache/conftool/dbconfig/20220103-134716-marostegui.json [13:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:49] RhinosF1 Done [13:49:10] PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:54:23] (03CR) 10Jelto: [C: 03+1] "code changes and diff looks good to me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) (owner: 10JMeybohm) [13:55:24] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "makes sense." [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/748143 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [13:58:35] (03PS5) 10RhinosF1: Create autopatroller and patroller groups on bnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749244 (https://phabricator.wikimedia.org/T298187) (owner: 10MdsShakil) [13:58:54] (03CR) 10RhinosF1: [C: 03+1] Create autopatroller and patroller groups on bnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749244 (https://phabricator.wikimedia.org/T298187) (owner: 10MdsShakil) [14:01:10] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add support for returning bundles instead of certs from sign calls [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/748143 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [14:02:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T297094)', diff saved to https://phabricator.wikimedia.org/P18291 and previous config 
saved to /var/cache/conftool/dbconfig/20220103-140221-marostegui.json [14:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155,1158].eqiad.wmnet with reason: Maintenance [14:02:25] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094 [14:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155,1158].eqiad.wmnet with reason: Maintenance [14:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T297094)', diff saved to https://phabricator.wikimedia.org/P18292 and previous config saved to /var/cache/conftool/dbconfig/20220103-140232-marostegui.json [14:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] kubernetes::deployment_server::mediawiki: add builder user/role [puppet] - 10https://gerrit.wikimedia.org/r/749553 (https://phabricator.wikimedia.org/T297673) (owner: 10Giuseppe Lavagetto) [14:10:15] (03PS1) 10Giuseppe Lavagetto: deployment_server: add credentials for ci-restricted [puppet] - 10https://gerrit.wikimedia.org/r/751123 [14:11:36] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:11:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment_server: add credentials for ci-restricted [puppet] - 10https://gerrit.wikimedia.org/r/751123 (owner: 10Giuseppe Lavagetto) [14:11:40] (03CR) 10Muehlenhoff: [C: 03+2] Add build2001 [puppet] - 10https://gerrit.wikimedia.org/r/751116 
(https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff) [14:12:23] joe: I'll merge along? [14:12:26] moritzm: please merge my patch as well yes [14:12:41] done [14:12:47] thanks [14:13:14] (03CR) 10Jbond: [V: 03+2 C: 03+2] secrets: Fix example apt1001 kerberos keytab [labs/private] - 10https://gerrit.wikimedia.org/r/751118 (owner: 10Majavah) [14:13:56] taavi: sorry missed your ping, the labs cr is merged now [14:14:06] thanks [14:14:14] np [14:15:01] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739403 (owner: 10Majavah) [14:18:47] (03CR) 10David Caro: [C: 03+2] aptrepo: drop k8s 1.19 repos [puppet] - 10https://gerrit.wikimedia.org/r/739403 (owner: 10Majavah) [14:18:52] (03CR) 10David Caro: [C: 03+1] aptrepo: drop k8s 1.19 repos [puppet] - 10https://gerrit.wikimedia.org/r/739403 (owner: 10Majavah) [14:22:31] (03PS1) 10Muehlenhoff: Add partman globbing for build* [puppet] - 10https://gerrit.wikimedia.org/r/751124 [14:23:12] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [14:26:27] (03PS1) 10David Caro: locales: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751126 (https://phabricator.wikimedia.org/T272559) [14:26:49] (03CR) 10Muehlenhoff: [C: 03+2] Add partman globbing for build* [puppet] - 10https://gerrit.wikimedia.org/r/751124 (owner: 10Muehlenhoff) [14:27:03] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [14:29:34] (03PS1) 10David Caro: logstash:input:syslog: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751127 (https://phabricator.wikimedia.org/T272559) [14:30:09] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [14:30:35] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1158 (T297094)', diff saved to https://phabricator.wikimedia.org/P18293 and previous config saved to /var/cache/conftool/dbconfig/20220103-143034-marostegui.json [14:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:38] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094 [14:31:50] (03PS1) 10David Caro: logstash:plugin: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751129 (https://phabricator.wikimedia.org/T272559) [14:32:45] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [14:34:20] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] cfssl-issuer: Update to v0.2.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/749690 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [14:35:11] (03PS2) 10Ssingh: P:wikidough: set number of anycast-hc backup logs to 1 [puppet] - 10https://gerrit.wikimedia.org/r/751115 [14:37:36] (03CR) 10Ayounsi: [C: 03+2] Deprecate interface-range external [homer/public] - 10https://gerrit.wikimedia.org/r/744782 (https://phabricator.wikimedia.org/T296935) (owner: 10Ayounsi) [14:37:53] (03PS7) 10Ladsgroup: Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) [14:37:58] (03CR) 10Ladsgroup: Add MySQL upgrade cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [14:38:44] (03Merged) 10jenkins-bot: Deprecate interface-range external [homer/public] - 10https://gerrit.wikimedia.org/r/744782 (https://phabricator.wikimedia.org/T296935) (owner: 10Ayounsi) [14:39:24] (03CR) 10Kormat: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) 
[14:42:48] !log push CR744782 "Deprecate interface-range external" to all routers [14:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:07] (03CR) 10Ladsgroup: [C: 03+2] Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [14:43:14] (03PS1) 10David Caro: lshell: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751130 (https://phabricator.wikimedia.org/T272559) [14:44:46] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [14:45:35] (03Merged) 10jenkins-bot: Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [14:45:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P18295 and previous config saved to /var/cache/conftool/dbconfig/20220103-144539-marostegui.json [14:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:33] !log published image docker-registry.discovery.wmnet/cfssl-issuer:0.2.0-1 - T294560 [14:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:36] T294560: Automate issuing of TLS certificates in kubernetes clusters - https://phabricator.wikimedia.org/T294560 [14:46:45] (03CR) 10JMeybohm: [C: 03+2] cfssl-issuer: Update to version 0.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/749689 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [14:47:08] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [14:48:11] (03CR) 10Ssingh: [C: 03+2] P:wikidough: set number of anycast-hc backup logs to 1 [puppet] - 
10https://gerrit.wikimedia.org/r/751115 (owner: 10Ssingh) [14:50:12] RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:50:28] (03PS1) 10Ssingh: wikidough: disable firewall logging for dropped packets [puppet] - 10https://gerrit.wikimedia.org/r/751134 [14:51:27] (03Merged) 10jenkins-bot: cfssl-issuer: Update to version 0.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/749689 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [14:52:31] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) >>! In T292322#7556759, @Legoktm wrote: > I thought I had replied earlier, for now the plan is to test POSTing large files to Shellbox, identify what layers it... [14:53:27] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33099/console" [puppet] - 10https://gerrit.wikimedia.org/r/751134 (owner: 10Ssingh) [14:54:17] (03PS1) 10David Caro: mariadb: remove unused class [puppet] - 10https://gerrit.wikimedia.org/r/751135 (https://phabricator.wikimedia.org/T272559) [14:55:05] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [14:56:09] jouncebot: now [14:56:10] No deployments scheduled for the next 1 hour(s) and 33 minute(s) [14:56:14] I am going to restart Gerrit [14:59:00] !log Restarting Gerrit replica on gerrit2001 [14:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:34] (03PS1) 10David Caro: mcrouter::monitoring: remove module [puppet] - 10https://gerrit.wikimedia.org/r/751136 (https://phabricator.wikimedia.org/T272559) [15:00:08] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Deprecate interface-range external - 
https://phabricator.wikimedia.org/T296935 (10ayounsi) 05Open→03Resolved a:03ayounsi Deployed! [15:00:24] !log Restarting Gerrit primary on gerrit1001 [15:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:27] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [15:00:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P18296 and previous config saved to /var/cache/conftool/dbconfig/20220103-150045-marostegui.json [15:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:05] hashar: let me know if I can help (noticed gerrit down) [15:01:19] dcaro: I restarted it ;) [15:01:29] our human monitoring is wayyyy tooo fast [15:01:33] 👍 [15:02:01] it is back up! thank you dcaro ! [15:03:20] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: set anycast-hc logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/751117 (owner: 10Ssingh) [15:06:17] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10homer, and 3 others: Investigate Capirca - https://phabricator.wikimedia.org/T273865 (10ayounsi) [15:07:52] (03PS1) 10Muehlenhoff: Make build2001 a build host [puppet] - 10https://gerrit.wikimedia.org/r/751146 [15:09:48] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/751134 (owner: 10Ssingh) [15:11:39] (03PS1) 10David Caro: monitoring::graphite_freshness: remove define/cleanup [puppet] - 10https://gerrit.wikimedia.org/r/751148 (https://phabricator.wikimedia.org/T272559) [15:11:49] (03CR) 10Ssingh: [V: 03+1 C: 03+2] wikidough: disable firewall logging for dropped packets [puppet] - 10https://gerrit.wikimedia.org/r/751134 (owner: 10Ssingh) [15:15:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T297094)', diff saved to 
https://phabricator.wikimedia.org/P18297 and previous config saved to /var/cache/conftool/dbconfig/20220103-151550-marostegui.json [15:15:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [15:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [15:15:54] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094 [15:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T297094)', diff saved to https://phabricator.wikimedia.org/P18298 and previous config saved to /var/cache/conftool/dbconfig/20220103-151558-marostegui.json [15:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:32] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [15:24:25] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/751148 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:26:59] 10SRE, 10ops-codfw: ms-be2065 failed drive sdq - https://phabricator.wikimedia.org/T297933 (10Papaul) 05Open→03Resolved @fgiunchedi disk replaced [15:27:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T277354)', diff saved to https://phabricator.wikimedia.org/P18299 and previous config saved to /var/cache/conftool/dbconfig/20220103-152710-marostegui.json [15:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:14] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [15:29:48] 10SRE, 10ops-codfw: ms-be2065 failed drive sdq - https://phabricator.wikimedia.org/T297933 (10Papaul) Tracking information {F34906059} [15:30:48] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10Papaul) [15:37:19] 10SRE, 10ops-codfw, 10Traffic: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (10Papaul) @Vgutierrez Happy new year can I power this server off so I can swap DIMM A1 with DIMM B1? [15:38:32] 10SRE, 10ops-codfw, 10Traffic: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (10Vgutierrez) @Papaul yes, go ahead please. 
Happy new year :) [15:41:04] !log installing edk2 security updates [15:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:10] PROBLEM - Host cp2029 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P18300 and previous config saved to /var/cache/conftool/dbconfig/20220103-154215-marostegui.json [15:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:23] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cp2029.codfw.wmnet with reason: Swapping faulty DIMM with B1 [15:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:25] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cp2029.codfw.wmnet with reason: Swapping faulty DIMM with B1 [15:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:32] 10SRE, 10ops-codfw, 10Traffic: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (10ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 0:30:00 1 host(s) and their services with reason: Swapping faulty DIMM with B1 ` cp2029.codfw.wmnet ` [15:43:32] !log installing datatables.js security updates [15:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:25] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2147 - https://phabricator.wikimedia.org/T298301 (10Papaul) You have successfully submitted request SR1080456967 [15:46:50] (03PS1) 10Muehlenhoff: Add library hint for gmp [puppet] - 10https://gerrit.wikimedia.org/r/751155 [15:47:21] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2147 - https://phabricator.wikimedia.org/T298301 (10Marostegui) Thank you! 
[15:49:23] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for gmp [puppet] - 10https://gerrit.wikimedia.org/r/751155 (owner: 10Muehlenhoff) [15:50:39] !log installing gmp security updates [15:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:26] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10MoritzMuehlenhoff) [15:51:57] 10SRE, 10ops-codfw, 10Traffic: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (10Papaul) a:03Papaul [15:53:00] 10SRE, 10ops-codfw, 10Traffic: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (10Papaul) I swapped DIMM A1 with DIMM B1 to see if the error shows on B1. I am leaving the task open for now. [15:53:11] !log installing publicsuffix 20211207.1025-0+deb11u1 on bullseye hosts [15:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:28] RECOVERY - Host cp2029 is UP: PING OK - Packet loss = 0%, RTA = 31.53 ms [15:53:36] 10SRE, 10ops-codfw, 10Traffic: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (10Papaul) p:05Triage→03Medium [15:53:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:54:51] (03PS1) 10Ladsgroup: sre.myql.upgrade: Fix missing argument [cookbooks] - 10https://gerrit.wikimedia.org/r/751157 (https://phabricator.wikimedia.org/T239814) [15:55:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:55:45] (03CR) 10Ladsgroup: [C: 03+2] sre.myql.upgrade: Fix missing argument [cookbooks] - 10https://gerrit.wikimedia.org/r/751157 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [15:56:41] 10SRE, 10ops-codfw, 10Traffic: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (10Vgutierrez) @Papaul cool, I'll repool the server then [15:57:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P18301 and previous config saved to /var/cache/conftool/dbconfig/20220103-155720-marostegui.json [15:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:18] !log pool cp2029 - T298293 [15:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:21] T298293: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 [15:58:27] (03Merged) 10jenkins-bot: sre.myql.upgrade: Fix missing argument [cookbooks] - 10https://gerrit.wikimedia.org/r/751157 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [16:00:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1084.eqiad.wmnet with OS buster [16:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1084.eqiad.wmnet with O... 
[16:01:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T297094)', diff saved to https://phabricator.wikimedia.org/P18302 and previous config saved to /var/cache/conftool/dbconfig/20220103-160131-marostegui.json [16:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:34] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094 [16:01:54] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [16:03:57] (03PS1) 10David Caro: netops: removed empty class [puppet] - 10https://gerrit.wikimedia.org/r/751158 (https://phabricator.wikimedia.org/T272559) [16:04:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS buster [16:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1085.eqiad.wmnet with O... [16:04:30] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [16:04:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1086.eqiad.wmnet with OS buster [16:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1086.eqiad.wmnet with O... 
[16:05:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1087.eqiad.wmnet with OS buster [16:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1087.eqiad.wmnet with O... [16:06:00] (03PS1) 10David Caro: nginx:snippet: remove unused class [puppet] - 10https://gerrit.wikimedia.org/r/751159 (https://phabricator.wikimedia.org/T272559) [16:06:01] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1088.eqiad.wmnet with OS buster [16:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1088.eqiad.wmnet with O... 
[16:06:33] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [16:08:07] (03PS1) 10David Caro: nginx::ssl: remove class [puppet] - 10https://gerrit.wikimedia.org/r/751160 (https://phabricator.wikimedia.org/T272559) [16:08:37] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [16:10:04] (03PS1) 10Ladsgroup: sre.mysql.upgrade: Fix argparse [cookbooks] - 10https://gerrit.wikimedia.org/r/751161 (https://phabricator.wikimedia.org/T239814) [16:11:52] (03PS1) 10David Caro: osm::usergrants: remove unused define [puppet] - 10https://gerrit.wikimedia.org/r/751162 (https://phabricator.wikimedia.org/T272559) [16:12:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T277354)', diff saved to https://phabricator.wikimedia.org/P18303 and previous config saved to /var/cache/conftool/dbconfig/20220103-161224-marostegui.json [16:12:25] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [16:12:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [16:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [16:12:29] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [16:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T277354)', 
diff saved to https://phabricator.wikimedia.org/P18304 and previous config saved to /var/cache/conftool/dbconfig/20220103-161232-marostegui.json [16:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, but let's doublecheck with DBAs" [puppet] - 10https://gerrit.wikimedia.org/r/751135 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:16:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P18305 and previous config saved to /var/cache/conftool/dbconfig/20220103-161635-marostegui.json [16:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/751126 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:17:34] (03PS1) 10David Caro: parsoid: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751163 (https://phabricator.wikimedia.org/T272559) [16:18:04] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1086.eqiad.wmnet with OS buster [16:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1086.eqiad.wmnet with OS bu... 
[16:18:56] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [16:18:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/751163 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:22:10] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10MoritzMuehlenhoff) [16:22:26] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1088.eqiad.wmnet with OS buster [16:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1088.eqiad.wmnet with OS bu... [16:25:07] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1084.eqiad.wmnet with OS buster [16:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1084.eqiad.wmnet with OS bu... 
[16:25:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/751161 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [16:26:10] (03CR) 10Jbond: [C: 03+1] netops: removed empty class [puppet] - 10https://gerrit.wikimedia.org/r/751158 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:26:27] (03PS1) 10Cmjohnson: Adding new cloudbackup1003/4 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/751164 (https://phabricator.wikimedia.org/T293934) [16:27:06] (03CR) 10Jbond: [C: 03+1] nginx::ssl: remove class [puppet] - 10https://gerrit.wikimedia.org/r/751160 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:27:24] (03CR) 10Jbond: [C: 03+1] nginx:snippet: remove unused class [puppet] - 10https://gerrit.wikimedia.org/r/751159 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:27:29] (03CR) 10Cmjohnson: [C: 03+2] Adding new cloudbackup1003/4 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/751164 (https://phabricator.wikimedia.org/T293934) (owner: 10Cmjohnson) [16:28:18] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1085.eqiad.wmnet with OS buster [16:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1085.eqiad.wmnet with OS bu... 
[16:28:46] (03CR) 10Ladsgroup: [C: 03+2] sre.mysql.upgrade: Fix argparse [cookbooks] - 10https://gerrit.wikimedia.org/r/751161 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [16:29:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1087.eqiad.wmnet with OS buster [16:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1087.eqiad.wmnet with OS bu... [16:30:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1086.eqiad.wmnet with OS buster [16:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1086.eqiad.wmnet with O... [16:30:45] (03PS1) 10David Caro: {role:,profile:,}peek: remove unused classes [puppet] - 10https://gerrit.wikimedia.org/r/751165 (https://phabricator.wikimedia.org/T272559) [16:31:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1088.eqiad.wmnet with OS buster [16:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1088.eqiad.wmnet with O... 
[16:31:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P18306 and previous config saved to /var/cache/conftool/dbconfig/20220103-163140-marostegui.json [16:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:42] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [16:31:49] (03Merged) 10jenkins-bot: sre.mysql.upgrade: Fix argparse [cookbooks] - 10https://gerrit.wikimedia.org/r/751161 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [16:32:11] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [16:33:00] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/751135 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:33:13] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:38] (03PS1) 10Giuseppe Lavagetto: deployment_server: fix permissions for mwbuilder/other [puppet] - 10https://gerrit.wikimedia.org/r/751166 (https://phabricator.wikimedia.org/T297673) [16:37:45] !log ladsgroup@cumin1001 START - Cookbook sre.mysql.upgrade for db2144.codfw.wmnet [16:37:45] !log ladsgroup@cumin1001 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for db2144.codfw.wmnet [16:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:24] (03CR) 10jerkins-bot: [V: 04-1] deployment_server: fix permissions for mwbuilder/other [puppet] - 10https://gerrit.wikimedia.org/r/751166 (https://phabricator.wikimedia.org/T297673) (owner: 10Giuseppe Lavagetto) [16:38:26] 
(03CR) 10Elukey: admin: allow all Analytics/DE members to manage cassandra on AQS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751104 (owner: 10Elukey) [16:40:09] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS buster [16:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1003.e... [16:43:06] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudbackup1003.eqiad.wmnet with OS buster [16:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad... [16:43:46] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS buster [16:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1003.e... 
[16:46:13] (03PS2) 10Giuseppe Lavagetto: deployment_server: fix permissions for mwbuilder/other [puppet] - 10https://gerrit.wikimedia.org/r/751166 (https://phabricator.wikimedia.org/T297673) [16:46:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T297094)', diff saved to https://phabricator.wikimedia.org/P18307 and previous config saved to /var/cache/conftool/dbconfig/20220103-164645-marostegui.json [16:46:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [16:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [16:46:48] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094 [16:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T297094)', diff saved to https://phabricator.wikimedia.org/P18308 and previous config saved to /var/cache/conftool/dbconfig/20220103-164652-marostegui.json [16:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:04] (03PS1) 10Bartosz Dziewoński: Make reply tool available as opt-out on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751167 (https://phabricator.wikimedia.org/T297534) [16:50:06] (03PS1) 10Bartosz Dziewoński: Make reply tool available as opt-out on specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751168 (https://phabricator.wikimedia.org/T297535) [16:54:17] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1086.eqiad.wmnet with OS buster [16:54:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1086.eqiad.wmnet with OS bu... [16:57:17] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [16:57:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1088.eqiad.wmnet with OS buster [16:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1088.eqiad.wmnet with OS bu... 
[17:02:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10Cmjohnson) [17:02:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10Cmjohnson) 05Open→03Resolved completed [17:02:48] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [17:06:13] (03PS1) 10DCausse: rdf-streaming-updater: increase capacity for commons [deployment-charts] - 10https://gerrit.wikimedia.org/r/751171 (https://phabricator.wikimedia.org/T262265) [17:11:21] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: ms-be2065, elastic1087, miscweb1002, elastic1084, elastic1085 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [17:11:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Cmjohnson) @andrewbogott: these servers have 12 2TB disks, they failed during the raid setup, the first 2 disks are raid 1 and the remainder... [17:13:38] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1003.eqiad.wmnet with OS buster [17:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad.wmnet with OS buster... 
[17:14:44] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [17:15:05] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:16:26] (03PS1) 10Zabe: exdist: migrate crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/751173 (https://phabricator.wikimedia.org/T273673) [17:16:52] (03PS2) 10Zabe: extdist: migrate crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/751173 (https://phabricator.wikimedia.org/T273673) [17:19:27] (03PS1) 10Zabe: extdist: remove absented crons [puppet] - 10https://gerrit.wikimedia.org/r/751174 (https://phabricator.wikimedia.org/T273673) [17:46:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T297094)', diff saved to https://phabricator.wikimedia.org/P18309 and previous config saved to /var/cache/conftool/dbconfig/20220103-174608-marostegui.json [17:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:12] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094 [18:01:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P18310 and previous config saved to /var/cache/conftool/dbconfig/20220103-180112-marostegui.json [18:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T277354)', diff saved to https://phabricator.wikimedia.org/P18311 and previous config saved to /var/cache/conftool/dbconfig/20220103-180743-marostegui.json [18:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:47] T277354: "chemical" major mime 
type was never added to production database - https://phabricator.wikimedia.org/T277354 [18:16:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P18312 and previous config saved to /var/cache/conftool/dbconfig/20220103-181617-marostegui.json [18:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P18313 and previous config saved to /var/cache/conftool/dbconfig/20220103-182248-marostegui.json [18:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T297094)', diff saved to https://phabricator.wikimedia.org/P18314 and previous config saved to /var/cache/conftool/dbconfig/20220103-183122-marostegui.json [18:31:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [18:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [18:31:26] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094 [18:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T297094)', diff saved to https://phabricator.wikimedia.org/P18315 and previous config saved to /var/cache/conftool/dbconfig/20220103-183130-marostegui.json [18:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:37:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P18316 and previous config saved to /var/cache/conftool/dbconfig/20220103-183752-marostegui.json [18:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:11] (03PS3) 10Juan90264: Delete Tematica namespace (NS:104) in Italian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/750814 (https://phabricator.wikimedia.org/T298315) [18:52:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T277354)', diff saved to https://phabricator.wikimedia.org/P18317 and previous config saved to /var/cache/conftool/dbconfig/20220103-185257-marostegui.json [18:52:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [18:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [18:53:01] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [18:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T277354)', diff saved to https://phabricator.wikimedia.org/P18318 and previous config saved to /var/cache/conftool/dbconfig/20220103-185305-marostegui.json [18:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:46] urbanecm, zabe: could either of you review https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/751182/ please? 
I'd like to get it included in this week's train [19:12:19] taavi: I don't see why it should be PAGE_CLOSED_QUEUE? [19:13:36] urbanecm: because that's the method that handles the /closed page? it was PAGE_CLOSED_QUEUED before https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/751182/ too, and without it the closed requests list is empty on my local wiki [19:14:22] oh, ignore me, i was looking at handleClosedQueue in the new patch and handleOpenQueue in the broken patch [19:14:43] approved [19:14:45] thanks! [19:15:40] np [19:40:42] Can anyone tell me what I should do to edit other users' changes in Gerrit? [19:41:53] I'm trying but it shows "Error 403 (Forbidden): edit not permitted" and "Endpoint: /changes/*~*/edit/*" [19:42:30] Trusted Contributors? [19:43:42] Wait? Do I need to be a "Trusted Contributors" to be able to edit these cited changes? [19:44:46] Reedy: Could you let me know how I can join this group of "Trusted Contributors"? [19:46:06] It's an anti abuse mechanism - https://gerrit.wikimedia.org/r/admin/groups/2021f25e7515187a81d51f8fe14dd6f25617cce0 [19:46:28] I can add you [19:47:30] Thanks Reedy! [19:47:46] What gerrit username? [19:48:18] Juan90264 [19:48:48] Done [19:49:33] Perfect, thanks again Reedy! [19:50:56] (03PS4) 10Juan90264: Update bnwikivoyage wordmark logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749626 (https://phabricator.wikimedia.org/T298033) (owner: 10MdsShakil) [20:33:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:35:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:36:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T277354)', diff saved to https://phabricator.wikimedia.org/P18319 and previous config saved to /var/cache/conftool/dbconfig/20220103-203654-marostegui.json [20:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:58] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [20:52:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P18320 and previous config saved to /var/cache/conftool/dbconfig/20220103-205159-marostegui.json [20:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:42] (PS5) Juan90264: Update bnwikivoyage wordmark logo [mediawiki-config] - https://gerrit.wikimedia.org/r/749626 (https://phabricator.wikimedia.org/T298033) (owner: MdsShakil) [21:00:21] SRE, Infrastructure-Foundations: Setup new mirror server (mirror1001.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (faidon) Not sure if this has been flagged by anyone else or considered but note that our mirror is an official mirror for [[ https://salsa.debian.org/mirror-team/masterlist/-/b... 
[21:00:43] (PS6) Juan90264: Update bnwikivoyage wordmark logo [mediawiki-config] - https://gerrit.wikimedia.org/r/749626 (https://phabricator.wikimedia.org/T298033) (owner: MdsShakil) [21:05:56] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:07:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P18321 and previous config saved to /var/cache/conftool/dbconfig/20220103-210704-marostegui.json [21:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:18] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:19:36] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:20:52] (CR) Cwhite: logstash:input:syslog: remove unused module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/751127 (https://phabricator.wikimedia.org/T272559) (owner: David Caro) [21:21:12] PROBLEM - snapshot of s1 in eqiad on alert1001 is CRITICAL: snapshot for s1 at eqiad taken more than 3 days ago: Most recent backup 2021-12-31 20:49:48 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [21:21:26] (CR) Ammarpad: [C: +1] Define a contact form for Chapter/Thorg application status [mediawiki-config] - https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024) (owner: D3r1ck01) [21:21:54] (CR) Cwhite: [C: +1] "AFAIK we do not install logstash plugins this way anymore." 
[puppet] - https://gerrit.wikimedia.org/r/751129 (https://phabricator.wikimedia.org/T272559) (owner: David Caro) [21:22:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T277354)', diff saved to https://phabricator.wikimedia.org/P18322 and previous config saved to /var/cache/conftool/dbconfig/20220103-212209-marostegui.json [21:22:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [21:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [21:22:12] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [21:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T277354)', diff saved to https://phabricator.wikimedia.org/P18323 and previous config saved to /var/cache/conftool/dbconfig/20220103-212216-marostegui.json [21:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:18] (PS1) Zabe: graphite: migrate archiver crons to systemd timer jobs [puppet] - https://gerrit.wikimedia.org/r/751207 (https://phabricator.wikimedia.org/T273673) [21:32:20] (PS1) Zabe: graphite: remove absented crons [puppet] - https://gerrit.wikimedia.org/r/751208 (https://phabricator.wikimedia.org/T273673) [21:37:32] (PS1) Muehlenhoff: Switch Kunal to volunteer NDA status [puppet] - https://gerrit.wikimedia.org/r/751210 [21:43:35] (CR) Cwhite: [C: +2] profile: turn off grafana db sync ahead of 8.x upgrade [puppet] - https://gerrit.wikimedia.org/r/740682 
(https://phabricator.wikimedia.org/T282863) (owner: Cwhite) [21:50:57] !log manually upgrade to grafana 8 on grafana-next (T282863) [21:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:00] T282863: Upgrade Grafana to 8.x - https://phabricator.wikimedia.org/T282863 [23:08:18] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:21:14] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [23:23:28] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [23:24:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T277354)', diff saved to https://phabricator.wikimedia.org/P18324 and previous config saved to /var/cache/conftool/dbconfig/20220103-232433-marostegui.json [23:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:37] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [23:39:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P18325 and previous config saved to /var/cache/conftool/dbconfig/20220103-233938-marostegui.json [23:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P18326 and previous config saved to /var/cache/conftool/dbconfig/20220103-235443-marostegui.json [23:54:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log