[00:00:26] SRE, DBA, MW-1.39-notes (1.39.0-wmf.14; 2022-05-30), Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (tstarling) Open→Resolved [00:05:05] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01016 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:05:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P29699 and previous config saved to /var/cache/conftool/dbconfig/20220614-000558-marostegui.json [00:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:21:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P29700 and previous config saved to /var/cache/conftool/dbconfig/20220614-002103-marostegui.json [00:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:48] weird: it said "widespread puppet failures" up there [00:24:02] and 2 mw hosts were in the list. I ran puppet there.. and no failure. all normal. [00:24:19] but the hosts also disappeared from the list now. so more like a puppetmaster hickup [00:28:50] (CR) RLazarus: "Oh, I'd been wanting to get around to this and it dropped off my radar, thanks for beating me to it!" 
[puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm) [00:30:52] (PS1) Brennen Bearnes: gitlab_runner: Allow subdirs in image paths [puppet] - https://gerrit.wikimedia.org/r/805247 (https://phabricator.wikimedia.org/T310535) [00:35:33] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [00:36:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T310011)', diff saved to https://phabricator.wikimedia.org/P29701 and previous config saved to /var/cache/conftool/dbconfig/20220614-003608-marostegui.json [00:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:14] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [00:42:29] (CR) Tim Starling: make_beta_config.py: run helm as helm3 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/804486 (https://phabricator.wikimedia.org/T295578) (owner: Tim Starling) [00:42:41] (Abandoned) Tim Starling: make_beta_config.py: run helm as helm3 [deployment-charts] - https://gerrit.wikimedia.org/r/804486 (https://phabricator.wikimedia.org/T295578) (owner: Tim Starling) [00:48:11] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:50:19] (CR) Legoktm: Add profile::mediawiki::sharded_periodic_job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm) [00:55:41] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [00:57:09] (PS6) Tim Starling: Switch wgMainStash to db-mainstash [mediawiki-config] - https://gerrit.wikimedia.org/r/799433 (https://phabricator.wikimedia.org/T212129) [00:57:17] (PS2) Tim Starling: Switch wgMainStash back to Redis [mediawiki-config] - https://gerrit.wikimedia.org/r/804024 (https://phabricator.wikimedia.org/T212129) [01:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220614T0100) [01:04:12] (CR) RLazarus: [C: +1] mediawiki: Disable useless mostlinkedcategories update job [puppet] - https://gerrit.wikimedia.org/r/804803 (https://phabricator.wikimedia.org/T310456) (owner: Legoktm) [01:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:55:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298560)', diff saved to https://phabricator.wikimedia.org/P29702 and previous config saved to /var/cache/conftool/dbconfig/20220614-015532-ladsgroup.json [01:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:38] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [02:06:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:22] (PS1) 
TrainBranchBot: Branch commit for wmf/1.39.0-wmf.16 [core] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805252 [02:07:26] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.16 [core] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805252 (owner: TrainBranchBot) [02:07:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:07:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:10:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P29703 and previous config saved to /var/cache/conftool/dbconfig/20220614-021037-ladsgroup.json [02:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:21] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.16 [core] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805252 (owner: TrainBranchBot) [02:25:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P29704 and previous config saved to /var/cache/conftool/dbconfig/20220614-022542-ladsgroup.json [02:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:31:25] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298560)', diff saved to https://phabricator.wikimedia.org/P29705 and previous config saved to /var/cache/conftool/dbconfig/20220614-024047-ladsgroup.json [02:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:52] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [02:50:43] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:09:01] (JobUnavailable) firing: Reduced availability for job cassandra in analytics@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:21:33] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:24:55] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:28:25] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.133 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:30:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance 
wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [03:57:45] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:26:13] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:52:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 23 hosts with reason: Primary switchover s6 T300471 [04:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:52:04] T300471: Switchover s6 master (db1173 -> db1131) - https://phabricator.wikimedia.org/T300471 [04:52:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 23 hosts with reason: Primary switchover s6 T300471 [04:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:52:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1131 with weight 0 T300471', diff saved to https://phabricator.wikimedia.org/P29706 and previous config saved to /var/cache/conftool/dbconfig/20220614-045224-root.json [04:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:30] SRE, MediaWiki-General, Performance-Team, 
serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (tstarling) [05:00:08] (CR) Tim Starling: [C: +2] Switch wgMainStash to db-mainstash [mediawiki-config] - https://gerrit.wikimedia.org/r/799433 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling) [05:01:03] (Merged) jenkins-bot: Switch wgMainStash to db-mainstash [mediawiki-config] - https://gerrit.wikimedia.org/r/799433 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling) [05:04:05] (CR) Majavah: [C: -1] mediawiki: Split updateSpecialPages.php job to be per-shard (1 comment) [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: Legoktm) [05:05:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [05:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:06:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [05:06:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [05:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [05:08:48] mwdebug-deploy@deploy1002: Failed to log message to wiki. Somebody should check the error logs. 
[05:11:30] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T212129 Switch wgMainStash to db-mainstash (duration: 03m 38s) [05:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:34] T212129: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 [05:12:31] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:23:45] (PS4) Marostegui: mariadb: Promote db1131 to s6 master [puppet] - https://gerrit.wikimedia.org/r/805121 (https://phabricator.wikimedia.org/T300471) [05:26:21] (CR) Marostegui: [C: +2] mariadb: Promote db1131 to s6 master [puppet] - https://gerrit.wikimedia.org/r/805121 (https://phabricator.wikimedia.org/T300471) (owner: Marostegui) [05:31:02] (PS1) Marostegui: db1173: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805265 (https://phabricator.wikimedia.org/T300471) [05:45:29] Someone's reported beta is complaining about unknown cluster extension2 [05:46:07] * RhinosF1 thought you had to be on IRC while deploying [05:50:15] (CR) Samwilson: "This seems to have broken Beta https://en.wikipedia.beta.wmflabs.org/ :" [mediawiki-config] - https://gerrit.wikimedia.org/r/799433 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling) [05:51:17] TimStarling: ^ [05:51:21] (PS1) Tim Starling: Switch AbuseFilter profiler back to redis [mediawiki-config] - https://gerrit.wikimedia.org/r/805268 (https://phabricator.wikimedia.org/T212129) [05:52:39] (CR) Marostegui: [C: +1] Switch AbuseFilter profiler back to redis [mediawiki-config] - https://gerrit.wikimedia.org/r/805268 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling) [05:53:01] (PS1) Tim Starling: Configure FilterProfiler cache separately [extensions/AbuseFilter] (wmf/1.39.0-wmf.15) - 
https://gerrit.wikimedia.org/r/805160 (https://phabricator.wikimedia.org/T212129) [05:53:12] (CR) Tim Starling: [C: +2] Configure FilterProfiler cache separately [extensions/AbuseFilter] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/805160 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling) [06:00:04] kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220614T0600). [06:00:06] jynus: let's go? [06:00:25] one sec I get my graphs right [06:01:33] ready [06:01:41] !log Starting s6 eqiad failover from db1173 to db1131 - T300471 [06:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:45] T300471: Switchover s6 master (db1173 -> db1131) - https://phabricator.wikimedia.org/T300471 [06:01:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T300471', diff saved to https://phabricator.wikimedia.org/P29707 and previous config saved to /var/cache/conftool/dbconfig/20220614-060155-root.json [06:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:05] RO confirmed [06:02:12] "Warning: The database has been locked for maintenance" [06:02:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1131 to s6 primary and set section read-write T300471', diff saved to https://phabricator.wikimedia.org/P29708 and previous config saved to /var/cache/conftool/dbconfig/20220614-060227-root.json [06:02:29] marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[06:02:38] we are RW again [06:02:39] checking [06:02:52] yes [06:03:04] I can write on wikitech [06:03:23] lag shown on orchestrator [06:03:31] gone now [06:03:32] should be gone now [06:03:47] checking for errors [06:04:46] (CR) Marostegui: [C: +2] wmnet: Update s6-master CNAME [dns] - https://gerrit.wikimedia.org/r/805137 (https://phabricator.wikimedia.org/T300471) (owner: Marostegui) [06:05:31] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:06:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1173 T300471', diff saved to https://phabricator.wikimedia.org/P29709 and previous config saved to /var/cache/conftool/dbconfig/20220614-060608-root.json [06:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:01] jynus: I think we are done [06:07:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:07:06] jynus: Thanks for all the help [06:07:24] elastic issue related to something? 
[06:07:45] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:09:49] I am checking [06:09:53] But I don't see it for now [06:11:05] (Merged) jenkins-bot: Configure FilterProfiler cache separately [extensions/AbuseFilter] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/805160 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling) [06:14:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:15:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:04] (CR) Marostegui: [C: +2] db1173: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805265 (https://phabricator.wikimedia.org/T300471) (owner: Marostegui) [06:20:26] !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.15/extensions/AbuseFilter/extension.json: T212129 (duration: 03m 32s) [06:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:29] T212129: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 [06:23:25] 
(CR) Tim Starling: [C: +2] Switch AbuseFilter profiler back to redis [mediawiki-config] - https://gerrit.wikimedia.org/r/805268 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling) [06:24:19] !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.15/extensions/AbuseFilter/includes/ServiceWiring.php: T212129 (duration: 03m 33s) [06:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:31] (Merged) jenkins-bot: Switch AbuseFilter profiler back to redis [mediawiki-config] - https://gerrit.wikimedia.org/r/805268 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling) [06:27:35] !log Reboot dbproxy1012 and dbproxy1015 T310484 [06:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:39] T310484: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 [06:28:26] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T212129 (duration: 03m 31s) [06:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:30] T212129: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 [06:29:27] SRE, DBA, Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (Marostegui) [06:31:17] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:31:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:14] (PS1) Marostegui: wmnet: Failover m3 and m5 master. 
[dns] - https://gerrit.wikimedia.org/r/805271 (https://phabricator.wikimedia.org/T310484) [06:32:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:32:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:49] marostegui: https://phabricator.wikimedia.org/T310569#8001223 [06:45:04] TimStarling: ^ [06:49:03] I was mostly busy making sure production wasn't broken [06:52:17] (PS1) Tim Starling: Add extension2 cluster on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/805315 (https://phabricator.wikimedia.org/T310569) [06:53:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1148 for schema change', diff saved to https://phabricator.wikimedia.org/P29710 and previous config saved to /var/cache/conftool/dbconfig/20220614-065322-root.json [06:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:30] (CR) Tim Starling: [C: +2] Add extension2 cluster on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/805315 (https://phabricator.wikimedia.org/T310569) (owner: Tim Starling) [06:55:16] (PS2) Marostegui: wmnet: Failover m3 and m5 master. 
[dns] - https://gerrit.wikimedia.org/r/805271 (https://phabricator.wikimedia.org/T310484) [06:55:40] (Merged) jenkins-bot: Add extension2 cluster on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/805315 (https://phabricator.wikimedia.org/T310569) (owner: Tim Starling) [06:58:47] GRANT SELECT, INSERT, UPDATE, DELETE ON `%a%`.* TO `wikiuser`@`10.%` [06:58:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:23] ^GRANT? [06:59:23] well, mainstash has an "a" in it so I guess that should be OK [06:59:36] that is how beta is configured? [06:59:47] yeah that is from show grants on beta [06:59:51] :-( [06:59:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:59:55] :-/ [06:59:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220614T0700). [07:00:04] samwilson: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
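The remark above that "mainstash has an "a" in it" relies on how database-level GRANT patterns work in MySQL/MariaDB: `%` matches any run of characters and `_` matches exactly one, so `%a%` covers every database whose name contains an "a". A minimal Python sketch of that matching rule follows. The helper names `grant_pattern_to_regex` and `grant_covers` are hypothetical, the comparison is case-sensitive for simplicity (real servers may differ depending on configuration), and every database name other than `mainstash` is made up for illustration.

```python
import re


def grant_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a database-name GRANT pattern into a regex (hypothetical helper).

    `%` matches any run of characters, `_` matches exactly one character,
    and a backslash makes the following character literal.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped character: take the next character literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def grant_covers(pattern: str, database: str) -> bool:
    """Return True if the whole database name matches the GRANT pattern."""
    return grant_pattern_to_regex(pattern).fullmatch(database) is not None


if __name__ == "__main__":
    # `mainstash` comes from the log; the other names are illustrative only.
    for name in ("mainstash", "enwiki", "testdb"):
        print(name, grant_covers("%a%", name))
```

Under this rule a database whose name has no "a" would fall outside the grant, which is why such a broad pattern is a fragile way to configure access.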
[07:00:10] still no rows in the table though [07:00:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:39] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:02:11] TimStarling: error can be reproduced consistently on url: https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:CreateAccount [07:02:55] yeah, the change is not deployed [07:02:57] the update jobs don't seem to have executed yet: https://integration.wikimedia.org/ci/view/Beta/ [07:03:01] ah! [07:03:21] I thought you were debugging post deploy [07:03:26] so did I [07:03:38] I don't ever use this site, don't know how it works [07:03:39] :-) [07:03:49] !log dbmaint s6@eqiad T300381 [07:03:49] Can the UTC morning backport window happen this hour? Or should it be cancelled? [07:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:54] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [07:04:20] probably it will be fixed when that change finds its way out of gerrit [07:04:21] ^I think DBA should be ok, TimStarling to say? [07:04:32] ^ samwilson [07:04:39] should be fine [07:04:44] back in half an hour [07:04:56] ok cool, thanks [07:05:35] urbanecm: are you around to do a deploy? [07:05:52] https://integration.wikimedia.org/zuul/ claims 9 minutes until beta for anyone that cares [07:05:54] i can deploy in theory too, but am on a rather spotty connection so would rather not to [07:06:16] RhinosF1: no, it says that the job has been queued/running for 10 minutes [07:06:55] taavi: oh ye, it's other way round [07:08:02] (CR) Muehlenhoff: [C: +1] "Looks good, one optional comment inline." 
[puppet] - https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:08:08] taavi: ok sure. maybe if no one else can you could, in a little while? [07:08:27] (PS3) Majavah: Update $wgVectorMaxWidthOptions to include action=edit [mediawiki-config] - https://gerrit.wikimedia.org/r/802685 (https://phabricator.wikimedia.org/T307725) (owner: Samwilson) [07:09:01] (JobUnavailable) firing: Reduced availability for job cassandra in analytics@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:09:09] (CR) Majavah: [C: +2] Update $wgVectorMaxWidthOptions to include action=edit [mediawiki-config] - https://gerrit.wikimedia.org/r/802685 (https://phabricator.wikimedia.org/T307725) (owner: Samwilson) [07:10:20] (Merged) jenkins-bot: Update $wgVectorMaxWidthOptions to include action=edit [mediawiki-config] - https://gerrit.wikimedia.org/r/802685 (https://phabricator.wikimedia.org/T307725) (owner: Samwilson) [07:11:08] TimStarling: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805315 was never pulled to deploy1002, ignoring it since it only touches -labs.php files [07:11:28] samwilson: please test the first patch on mwdebug1001 [07:11:38] thanks. testing now. 
[07:12:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:12:31] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:12:31] taavi: yep, t'riffic, all looks correct. [07:12:48] (CR) Jcrespo: [C: +2] wmnet: Failover m3 and m5 master. [dns] - https://gerrit.wikimedia.org/r/805271 (https://phabricator.wikimedia.org/T310484) (owner: Marostegui) [07:12:52] ok, syncing [07:13:02] (CR) Jcrespo: [C: +1] wmnet: Failover m3 and m5 master. [dns] - https://gerrit.wikimedia.org/r/805271 (https://phabricator.wikimedia.org/T310484) (owner: Marostegui) [07:13:05] (CR) Marostegui: [C: +2] wmnet: Failover m3 and m5 master. 
[dns] - https://gerrit.wikimedia.org/r/805271 (https://phabricator.wikimedia.org/T310484) (owner: Marostegui) [07:13:33] (PS2) Majavah: Enable Realtime Preview on cawiki, viwiki, and fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/804806 (https://phabricator.wikimedia.org/T303961) (owner: Samwilson) [07:13:44] SRE, DBA, Patch-For-Review, Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (Marostegui) m3 and m5 dbproxies failed over, waiting a few hours for connections to move gracefully to the other proxies [07:14:57] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:15:16] (CR) Majavah: [C: +2] Enable Realtime Preview on cawiki, viwiki, and fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/804806 (https://phabricator.wikimedia.org/T303961) (owner: Samwilson) [07:16:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:26] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:802685|Update $wgVectorMaxWidthOptions to include action=edit (T307725)]] (duration: 03m 36s) [07:16:28] (Merged) jenkins-bot: Enable Realtime Preview on cawiki, viwiki, and fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/804806 (https://phabricator.wikimedia.org/T303961) (owner: Samwilson) [07:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:29] T307725: Make action=edit with 2010 wikitext editor a full-width page in Vector-2022 - https://phabricator.wikimedia.org/T307725 [07:16:47] aaand live [07:16:55] samwilson: second one is available for testing too [07:17:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:17:07] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:13] great! looking now... [07:17:31] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:18:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:40] taavi: yep, looks perfect. go for it. [07:18:52] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (10SLyngshede-WMF) @KFrancis we just need to update Gorans email address, it's still listed as the wikimedia.de address. Could you please provide me with the updated add... [07:19:11] syncing that too [07:20:26] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! 
Merging" [puppet] - 10https://gerrit.wikimedia.org/r/805181 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:20:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [07:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [07:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:36] (03PS2) 10Muehlenhoff: cloudnfs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805182 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:22:30] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:804806|Enable Realtime Preview on cawiki, viwiki, and fawiki (T303961)]] (duration: 03m 20s) [07:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:34] T303961: Rollout plan for real-time preview - https://phabricator.wikimedia.org/T303961 [07:22:34] samwilson: all done! [07:22:44] anyone has anything else to deploy? [07:22:56] taavi: thanks! 
[07:23:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:24:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:53] !log dbmaint s6@eqiad T298563 [07:24:53] apparently no [07:24:54] !log UTC morning deploys done [07:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:57] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [07:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:19] (03PS1) 10David Caro: wmcs: added vm_console runbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805316 (https://phabricator.wikimedia.org/T309930) [07:30:34] (03PS2) 10David Caro: wmcs: added vm_console runbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805316 (https://phabricator.wikimedia.org/T309930) [07:32:45] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! 
Merging" [puppet] - 10https://gerrit.wikimedia.org/r/805182 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:33:16] (03PS2) 10Muehlenhoff: cloudlib: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805183 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:33:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [07:33:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [07:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T310011)', diff saved to https://phabricator.wikimedia.org/P29711 and previous config saved to /var/cache/conftool/dbconfig/20220614-073322-marostegui.json [07:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:26] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [07:35:20] (03CR) 10CI reject: [V: 04-1] wmcs: added vm_console runbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805316 (https://phabricator.wikimedia.org/T309930) (owner: 10David Caro) [07:37:30] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/805183 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:38:17] (03PS2) 10Muehlenhoff: cfssl: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805185 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:38:26] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10fgiunchedi) >>! 
In T307184#7952199, @fgiunchedi wrote: > hi @Jgiannelos, I have resumed work on this and was wondering what's the... [07:40:18] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (10SLyngshede-WMF) We have removed Goran from the WMDE group, as that is only for WMDE staff. [07:40:51] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/805185 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:42:04] (03PS2) 10Muehlenhoff: cinderutils: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805184 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:43:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T310011)', diff saved to https://phabricator.wikimedia.org/P29712 and previous config saved to /var/cache/conftool/dbconfig/20220614-074331-marostegui.json [07:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:35] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [07:43:39] (03CR) 10Muehlenhoff: [C: 03+2] cinderutils: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805184 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:44:53] (03PS2) 10Muehlenhoff: cergen: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805186 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:45:06] !log joal@deploy1002 Started deploy [analytics/refinery@f146a63]: Regular analytics weekly train [analytics/refinery@f146a63] [07:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:45] (03CR) 10Muehlenhoff: [C: 03+2] cergen: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805186 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:46:52] (03CR) 10Muehlenhoff: 
[C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/805186 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:48:30] (03PS2) 10Muehlenhoff: celery: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805187 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:53:02] (03PS3) 10David Caro: wmcs: added vm_console runbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805316 (https://phabricator.wikimedia.org/T309930) [07:55:14] (03PS7) 10Slyngshede: logster::job migrate cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) [07:55:55] (03CR) 10David Caro: [V: 03+1] wmcs: relabel alerts from wmcs cluster with wmcs team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [07:57:14] (03CR) 10Filippo Giunchedi: [C: 03+1] "Good to go! Thanks David for you patience" [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [07:57:16] (03CR) 10CI reject: [V: 04-1] wmcs: added vm_console runbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805316 (https://phabricator.wikimedia.org/T309930) (owner: 10David Caro) [07:58:08] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: include tcp probes in alerts [alerts] - 10https://gerrit.wikimedia.org/r/803902 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [07:58:12] (03PS3) 10Filippo Giunchedi: sre: include tcp probes in alerts [alerts] - 10https://gerrit.wikimedia.org/r/803902 (https://phabricator.wikimedia.org/T291946) [07:58:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P29713 and previous config saved to /var/cache/conftool/dbconfig/20220614-075837-marostegui.json [07:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:25] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35838/console" [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [08:00:36] (03CR) 10Slyngshede: logster::job migrate cron to systemd timer. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [08:01:05] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/805187 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:02:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2003.codfw.wmnet with OS buster [08:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:24] (03PS3) 10Muehlenhoff: burrow: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805189 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:02:44] (03CR) 10David Caro: [V: 03+1 C: 03+2] wmcs: relabel alerts from wmcs cluster with wmcs team [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [08:02:49] (03CR) 10Filippo Giunchedi: "LGTM overall! Nice job!" 
[alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [08:03:26] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/805241 (owner: 10BCornwall) [08:09:27] (03CR) 10Muehlenhoff: [C: 03+2] burrow: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805189 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:10:57] (03PS4) 10David Caro: wmcs: added vm_console runbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805316 (https://phabricator.wikimedia.org/T309930) [08:13:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P29714 and previous config saved to /var/cache/conftool/dbconfig/20220614-081342-marostegui.json [08:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:40] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/804306 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [08:16:16] !log joal@deploy1002 Finished deploy [analytics/refinery@f146a63]: Regular analytics weekly train [analytics/refinery@f146a63] (duration: 31m 09s) [08:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:32] !log dbmaint s6@eqiad T309311 [08:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:35] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [08:18:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2003.codfw.wmnet with reason: host reimage [08:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:45] (03PS5) 10Jaime Nuche: scap: switch over from Debian package to self-installed scap [puppet] - 10https://gerrit.wikimedia.org/r/804306 (https://phabricator.wikimedia.org/T303559) [08:19:10] (03CR) 10Jaime 
Nuche: scap: switch over from Debian package to self-installed scap (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/804306 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [08:20:02] (03PS8) 10Slyngshede: logster::job migrate cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) [08:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:20:47] !log dbmaint s6@eqiad T298560 [08:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:51] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [08:22:21] (03CR) 10Muehlenhoff: [C: 03+2] scap: switch over from Debian package to self-installed scap [puppet] - 10https://gerrit.wikimedia.org/r/804306 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [08:23:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2003.codfw.wmnet with reason: host reimage [08:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:21] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35839/console" [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [08:28:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T310011)', diff saved to https://phabricator.wikimedia.org/P29715 and previous config saved to /var/cache/conftool/dbconfig/20220614-082847-marostegui.json [08:28:49] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1179.eqiad.wmnet with reason: Maintenance [08:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1179.eqiad.wmnet with reason: Maintenance [08:28:51] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [08:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T310011)', diff saved to https://phabricator.wikimedia.org/P29716 and previous config saved to /var/cache/conftool/dbconfig/20220614-082855-marostegui.json [08:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:59] (03PS5) 10Btullis: Add the analytics contact group to all relevant hosts in icinga [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T309649) [08:37:15] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:38:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T310011)', diff saved to https://phabricator.wikimedia.org/P29717 and previous config saved to /var/cache/conftool/dbconfig/20220614-083807-marostegui.json [08:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:12] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [08:38:32] jouncebot: next [08:38:32] In 4 hour(s) and 21 minute(s): UTC afternoon backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220614T1300) [08:38:45] !log reboot centrallog2002 - T310483 [08:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:30] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite2003.codfw.wmnet [08:39:30] !log joal@deploy1002 Started deploy [analytics/refinery@f146a63]: Regular analytics weekly train - Second [analytics/refinery@f146a63] [08:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:03] PROBLEM - Host centrallog2002 is DOWN: PING CRITICAL - Packet loss = 100% [08:41:41] RECOVERY - Host centrallog2002 is UP: PING OK - Packet loss = 0%, RTA = 31.77 ms [08:44:16] !log joal@deploy1002 Finished deploy [analytics/refinery@f146a63]: Regular analytics weekly train - Second [analytics/refinery@f146a63] (duration: 04m 45s) [08:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:56] !log joal@deploy1002 Started deploy [analytics/refinery@f146a63] (thin): Regular analytics weekly train - THIN [analytics/refinery@f146a63] [08:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:05] !log joal@deploy1002 Finished deploy [analytics/refinery@f146a63] (thin): Regular analytics weekly train - THIN [analytics/refinery@f146a63] (duration: 00m 08s) [08:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:37] !log joal@deploy1002 Started deploy [analytics/refinery@f146a63] (hadoop-test): Regular analytics weekly train - TEST [analytics/refinery@f146a63] [08:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:08] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (10BTullis) 
I can look into compiling bigtop 1.5 for Bullseye - as we'll need it anyway. [08:46:31] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1001.eqiad.wmnet [08:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:05] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2003.codfw.wmnet [08:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:49] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org [08:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:08] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. [08:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:18] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet [08:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:40] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet [08:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:05] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8 [08:51:50] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
[08:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:04] !log joal@deploy1002 Finished deploy [analytics/refinery@f146a63] (hadoop-test): Regular analytics weekly train - TEST [analytics/refinery@f146a63] (duration: 07m 27s) [08:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:11] (03PS1) 10Slyngshede: Shell access for xcollazo [puppet] - 10https://gerrit.wikimedia.org/r/805322 [08:53:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P29718 and previous config saved to /var/cache/conftool/dbconfig/20220614-085312-marostegui.json [08:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [08:55:57] (03PS2) 10Slyngshede: Shell access for xcollazo [puppet] - 10https://gerrit.wikimedia.org/r/805322 (https://phabricator.wikimedia.org/T310555) [08:56:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2003.codfw.wmnet with OS buster [08:56:05] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host netmon1003.wikimedia.org [08:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:42] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-fe1001.eqiad.wmnet [08:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:49] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet [08:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:06] !log 
filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host graphite1004.eqiad.wmnet [08:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:55] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [08:59:11] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 4.745e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11 [08:59:23] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet [08:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [08:59:31] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1002.eqiad.wmnet [08:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:10] the backend failures for cirrussearch is due to the graphite1004 reboot, ditto osm sync lag actually I think [09:00:56] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2002.codfw.wmnet [09:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:59] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1058.eqiad.wmnet with OS bullseye [09:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:04] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1058.eqiad.wmnet with OS bullseye [09:01:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, but needs on task approval for "analytics-privatedata-users"" [puppet] - 10https://gerrit.wikimedia.org/r/805322 (https://phabricator.wikimedia.org/T310555) (owner: 10Slyngshede) [09:01:25] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet [09:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:36] (03PS1) 10Btullis: Remove stray letters from the JVM heap config [puppet] - 10https://gerrit.wikimedia.org/r/805323 (https://phabricator.wikimedia.org/T310293) [09:04:12] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1002.eqiad.wmnet [09:04:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:18] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (10SLyngshede-WMF) This need approval of @odimitrijevic or @Ottomata [09:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:05:42] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2002.codfw.wmnet [09:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [09:08:03] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet [09:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P29719 and previous config saved to /var/cache/conftool/dbconfig/20220614-090817-marostegui.json [09:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:47] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:08:57] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
thanos-fe2003.codfw.wmnet [09:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:01] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1003.eqiad.wmnet [09:09:02] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet [09:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:06] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [09:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:16] (ThanosSidecarPrometheusDown) firing: Thanos Sidecar cannot connect to Prometheus - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarPrometheusDown [09:09:16] (ThanosSidecarUnhealthy) firing: Thanos Sidecar is unhealthy. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarUnhealthy [09:09:23] (03PS2) 10Muehlenhoff: backy2: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805192 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:09:25] (03PS1) 10Slyngshede: Extend piccardi [puppet] - 10https://gerrit.wikimedia.org/r/805325 [09:09:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [09:10:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [09:10:51] (03CR) 10Muehlenhoff: [C: 03+2] backy2: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805192 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:11:42] (03PS2) 10Muehlenhoff: atskafka: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805193 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:12:25] (03CR) 10Elukey: [C: 03+1] Add inference-staging service IP (10.2.1.58) [dns] - 10https://gerrit.wikimedia.org/r/805135 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [09:13:49] (03CR) 10Muehlenhoff: [C: 03+2] aqs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805194 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:13:56] (03PS2) 10Muehlenhoff: aqs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805194 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:14:16] (ThanosSidecarPrometheusDown) resolved: Thanos Sidecar cannot connect to Prometheus - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarPrometheusDown [09:14:16] (ThanosSidecarUnhealthy) resolved: Thanos Sidecar is unhealthy. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarUnhealthy [09:14:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [09:14:43] (03PS1) 10Filippo Giunchedi: pontoon: reload haproxy post-certbot [puppet] - 10https://gerrit.wikimedia.org/r/805326 [09:14:52] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2003.codfw.wmnet [09:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:59] (03CR) 10Muehlenhoff: "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/805194 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:15:02] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1003.eqiad.wmnet [09:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1058.eqiad.wmnet with reason: host reimage [09:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:20] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2001.codfw.wmnet [09:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:26] (03CR) 10Btullis: [C: 03+2] Remove stray letters from the JVM heap config [puppet] - 10https://gerrit.wikimedia.org/r/805323 (https://phabricator.wikimedia.org/T310293) (owner: 10Btullis) [09:16:45] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet [09:16:46] (03PS2) 10Muehlenhoff: 
apereo_cas: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805195 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [09:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:58] (PS5) Klausman: Add nference-staging service in codfw [puppet] - https://gerrit.wikimedia.org/r/805134 [09:17:18] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet [09:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:28] (PS4) Klausman: Add inference-staging service IP (10.2.1.58) [dns] - https://gerrit.wikimedia.org/r/805135 (https://phabricator.wikimedia.org/T302195) [09:17:34] (CR) Slyngshede: [C: +2] Extend piccardi [puppet] - https://gerrit.wikimedia.org/r/805325 (owner: Slyngshede) [09:18:19] (CR) Muehlenhoff: [C: +2] apereo_cas: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805195 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [09:18:46] (CR) JMeybohm: Add nference-staging service in codfw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/805134 (owner: Klausman) [09:18:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1058.eqiad.wmnet with reason: host reimage [09:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:17] (CR) Muehlenhoff: [C: +2] alternatives: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805196 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [09:19:54] (CR) Muehlenhoff: [C: +2] "Thanks!
Merging" [puppet] - https://gerrit.wikimedia.org/r/805196 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [09:19:59] (PS2) Muehlenhoff: alternatives: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805196 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [09:20:25] (PS3) Slyngshede: Shell access for xcollazo [puppet] - https://gerrit.wikimedia.org/r/805322 (https://phabricator.wikimedia.org/T310555) [09:21:26] (CR) Klausman: [C: +2] Add inference-staging service IP (10.2.1.58) [dns] - https://gerrit.wikimedia.org/r/805135 (https://phabricator.wikimedia.org/T302195) (owner: Klausman) [09:21:29] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2001.codfw.wmnet [09:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:36] (PS1) David Caro: wmcs.ceph: don't use sre upgrade-and-reboot [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805327 (https://phabricator.wikimedia.org/T309786) [09:22:58] (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules.
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [09:23:20] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2001.codfw.wmnet [09:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T310011)', diff saved to https://phabricator.wikimedia.org/P29720 and previous config saved to /var/cache/conftool/dbconfig/20220614-092322-marostegui.json [09:23:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:23:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:26] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [09:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T310011)', diff saved to https://phabricator.wikimedia.org/P29721 and previous config saved to /var/cache/conftool/dbconfig/20220614-092330-marostegui.json [09:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:58] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [09:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:01] (CR) Filippo Giunchedi: [C: +2] pontoon: reload haproxy post-certbot [puppet] - https://gerrit.wikimedia.org/r/805326
(owner: Filippo Giunchedi) [09:25:15] (PS6) Klausman: Add inference-staging service in codfw [puppet] - https://gerrit.wikimedia.org/r/805134 [09:25:28] (CR) Klausman: Add inference-staging service in codfw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/805134 (owner: Klausman) [09:27:04] (CR) Klausman: [C: +2] Add inference-staging service in codfw [puppet] - https://gerrit.wikimedia.org/r/805134 (owner: Klausman) [09:27:44] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:08] (PS2) David Caro: wmcs.ceph: don't use sre upgrade-and-reboot [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805327 (https://phabricator.wikimedia.org/T309786) [09:28:21] (PS2) Muehlenhoff: airflow: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805197 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [09:30:39] (PS3) Lucas Werkmeister (WMDE): maintain-meta_p: stop reading VariantSettings.php [puppet] - https://gerrit.wikimedia.org/r/665116 [09:32:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1058.eqiad.wmnet with OS bullseye [09:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:28] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1058.eqiad.wmnet with OS bullseye completed: - ms-be1058 (**PASS**) - Downtim...
[09:32:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T310011)', diff saved to https://phabricator.wikimedia.org/P29722 and previous config saved to /var/cache/conftool/dbconfig/20220614-093240-marostegui.json [09:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:45] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [09:32:48] (PS2) Lucas Werkmeister (WMDE): puppet_alert: Improve message [puppet] - https://gerrit.wikimedia.org/r/791559 [09:34:13] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/805197 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [09:36:52] (PS1) Btullis: Add DNS CNAME records for datahub ingress on k8s [dns] - https://gerrit.wikimedia.org/r/805328 (https://phabricator.wikimedia.org/T303049) [09:37:05] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (SLyngshede-WMF) p:Triage→High a:SLyngshede-WMF [09:37:19] (PS2) Btullis: Add DNS CNAME records for datahub ingress on k8s [dns] - https://gerrit.wikimedia.org/r/805328 (https://phabricator.wikimedia.org/T303049) [09:38:37] SRE, Commons, MediaWiki-File-management, StructuredDataOnCommons, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (TheDJ) [09:38:57] (PS1) Klausman: service::catalog: Add inference-staging service [puppet] - https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) [09:40:18] (PS2) Klausman: service::catalog: Add inference-staging service [puppet] - https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) [09:47:46] !log marostegui@cumin1001 dbctl commit (dc=all):
'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P29723 and previous config saved to /var/cache/conftool/dbconfig/20220614-094745-marostegui.json [09:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:03] (CR) Elukey: service::catalog: Add inference-staging service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) (owner: Klausman) [09:51:42] (PS3) Klausman: service::catalog: Add inference-staging service [puppet] - https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) [09:51:49] (CR) Klausman: service::catalog: Add inference-staging service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) (owner: Klausman) [09:52:24] (PS4) Klausman: service::catalog: Add inference-staging service [puppet] - https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) [09:56:17] (PS6) Btullis: Add DataHub GMS and frontend services to the service catalog [puppet] - https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) [09:56:57] (PS7) Btullis: Add DataHub GMS and frontend services to the service catalog [puppet] - https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) [09:57:42] (CR) Btullis: Add DataHub GMS and frontend services to the service catalog (2 comments) [puppet] - https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) (owner: Btullis) [10:02:40] (PS1) Btullis: Update the trafficserver rule for datahub [puppet] - https://gerrit.wikimedia.org/r/805331 (https://phabricator.wikimedia.org/T303049) [10:02:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P29724 and previous config saved to
/var/cache/conftool/dbconfig/20220614-100250-marostegui.json [10:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:59] !log rename Ganeti group row_A in test cluster to row_A-test [10:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:10] (CR) Jbond: [C: +2] docker_registry_ha: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804474 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [10:06:12] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:06:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2004.codfw.wmnet with OS buster [10:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:26] (PS2) Jbond: cacheproxy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805188 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [10:06:40] (PS1) Muehlenhoff: Update row name in test cluster [software/spicerack] - https://gerrit.wikimedia.org/r/805334 [10:06:43] (CR) Jbond: [C: +2] cacheproxy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805188 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [10:06:56] (PS2) Jbond: bsection: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805190 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [10:07:20] (CR) Volans: [C: +2] "Thanks a lot!"
[software/spicerack] - https://gerrit.wikimedia.org/r/805334 (owner: Muehlenhoff) [10:07:54] (PS2) Jbond: bigtop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805191 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [10:08:13] (CR) Jbond: [C: +2] atskafka: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805193 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [10:08:51] SRE, Data-Catalog, Data-Engineering, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) OK @JMeybohm I've created three CRs that I think should do what we need to finish this. * Adding CNAME records to DNS * Adding service catalog entries... [10:09:03] (CR) Elukey: "One last bit and it looks good afaics!" [puppet] - https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) (owner: Klausman) [10:09:14] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:09:17] (CR) CI reject: [V: -1] bsection: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805190 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [10:09:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: After migrating to 10.6', diff saved to https://phabricator.wikimedia.org/P29725 and previous config saved to /var/cache/conftool/dbconfig/20220614-100930-root.json [10:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:54] (PS5) Klausman: service::catalog: Add inference-staging service [puppet] - https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) [10:10:06] (CR) Klausman: service::catalog: Add inference-staging service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[10:10:26] (CR) Btullis: [C: +2] Add the analytics contact group to all relevant hosts in icinga [puppet] - https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T309649) (owner: Btullis) [10:11:16] (CR) Elukey: [C: +1] "LGTM! Before merging, let's wait for Valentin to review just to be sure that we are not missing anything." [puppet] - https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) (owner: Klausman) [10:11:30] (CR) Hnowlan: [C: +2] Move tilerator regeneration from crontab to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/791035 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [10:13:47] (CR) Ayounsi: [C: +1] Update row name in test cluster [software/spicerack] - https://gerrit.wikimedia.org/r/805334 (owner: Muehlenhoff) [10:14:21] (PS3) Jbond: bigtop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805191 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [10:14:23] elukey: just as a heads up I'm out of office till Monday [10:14:37] (PS4) Jbond: bigtop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805191 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [10:14:54] (Merged) jenkins-bot: Update row name in test cluster [software/spicerack] - https://gerrit.wikimedia.org/r/805334 (owner: Muehlenhoff) [10:16:00] (CR) Jbond: [C: +2] bigtop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805191 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [10:16:26] (PS4) Volans: ganeti-netbox-sync: refactor into classes [software/netbox-extras] - https://gerrit.wikimedia.org/r/802178 [10:16:28] (PS5) Volans: Netbox Ganeti sync: add groups support [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) [10:17:28] (PS3) David Caro: wmcs.ceph: don't use sre upgrade-and-reboot [cookbooks] (wmcs) -
https://gerrit.wikimedia.org/r/805327 (https://phabricator.wikimedia.org/T309786) [10:17:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T310011)', diff saved to https://phabricator.wikimedia.org/P29726 and previous config saved to /var/cache/conftool/dbconfig/20220614-101755-marostegui.json [10:17:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:18:01] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [10:18:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: Maintenance [10:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: Maintenance [10:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:22] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:18:24] (PS4) David Caro: wmcs.ceph: don't use sre upgrade-and-reboot [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805327 (https://phabricator.wikimedia.org/T309786) [10:19:19] (PS6) Volans: Netbox Ganeti sync: add groups support [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179
(https://phabricator.wikimedia.org/T262446) [10:19:34] !log dbmaint s6@eqiad T60674 [10:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:37] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [10:20:01] (CR) CI reject: [V: -1] Netbox Ganeti sync: add groups support [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: Volans) [10:21:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3001.esams.wmnet [10:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:23] (PS7) Volans: Netbox Ganeti sync: add groups support [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) [10:22:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1173.eqiad.wmnet with OS bullseye [10:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2004.codfw.wmnet with reason: host reimage [10:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: After migrating to 10.6', diff saved to https://phabricator.wikimedia.org/P29727 and previous config saved to /var/cache/conftool/dbconfig/20220614-102433-root.json [10:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:23] (PS1) Volans: Netbox: adapt ganeti-sync config file [puppet] - https://gerrit.wikimedia.org/r/805337 (https://phabricator.wikimedia.org/T262446) [10:25:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2004.codfw.wmnet with reason: host reimage [10:25:39] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3001.esams.wmnet [10:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3001.esams.wmnet to ganeti01.svc.esams.wmnet [10:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti3001.esams.wmnet to ganeti01.svc.esams.wmnet [10:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:56] SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (MoritzMuehlenhoff) [10:33:32] SRE-swift-storage, Maps, Product-Infrastructure-Team-Backlog, User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (Jgiannelos) Hey @fgiunchedi the size of the current active deployment is stabilized at ~12269804 objects for quite some time now... [10:34:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3002.esams.wmnet [10:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:21] (PS7) Hnowlan: Move more OSM cronjobs to systemd timers.
[puppet] - https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [10:37:36] PROBLEM - hue.wikimedia.org requires authentication on an-tool1009 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:37:42] PROBLEM - Check systemd state on an-tool1009 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:38:24] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:39:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: After migrating to 10.6', diff saved to https://phabricator.wikimedia.org/P29728 and previous config saved to /var/cache/conftool/dbconfig/20220614-103937-root.json [10:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3002.esams.wmnet [10:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1112.eqiad.wmnet with reason: Maintenance [10:40:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1112.eqiad.wmnet with reason: Maintenance [10:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:40:14] (PS2) Volans: Netbox: adapt ganeti-sync config file [puppet] - https://gerrit.wikimedia.org/r/805337 (https://phabricator.wikimedia.org/T262446)
[10:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T310011)', diff saved to https://phabricator.wikimedia.org/P29729 and previous config saved to /var/cache/conftool/dbconfig/20220614-104021-marostegui.json [10:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:26] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [10:41:06] (CR) Volans: "addressed comments" [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: Volans) [10:42:28] SRE, SRE-Access-Requests: Requesting access to PII in Superset for TheresNoTime - https://phabricator.wikimedia.org/T309383 (TheresNoTime) Resolved→Open @MoritzMuehlenhoff just tried accessing https://superset.wikimedia.org/superset/dashboard/309/ and got ` Error: {'message': 'Permission denied:... [10:42:46] (PS1) Jbond: reqconfig: Add a default for git_repo and ensure its a Path [software/conftool] - https://gerrit.wikimedia.org/r/805338 [10:43:40] (PS3) Volans: Netbox: adapt ganeti-sync config file [puppet] - https://gerrit.wikimedia.org/r/805337 (https://phabricator.wikimedia.org/T262446) [10:44:08] (CR) Jbond: "This check is a bit racy and wouldn't allow for two people editing the private repo at the same time.
We should probably have some lockin" [puppet] - https://gerrit.wikimedia.org/r/803560 (owner: Jbond) [10:44:32] !log joal@deploy1002 Started deploy [airflow-dags/analytics@24d8d72]: Upgrade jobs to spark3 and add consistency [10:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:40] (PS2) Jbond: reqconfig: Add a default for git_repo and ensure its a Path [software/conftool] - https://gerrit.wikimedia.org/r/805338 [10:44:41] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@24d8d72]: Upgrade jobs to spark3 and add consistency (duration: 00m 09s) [10:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:47] (CR) Slyngshede: [V: +1 C: +2] logster::job migrate cron to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [10:46:34] (CR) CI reject: [V: -1] reqconfig: Add a default for git_repo and ensure its a Path [software/conftool] - https://gerrit.wikimedia.org/r/805338 (owner: Jbond) [10:47:24] (PS4) Volans: Netbox: adapt ganeti-sync config file [puppet] - https://gerrit.wikimedia.org/r/805337 (https://phabricator.wikimedia.org/T262446) [10:48:52] ops-eqiad, DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (Marostegui) [10:49:15] ops-eqiad, DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (Marostegui) p:Triage→High [10:49:36] (PS5) Volans: Netbox: adapt ganeti-sync config file [puppet] - https://gerrit.wikimedia.org/r/805337 (https://phabricator.wikimedia.org/T262446) [10:51:27] (PS1) Jaime Nuche: scap: fix bootstrap [puppet] - https://gerrit.wikimedia.org/r/805340 (https://phabricator.wikimedia.org/T309713) [10:52:00] (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35845/console" [puppet] -
https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [10:52:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4004.ulsfo.wmnet [10:53:00] (CR) Hnowlan: [V: +1 C: +2] Move more OSM cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [10:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3003.esams.wmnet [10:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: After migrating to 10.6', diff saved to https://phabricator.wikimedia.org/P29730 and previous config saved to /var/cache/conftool/dbconfig/20220614-105441-root.json [10:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4004.ulsfo.wmnet [10:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:17] (CR) Jbond: Netbox: adapt ganeti-sync config file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/805337 (https://phabricator.wikimedia.org/T262446) (owner: Volans) [10:58:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3003.esams.wmnet [10:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:51] (PS6) Volans: Netbox: adapt ganeti-sync config file [puppet] - https://gerrit.wikimedia.org/r/805337 (https://phabricator.wikimedia.org/T262446) [10:59:22] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:59:56] (CR)
Volans: "Addressed comment." [puppet] - https://gerrit.wikimedia.org/r/805337 (https://phabricator.wikimedia.org/T262446) (owner: Volans) [11:02:02] !log rebalancing ganeti cluster in esams T308238 [11:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:05] T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 [11:03:23] ops-eqiad, DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (Marostegui) [11:05:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T310011)', diff saved to https://phabricator.wikimedia.org/P29731 and previous config saved to /var/cache/conftool/dbconfig/20220614-110504-marostegui.json [11:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:10] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [11:06:33] (CR) Jbond: [C: +1] Netbox: adapt ganeti-sync config file [puppet] - https://gerrit.wikimedia.org/r/805337 (https://phabricator.wikimedia.org/T262446) (owner: Volans) [11:09:01] (JobUnavailable) firing: Reduced availability for job cassandra in analytics@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:09:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: After migrating to 10.6', diff saved to https://phabricator.wikimedia.org/P29732 and previous config saved to /var/cache/conftool/dbconfig/20220614-110945-root.json [11:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:29] (CR) JMeybohm: [C: +1] Add DNS CNAME records for datahub ingress on k8s [dns] - https://gerrit.wikimedia.org/r/805328
(https://phabricator.wikimedia.org/T303049) (owner: Btullis) [11:10:31] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1173.eqiad.wmnet with OS bullseye [11:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:08] (CR) JMeybohm: [C: +1] "LGTM - don't forget to follow up with a transition to "state: production": https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Create_a" [puppet] - https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) (owner: Btullis) [11:17:21] hmm where'd you go stashbot.. [11:17:36] SRE, Data-Catalog, Data-Engineering, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (JMeybohm) Cool, thanks! +1ed the first two. The service::catalog entries should be in stage production before switching trafficserver to the discovery record ju... [11:19:26] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:19:44] (CR) Hnowlan: [C: +2] changeprop: Modify page denylist [deployment-charts] - https://gerrit.wikimedia.org/r/803877 (https://phabricator.wikimedia.org/T274359) (owner: Samtar) [11:19:56] (CR) Ayounsi: [C: +1] "It might be worth keeping the "api" line (even with hardcoded port) the time everything is migrated, so migration (and rollback) would be " [puppet] - https://gerrit.wikimedia.org/r/805337 (https://phabricator.wikimedia.org/T262446) (owner: Volans) [11:20:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P29733 and previous config saved to /var/cache/conftool/dbconfig/20220614-112009-marostegui.json [11:23:15] (Merged) jenkins-bot: changeprop: Modify page denylist [deployment-charts] - https://gerrit.wikimedia.org/r/803877
(https://phabricator.wikimedia.org/T274359) (owner: 10Samtar) [11:27:13] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync [11:27:27] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [11:27:54] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [11:28:05] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [11:35:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P29734 and previous config saved to /var/cache/conftool/dbconfig/20220614-113515-marostegui.json [11:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T310011)', diff saved to https://phabricator.wikimedia.org/P29735 and previous config saved to /var/cache/conftool/dbconfig/20220614-115020-marostegui.json [11:50:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:50:24] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [11:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:15] (03CR) 10Muehlenhoff: [C: 03+2] scap: fix bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/805340 (https://phabricator.wikimedia.org/T309713) (owner: 10Jaime Nuche) [12:03:05] 10SRE, 10SRE-swift-storage, 10Community-Tech, 10MediaWiki-Parser, and 
3 others: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (10Winston_Sung) [12:03:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [12:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [12:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T310011)', diff saved to https://phabricator.wikimedia.org/P29737 and previous config saved to /var/cache/conftool/dbconfig/20220614-120323-marostegui.json [12:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:28] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [12:05:31] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:08:01] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:08:36] (03CR) 10Hnowlan: [C: 03+2] Move Prometheus postgresql lag metric collector to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/792106 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:15:49] (03CR) 10Btullis: [C: 03+2] Add DNS CNAME records for datahub ingress on k8s [dns] - 10https://gerrit.wikimedia.org/r/805328 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [12:17:47] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:53] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:49] PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:23] PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:31] (03CR) 10Btullis: [C: 03+2] Add DataHub GMS and frontend services to the service catalog [puppet] - 10https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) (owner: 10Btullis) [12:20:17] (03PS1) 10KartikMistry: testwiki: Enable SectionTranslation for 11 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805370 (https://phabricator.wikimedia.org/T309384) [12:20:31] (03CR) 10Muehlenhoff: [C: 03+2] "Looks good, merging" [puppet] - 10https://gerrit.wikimedia.org/r/804311 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [12:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:26:09] PROBLEM - Check systemd state on maps1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:12] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01627 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:26:19] PROBLEM - cassandra-a CQL 10.192.0.220:9042 on aqs2004 is CRITICAL: connect to address 10.192.0.220 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [12:26:29] PROBLEM - AQS root url on aqs2004 is CRITICAL: connect to address 10.192.0.212 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [12:26:33] PROBLEM - cassandra-b SSL 10.192.0.221:7001 on aqs2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [12:26:39] PROBLEM - cassandra-a SSL 10.192.0.220:7001 on aqs2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [12:27:07] PROBLEM - cassandra-a service on aqs2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:27:23] PROBLEM - cassandra-b CQL 10.192.0.221:9042 on aqs2004 is CRITICAL: connect to address 10.192.0.221 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [12:27:47] PROBLEM - cassandra-b service on aqs2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:28:23] PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[12:32:51] (03CR) 10Muehlenhoff: [C: 03+2] Remove tendril leftover [puppet] - 10https://gerrit.wikimedia.org/r/805176 (owner: 10Muehlenhoff) [12:32:57] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:00] sending a fix to the puppet errors now, culprit is https://gerrit.wikimedia.org/r/c/operations/puppet/+/804311 [12:38:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2005.codfw.wmnet with OS buster [12:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:41] (03PS1) 10Jbond: cap: add version back [puppet] - 10https://gerrit.wikimedia.org/r/805374 (https://phabricator.wikimedia.org/T303559) [12:39:15] PROBLEM - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2006.codfw.wmnet with OS buster [12:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2004.codfw.wmnet with OS buster [12:40:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35849/console" [puppet] - 10https://gerrit.wikimedia.org/r/805374 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [12:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:27] (03CR) 10Jbond: "fyi this broke puppet see https://gerrit.wikimedia.org/r/c/operations/puppet/+/805374" [puppet] - 10https://gerrit.wikimedia.org/r/804311 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [12:40:37] !log mvernon@cumin2002 START 
- Cookbook sre.hosts.reimage for host aqs2007.codfw.wmnet with OS buster [12:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:06] (03CR) 10Jbond: [V: 03+1 C: 03+2] cap: add version back [puppet] - 10https://gerrit.wikimedia.org/r/805374 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [12:41:12] (03CR) 10Jbond: [V: 03+2 C: 03+2] cap: add version back [puppet] - 10https://gerrit.wikimedia.org/r/805374 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [12:41:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T310011)', diff saved to https://phabricator.wikimedia.org/P29738 and previous config saved to /var/cache/conftool/dbconfig/20220614-124139-marostegui.json [12:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:43] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [12:41:52] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2008.codfw.wmnet with OS buster [12:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:17] PROBLEM - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:21] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:41] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2009.codfw.wmnet with OS buster [12:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:05] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:43:49] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005598 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:43:59] (03CR) 10Jaime Nuche: scap: remove scap Debian package from targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804311 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [12:44:56] (03CR) 10Jbond: scap: remove scap Debian package from targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804311 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [12:45:05] (03PS1) 10David Caro: wmcs: move alerting code to a library [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805376 (https://phabricator.wikimedia.org/T309786) [12:45:07] (03PS1) 10David Caro: wmcs.ceph.upgrade*: add sal logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805377 (https://phabricator.wikimedia.org/T309786) [12:45:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2010.codfw.wmnet with OS buster [12:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:59] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2011.codfw.wmnet with OS buster [12:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:45] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2012.codfw.wmnet with OS buster [12:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:09] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:44] (03CR) 10CI reject: [V: 04-1] wmcs: move alerting code to a library [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805376 (https://phabricator.wikimedia.org/T309786) 
(owner: 10David Caro) [12:52:58] 10SRE-OnFire, 10Traffic, 10Sustainability (Incident Followup): (Re) evaluate effectiveness / usefulness of varnish/haproxy traffic drop alerts - https://phabricator.wikimedia.org/T310608 (10fgiunchedi) [12:53:21] (03CR) 10CI reject: [V: 04-1] wmcs.ceph.upgrade*: add sal logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805377 (https://phabricator.wikimedia.org/T309786) (owner: 10David Caro) [12:53:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2005.codfw.wmnet with reason: host reimage [12:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:57] (03PS2) 10Muehlenhoff: Remove secteam-users group [puppet] - 10https://gerrit.wikimedia.org/r/805167 [12:56:07] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2006.codfw.wmnet with reason: host reimage [12:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P29739 and previous config saved to /var/cache/conftool/dbconfig/20220614-125644-marostegui.json [12:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2005.codfw.wmnet with reason: host reimage [12:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2007.codfw.wmnet with reason: host reimage [12:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:47] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2008.codfw.wmnet with reason: host reimage [12:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:41] 10SRE, 
10Traffic, 10Wikimedia-Incident: 503 Service Unavailable - https://phabricator.wikimedia.org/T310368 (10fgiunchedi) Please see https://wikitech.wikimedia.org/wiki/Incidents/2022-06-14_overload_varnish_/_haproxy for the public incident report (we know what's going on, the report is light on details on... [12:59:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2009.codfw.wmnet with reason: host reimage [12:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2006.codfw.wmnet with reason: host reimage [12:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220614T1300). [13:00:05] MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] hello! I can deploy today [13:00:26] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (10Ottomata) Approved! [13:01:25] MatmaRex: hi, are you around? 
[13:01:32] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on aqs2007.codfw.wmnet with reason: host reimage [13:01:34] hello [13:01:35] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2010.codfw.wmnet with reason: host reimage [13:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:43] let's start then :) [13:01:51] (03PS3) 10Urbanecm: Disable DiscussionTools' visualenhancements feature in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804395 (owner: 10Esanders) [13:02:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2008.codfw.wmnet with reason: host reimage [13:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:23] (03PS2) 10Urbanecm: Make new topic tool available as opt-out almost everywhere (phase 4) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805245 (https://phabricator.wikimedia.org/T310392) (owner: 10Bartosz Dziewoński) [13:02:25] (03CR) 10Muehlenhoff: [C: 03+2] Remove secteam-users group [puppet] - 10https://gerrit.wikimedia.org/r/805167 (owner: 10Muehlenhoff) [13:02:28] (03CR) 10Urbanecm: [C: 03+2] Make new topic tool available as opt-out almost everywhere (phase 4) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805245 (https://phabricator.wikimedia.org/T310392) (owner: 10Bartosz Dziewoński) [13:02:36] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2011.codfw.wmnet with reason: host reimage [13:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:50] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0005105 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:02:56] MatmaRex: starting with Make new topic tool available as opt-out almost everywhere , as it looks the other one is no-op until the DT patch is merged (so no need for testing) [13:03:07] yes [13:03:15] (03Merged) 10jenkins-bot: Make new topic tool available as opt-out almost everywhere (phase 4) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805245 (https://phabricator.wikimedia.org/T310392) (owner: 10Bartosz Dziewoński) [13:03:20] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:59] MatmaRex: 805245 is at mwdebug1001, can you have a look please? [13:04:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2009.codfw.wmnet with reason: host reimage [13:04:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2012.codfw.wmnet with reason: host reimage [13:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:47] urbanecm: looks good [13:04:51] syncing [13:05:01] (03PS4) 10Urbanecm: Disable DiscussionTools' visualenhancements feature in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804395 (owner: 10Esanders) [13:05:04] (03CR) 10Urbanecm: [C: 03+2] Disable DiscussionTools' visualenhancements feature in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804395 (owner: 10Esanders) [13:05:36] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:05:58] (03Merged) 10jenkins-bot: Disable DiscussionTools' visualenhancements feature in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804395 (owner: 10Esanders) [13:06:09] ACKNOWLEDGEMENT - MD RAID on aqs2005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.42. Check system logs on 10.192.16.42 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T310610 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [13:06:14] 10SRE, 10ops-codfw: Degraded RAID on aqs2005 - https://phabricator.wikimedia.org/T310610 (10ops-monitoring-bot) [13:06:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2010.codfw.wmnet with reason: host reimage [13:06:39] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on aqs2012.codfw.wmnet with reason: host reimage [13:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:11] (03PS1) 10Slyngshede: P:maps::osm_replica fix prom-replication lag script. [puppet] - 10https://gerrit.wikimedia.org/r/805381 [13:08:39] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7f2dc7296f0c25d00e45651c50c3e45733cc63b3: Make new topic tool available as opt-out almost everywhere (phase 4; T310392) (duration: 03m 45s) [13:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:42] T310392: [Config Change] Enable New Topic Tool as opt-out at Phase 4 wikis (desktop) - https://phabricator.wikimedia.org/T310392 [13:08:53] MatmaRex: first one's live. the other one is syncing now. 
[13:08:53] PROBLEM - puppet last run on aqs2007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.169: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:08:55] anything else? :) [13:09:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:06] (03PS1) 10Muehlenhoff: Update point of contact for three users [puppet] - 10https://gerrit.wikimedia.org/r/805382 [13:09:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2011.codfw.wmnet with reason: host reimage [13:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:18] thanks urbanecm! [13:09:41] np [13:09:42] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35850/console" [puppet] - 10https://gerrit.wikimedia.org/r/805381 (owner: 10Slyngshede) [13:09:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:10:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:11:03] (03PS1) 10Filippo Giunchedi: am: use SafeLoader for team regexes [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/805383 [13:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:05] (03PS1) 10Filippo Giunchedi: am: retry on CGI failure or empty output [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/805384 (https://phabricator.wikimedia.org/T310331) [13:11:50] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P29740 and previous config saved to /var/cache/conftool/dbconfig/20220614-131149-marostegui.json [13:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:45] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1692de09bf04c724cf416679405d4b6485550d40: Disable DiscussionTools visualenhancements feature in production (duration: 03m 25s) [13:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:55] MatmaRex: and, all done. anything else? [13:13:17] no, just the quick ones today [13:13:18] thanks [13:13:28] no problem! [13:13:42] !log UTC afternoon B&C window done [13:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:49] PROBLEM - dhclient process on aqs2012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.189: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [13:14:13] RECOVERY - puppet last run on aqs2007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:16:35] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:43] PROBLEM - puppet last run on aqs2012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.189: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:17:06] (03CR) 10Muehlenhoff: [C: 03+2] Update point of contact for three users [puppet] - 10https://gerrit.wikimedia.org/r/805382 (owner: 10Muehlenhoff) [13:17:31] PROBLEM - AQS root url on aqs2007 is CRITICAL: connect to address 10.192.16.169 and port 7232: Connection refused 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [13:17:50] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (10GoranSMilovanovic) @SLyngshede-WMF @KFrancis It's `goran.s.milovanovic@gmail.com`. [13:19:53] (03CR) 10Urbanecm: [C: 03+1] "This should be good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804805 (https://phabricator.wikimedia.org/T310456) (owner: 10Legoktm) [13:21:26] PROBLEM - cassandra-a CQL 10.192.16.186:9042 on aqs2007 is CRITICAL: connect to address 10.192.16.186 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [13:21:38] PROBLEM - Host aqs2012 is DOWN: PING CRITICAL - Packet loss = 100% [13:22:22] RECOVERY - puppet last run on aqs2012 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:22:24] RECOVERY - Host aqs2012 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [13:23:04] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:06] PROBLEM - cassandra-a service on aqs2007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:26:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T310011)', diff saved to https://phabricator.wikimedia.org/P29741 and previous config saved to /var/cache/conftool/dbconfig/20220614-132654-marostegui.json [13:26:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:26:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:26:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:00] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [13:27:00] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:50] PROBLEM - cassandra-b CQL 10.192.16.187:9042 on aqs2007 is CRITICAL: connect to address 10.192.16.187 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [13:28:16] PROBLEM - cassandra-a CQL 10.192.48.198:9042 on aqs2012 is CRITICAL: connect to address 10.192.48.198 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [13:28:21] (03PS2) 10BCornwall: Traffic: Reorganize into more, smaller files [alerts] - 10https://gerrit.wikimedia.org/r/805241 [13:30:02] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:27] (03CR) 10BCornwall: [C: 03+2] Traffic: Reorganize into more, smaller files [alerts] - 10https://gerrit.wikimedia.org/r/805241 (owner: 10BCornwall) [13:33:27] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10fgiunchedi) >>! In T307184#8001969, @Jgiannelos wrote: > Hey @fgiunchedi the size of the current active deployment is stabilized... 
[13:35:12] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:02] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:04] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:45] (03PS1) 10Muehlenhoff: Fix group membership [puppet] - 10https://gerrit.wikimedia.org/r/805387 (https://phabricator.wikimedia.org/T309383) [13:43:02] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:22] (03CR) 10Muehlenhoff: [C: 03+2] Fix group membership [puppet] - 10https://gerrit.wikimedia.org/r/805387 (https://phabricator.wikimedia.org/T309383) (owner: 10Muehlenhoff) [13:44:47] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (10SLyngshede-WMF) I notice that Goran has access to analytics_privatedata_users, is that still required? [13:44:54] RECOVERY - dhclient process on aqs2012 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [13:45:00] (03PS1) 10Slyngshede: Update email address for goransm. [puppet] - 10https://gerrit.wikimedia.org/r/805389 [13:46:11] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to PII in Superset for TheresNoTime - https://phabricator.wikimedia.org/T309383 (10MoritzMuehlenhoff) Sorry, there was something still missing. 
It should be fixed now, I have just merged the patch (but it will take up to 30 minutes for Puppe... [13:47:19] (03PS2) 10Slyngshede: Update email address for goransm. [puppet] - 10https://gerrit.wikimedia.org/r/805389 (https://phabricator.wikimedia.org/T310055) [13:49:04] RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:32] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:08] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (10GoranSMilovanovic) @SLyngshede-WMF > I notice that Goran has access to analytics_privatedata_users, is that still required? I think it is. Na... [13:50:56] (03PS4) 10Slyngshede: Shell access for xcollazo [puppet] - 10https://gerrit.wikimedia.org/r/805322 (https://phabricator.wikimedia.org/T310555) [13:54:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2005.codfw.wmnet with OS buster [13:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:09] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (10SLyngshede-WMF) Then let's not revoke that :-) I think it's just the email that needed to be updated then. 
[13:56:20] (03CR) 10BCornwall: "Thank you for the eagle eyes" [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [13:56:27] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/805322 (https://phabricator.wikimedia.org/T310555) (owner: 10Slyngshede) [14:00:16] RECOVERY - cassandra-a CQL 10.192.0.218:9042 on aqs2003 is OK: TCP OK - 0.033 second response time on 10.192.0.218 port 9042 https://phabricator.wikimedia.org/T93886 [14:01:58] 10SRE, 10Traffic, 10serviceops: fawiki user reports getting 503 errors with message "upstream connect error or disconnect before headers" - https://phabricator.wikimedia.org/T310450 (10CDanis) This error message comes from [[ https://www.envoyproxy.io/ | Envoy ]], which we use for internal cross-service TLS... [14:02:56] RECOVERY - cassandra-a service on aqs2004 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:03:28] (03CR) 10Hnowlan: P:maps::osm_replica fix prom-replication lag script. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805381 (owner: 10Slyngshede) [14:03:34] RECOVERY - cassandra-a SSL 10.192.0.220:7001 on aqs2004 is OK: SSL OK - Certificate aqs2004-a valid until 2024-06-07 14:43:44 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:04:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet [14:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:56] RECOVERY - cassandra-a CQL 10.192.0.220:9042 on aqs2004 is OK: TCP OK - 0.032 second response time on 10.192.0.220 port 9042 https://phabricator.wikimedia.org/T93886 [14:05:16] RECOVERY - cassandra-a SSL 10.192.0.218:7001 on aqs2003 is OK: SSL OK - Certificate aqs2003-a valid until 2024-06-07 14:43:39 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:05:32] RECOVERY - cassandra-b SSL 10.192.0.221:7001 on aqs2004 is OK: SSL OK - Certificate aqs2004-b valid until 2024-06-07 14:43:46 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:05:58] RECOVERY - cassandra-b service on aqs2004 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:06:20] RECOVERY - cassandra-b CQL 10.192.0.221:9042 on aqs2004 is OK: TCP OK - 0.032 second response time on 10.192.0.221 port 9042 https://phabricator.wikimedia.org/T93886 [14:06:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet [14:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:36] RECOVERY - cassandra-b CQL 10.192.0.219:9042 on aqs2003 is OK: TCP OK - 0.032 second response time on 10.192.0.219 port 9042 https://phabricator.wikimedia.org/T93886 [14:09:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single 
for host ganeti4002.ulsfo.wmnet [14:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:10] RECOVERY - cassandra-b SSL 10.192.0.219:7001 on aqs2003 is OK: SSL OK - Certificate aqs2003-b valid until 2024-06-07 14:43:41 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:10:45] 10SRE, 10Analytics-Radar, 10Domains, 10Traffic-Icebox, 10WMF-General-or-Unknown: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (10Nemo_bis) Is this related to https://phabricator.wikimedia.org/T255366 ? [14:10:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2006.codfw.wmnet with OS buster [14:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [14:11:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:12:06] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2007.codfw.wmnet with OS buster [14:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for 
host ganeti4002.ulsfo.wmnet [14:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2008.codfw.wmnet with OS buster [14:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2011.codfw.wmnet with OS buster [14:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:52] (03PS1) 10Jforrester: Configure FilterProfiler cache separately [extensions/AbuseFilter] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805361 (https://phabricator.wikimedia.org/T212129) [14:15:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet [14:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:08] (03PS1) 10Filippo Giunchedi: Enforce alert names with no spaces [alerts] - 10https://gerrit.wikimedia.org/r/805393 [14:16:36] (03CR) 10Filippo Giunchedi: "This is already the case, add a CI check too" [alerts] - 10https://gerrit.wikimedia.org/r/805393 (owner: 10Filippo Giunchedi) [14:16:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2009.codfw.wmnet with OS buster [14:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:46] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Jdforrester-WMF) [14:16:50] PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
failoid1002.eqiad.wmnet [14:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:02] RECOVERY - cassandra-a CQL 10.192.16.186:9042 on aqs2007 is OK: TCP OK - 0.032 second response time on 10.192.16.186 port 9042 https://phabricator.wikimedia.org/T93886 [14:17:04] RECOVERY - cassandra-a service on aqs2007 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:18:10] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2010.codfw.wmnet with OS buster [14:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:46] (JobUnavailable) resolved: Reduced availability for job cassandra in analytics@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:04] 10SRE, 10SRE-swift-storage, 10Community-Tech, 10MediaWiki-Parser, and 3 others: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (10Winston_Sung) [14:19:06] RECOVERY - cassandra-b CQL 10.192.16.187:9042 on aqs2007 is OK: TCP OK - 0.032 second response time on 10.192.16.187 port 9042 https://phabricator.wikimedia.org/T93886 [14:20:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2012.codfw.wmnet with OS buster [14:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:10] 10SRE, 10SRE-Access-Requests: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10hashar) [14:22:59] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host ganeti4003.ulsfo.wmnet [14:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:35] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to PII in Superset for TheresNoTime - https://phabricator.wikimedia.org/T309383 (10TheresNoTime) 05Open→03Resolved All working, thank you for your help @MoritzMuehlenhoff! :) [14:29:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4003.ulsfo.wmnet [14:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:08] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 7 hosts with reason: reboots [14:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: reboots [14:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:32] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1051.eqiad.wmnet [14:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:08] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:52] (03PS1) 10Btullis: Promote the datahub services to production [puppet] - 10https://gerrit.wikimedia.org/r/805395 (https://phabricator.wikimedia.org/T303049) [14:36:06] RECOVERY - cassandra-a CQL 10.192.48.198:9042 on aqs2012 is OK: TCP OK - 0.033 second response time on 10.192.48.198 port 9042 https://phabricator.wikimedia.org/T93886 [14:37:23] (03CR) 10David Caro: [C: 03+1] "LGTM, just a nit if you want" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/805383 (owner: 10Filippo Giunchedi) [14:38:45] jouncebot: nowandnext [14:38:45] No deployments scheduled for the next 1 hour(s) and 
21 minute(s) [14:38:45] In 1 hour(s) and 21 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220614T1600) [14:40:08] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:31] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 7 hosts with reason: reboots [14:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: reboots [14:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:02] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:18] (03PS1) 10Tchanders: WIP Set $wgSimilarEditorsApiUrl to URL for Similarusers service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805396 [14:43:43] (03PS1) 10Urbanecm: Add new throttle rule + remove expired one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805398 (https://phabricator.wikimedia.org/T310625) [14:44:26] (03CR) 10Urbanecm: [C: 03+2] Add new throttle rule + remove expired one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805398 (https://phabricator.wikimedia.org/T310625) (owner: 10Urbanecm) [14:45:12] (03Merged) 10jenkins-bot: Add new throttle rule + remove expired one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805398 (https://phabricator.wikimedia.org/T310625) (owner: 10Urbanecm) [14:47:23] (03PS2) 10ArielGlenn: Point Wikimedia Enterprise HTML Dumps to trial API features [puppet] - 10https://gerrit.wikimedia.org/r/805223 (https://phabricator.wikimedia.org/T310075) (owner: 10Marcelo1251) [14:49:03] (03CR) 10ArielGlenn: [C: 03+2] Point Wikimedia Enterprise HTML 
Dumps to trial API features [puppet] - 10https://gerrit.wikimedia.org/r/805223 (https://phabricator.wikimedia.org/T310075) (owner: 10Marcelo1251) [14:49:17] (03PS2) 10Filippo Giunchedi: am: use SafeLoader for team regexes [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/805383 [14:49:19] (03PS2) 10Filippo Giunchedi: am: retry on CGI failure or empty output [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/805384 (https://phabricator.wikimedia.org/T310331) [14:49:45] (03CR) 10Filippo Giunchedi: am: use SafeLoader for team regexes (031 comment) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/805383 (owner: 10Filippo Giunchedi) [14:49:48] !log urbanecm@deploy1002 Synchronized wmf-config/throttle.php: 596058b5e4d906d40e620fe5b01f37c484f5a8c1: Add new throttle rule + remove expired one (T310625) (duration: 03m 38s) [14:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:54] T310625: Request a throttle lift for Czech editaton in Prague -- 2022-06-20 - https://phabricator.wikimedia.org/T310625 [14:52:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:53:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:12] !log failover ganeti master in ulsfo to ganeti4003 [14:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:08] 
RECOVERY - Check systemd state on maps2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:10] PROBLEM - Host mc1051 is DOWN: PING CRITICAL - Packet loss = 100% [14:57:17] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/805395 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [14:57:18] PROBLEM - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/805389 (https://phabricator.wikimedia.org/T310055) (owner: 10Slyngshede) [14:58:14] PROBLEM - ganeti-wconfd running on ganeti4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:59:06] (03PS3) 10Muehlenhoff: data.yaml: Update realname [puppet] - 10https://gerrit.wikimedia.org/r/805380 (owner: 10Samtar) [15:00:02] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:04] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:16] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: logster-badpass_priv.service,logster-csp.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:10] (03CR) 10Muehlenhoff: [C: 03+2] data.yaml: Update realname [puppet] - 10https://gerrit.wikimedia.org/r/805380 (owner: 10Samtar) [15:06:40] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:04] RECOVERY - Host mc1051 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [15:07:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (10Ottomata) [15:07:12] (03PS5) 10Ottomata: Shell access for xcollazo [puppet] - 10https://gerrit.wikimedia.org/r/805322 (https://phabricator.wikimedia.org/T310555) (owner: 10Slyngshede) [15:07:20] (03CR) 10JMeybohm: [C: 03+1] "jayme@cp2031:~$ host datahub-frontend.discovery.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/805331 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [15:09:00] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1051.eqiad.wmnet [15:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:13] (03PS6) 10Ottomata: Shell access for xcollazo [puppet] - 10https://gerrit.wikimedia.org/r/805322 (https://phabricator.wikimedia.org/T310555) (owner: 10Slyngshede) [15:11:29] ACKNOWLEDGEMENT - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service Hnowlan fix in review https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:29] ACKNOWLEDGEMENT - Check systemd state on maps1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
An error occured trying to list the failed units Hnowlan fix in review https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:29] ACKNOWLEDGEMENT - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service Hnowlan fix in review https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:29] ACKNOWLEDGEMENT - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service Hnowlan fix in review https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:29] ACKNOWLEDGEMENT - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service Hnowlan fix in review https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:29] ACKNOWLEDGEMENT - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service Hnowlan fix in review https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:29] ACKNOWLEDGEMENT - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service Hnowlan fix in review https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:30] ACKNOWLEDGEMENT - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service Hnowlan fix in review https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:30] ACKNOWLEDGEMENT - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service Hnowlan fix in review https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:31] ACKNOWLEDGEMENT - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The following 
units failed: prometheus-pg-replication-lag.service Hnowlan fix in review https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:00] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Shell access for xcollazo [puppet] - 10https://gerrit.wikimedia.org/r/805322 (https://phabricator.wikimedia.org/T310555) (owner: 10Slyngshede) [15:18:44] (03CR) 10David Caro: [C: 03+1] "LGTM, feel free to ignore the nits." [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/805384 (https://phabricator.wikimedia.org/T310331) (owner: 10Filippo Giunchedi) [15:19:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host elastic2053.codfw.wmnet [15:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:14] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host elastic2053.codfw.wmnet [15:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:01] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (10Andrew) Thank you @BTullis ! Please let me know if this turns into a long-term project in which case I'll just plan to build these hosts... 
[15:34:37] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2053.codfw.wmnet with OS bullseye [15:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:47] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye [15:35:22] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (10Ottomata) [15:43:04] (03PS1) 10Ottomata: Add xcollazo to platform-engineering posix group [puppet] - 10https://gerrit.wikimedia.org/r/805410 (https://phabricator.wikimedia.org/T310555) [15:44:29] (03CR) 10CI reject: [V: 04-1] Add xcollazo to platform-engineering posix group [puppet] - 10https://gerrit.wikimedia.org/r/805410 (https://phabricator.wikimedia.org/T310555) (owner: 10Ottomata) [15:48:00] (03PS2) 10Ottomata: Add xcollazo to platform-engineering posix group [puppet] - 10https://gerrit.wikimedia.org/r/805410 (https://phabricator.wikimedia.org/T310555) [15:52:41] (03CR) 10Ottomata: [C: 03+2] Add xcollazo to platform-engineering posix group [puppet] - 10https://gerrit.wikimedia.org/r/805410 (https://phabricator.wikimedia.org/T310555) (owner: 10Ottomata) [15:53:03] (03PS1) 10Andrea Denisse: pontoon: Add netmon02 [puppet] - 10https://gerrit.wikimedia.org/r/805414 [15:53:58] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (10Ottomata) 05Open→03Resolved Thanks all! I was in a meeting with Xabriel so expedited this. Confirmed that Xabriel has access now. Resolving. 
[15:54:13] (03CR) 10Filippo Giunchedi: [C: 03+1] pontoon: Add netmon02 [puppet] - 10https://gerrit.wikimedia.org/r/805414 (owner: 10Andrea Denisse) [15:55:36] (03PS2) 10Andrea Denisse: pontoon: Add netmon02 [puppet] - 10https://gerrit.wikimedia.org/r/805414 [15:58:14] (03CR) 10Brennen Bearnes: [C: 03+2] "Merging as wmf.16 is not yet checked out on deploy1002; shouldn't require deployment." [extensions/AbuseFilter] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805361 (https://phabricator.wikimedia.org/T212129) (owner: 10Jforrester) [15:58:41] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2053.codfw.wmnet with reason: host reimage [15:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:38] (03CR) 10Andrea Denisse: [C: 03+2] pontoon: Add netmon02 [puppet] - 10https://gerrit.wikimedia.org/r/805414 (owner: 10Andrea Denisse) [16:00:04] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220614T1600). [16:00:04] Lucas_WMDE: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:14] o/
[16:00:32] I just found two old branches in my puppet.git and figured I’d rebase them and add them to the window ^^
[16:00:45] neither is required for anything, just nice-to-haves
[16:00:48] fyi we get a ack looking now Lucas_WMDE
[16:01:16] beaten to it, thanks jbond :)
[16:01:50] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2053.codfw.wmnet with reason: host reimage
[16:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:13] (03CR) 10Jbond: [C: 03+2] puppet_alert: Improve message [puppet] - 10https://gerrit.wikimedia.org/r/791559 (owner: 10Lucas Werkmeister (WMDE))
[16:02:21] (03CR) 10Jbond: [C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/791559 (owner: 10Lucas Werkmeister (WMDE))
[16:02:33] np :)
[16:03:21] Lucas_WMDE: just going to ask someone in cloud to give the wmcs a sanity and make sure they are aware
[16:03:29] ok
[16:05:13] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan)
[16:05:18] !log jnuche@deploy1002 Installing scap version "4.9.2" for 557 hosts
[16:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:57] (03CR) 10Majavah: [C: 03+1] maintain-meta_p: stop reading VariantSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/665116 (owner: 10Lucas Werkmeister (WMDE))
[16:06:53] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:07:56] (03CR) 10Jbond: [C: 03+2] maintain-meta_p: stop reading VariantSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/665116 (owner: 10Lucas Werkmeister (WMDE))
[16:08:28] Lucas_WMDE: both merged thanks
[16:09:05] thanks!
[16:11:53] !log jnuche@deploy1002 Installing scap version "4.9.2" for 557 hosts [16:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:13] !log jnuche@deploy1002 Installation of scap version "4.9.2" completed for 557 hosts [16:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:35] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1052.eqiad.wmnet [16:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:09] (03Merged) 10jenkins-bot: Configure FilterProfiler cache separately [extensions/AbuseFilter] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805361 (https://phabricator.wikimedia.org/T212129) (owner: 10Jforrester) [16:18:06] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1052.eqiad.wmnet [16:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:03] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:20:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:20:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:00] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:40] (03CR) 10Btullis: [C: 03+2] Promote the datahub services to production [puppet] - 10https://gerrit.wikimedia.org/r/805395 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [16:23:03] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2053.codfw.wmnet with OS bullseye [16:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:12] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye completed... [16:24:50] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:25:08] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10observability, 10User-jbond: Add monitoring for t he puppet-netbox repository - https://phabricator.wikimedia.org/T310639 (10jbond) [16:25:23] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10observability, 10User-jbond: Add monitoring for t he puppet-netbox repository - https://phabricator.wikimedia.org/T310639 (10jbond) p:05Triage→03Medium [16:25:58] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/805419 [16:33:47] (03CR) 10Ahmon Dancy: [C: 03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/805419 (owner: 10Ahmon Dancy) [16:33:53] 
10Puppet, 10SRE, 10Infrastructure-Foundations, 10observability, 10User-jbond: Add monitoring for the puppet-netbox repository - https://phabricator.wikimedia.org/T310639 (10Ilovemydoodle2) [16:35:11] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/805419 (owner: 10Ahmon Dancy) [16:38:01] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10brennen) [16:43:06] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:44] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:01] (03CR) 10Milimetric: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) (owner: 10Milimetric) [16:55:58] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:56:33] (03PS3) 10Milimetric: Split up the tables we sqoop [puppet] - 10https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) [16:59:28] (03CR) 10CI reject: [V: 04-1] Split up the tables we sqoop [puppet] - 10https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) (owner: 10Milimetric) [17:01:10] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:28] (03PS1) 10Ssingh: aptrepo: add repository component for bird2 [puppet] - 10https://gerrit.wikimedia.org/r/805448 (https://phabricator.wikimedia.org/T310574) [17:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:08:23] (03PS3) 10Jdlrobson: Turn off TOC A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805179 (https://phabricator.wikimedia.org/T309683) (owner: 10Clare Ming) [17:08:50] (03CR) 10Jdlrobson: [C: 03+1] Turn off TOC A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805179 (https://phabricator.wikimedia.org/T309683) (owner: 10Clare Ming) [17:12:58] !log train 1.39.0-wmf.16 (T308069): train is blocked - will sync to testwikis and hold there for resolution of T310532 [17:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:04] T308069: 1.39.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T308069 [17:13:04] T310532: Investigate McRouter GET request spike from wmf.15 - https://phabricator.wikimedia.org/T310532 [17:15:04] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:59] 10SRE, 10ops-eqiad, 10DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (10Cmjohnson) @marostegui at first glance the server was hanging up during the boot process at memory configuration, I did not get any hardware errors, I put the server down to minimum post requirements (1 CPU and 1... 
[17:20:35] (03PS1) 10Brennen Bearnes: testwikis wikis to 1.39.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805449 [17:20:37] (03CR) 10Brennen Bearnes: [C: 03+2] testwikis wikis to 1.39.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805449 (owner: 10Brennen Bearnes) [17:21:20] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805449 (owner: 10Brennen Bearnes) [17:22:19] !log brennen@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.16 [17:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:23:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:37] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1053.eqiad.wmnet [17:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:02] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:27:38] RECOVERY - Host cp1089 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [17:27:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1089 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T310387 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson The server 
was out of warranty, I swapped DIMM B1 with a DIMM from a spare. Server booted, no issues. [17:29:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:30:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:50] (03PS4) 10Milimetric: Split up the tables we sqoop [puppet] - 10https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) [17:32:56] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:16] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:36:04] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:38:04] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:30] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[17:41:21] 10SRE, 10Shellbox, 10serviceops: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10CDanis) My 2 cents: * Allowing Shellbox to burst beyond its cpu limit seems like the right first, easy thing to try. There's little risk to enabling this for a few services, and (AFAIK?) depending o... [17:41:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1089 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T310387 (10ssingh) >>! In T310387#8003619, @Cmjohnson wrote: > The server was out of warranty, I swapped DIMM B1 with a DIMM from a spare. Server booted, no issues. Thanks for your help @Cmjohnson! [17:45:00] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:50] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:46:20] PROBLEM - Host mc1053 is DOWN: PING CRITICAL - Packet loss = 100% [17:47:39] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Cmjohnson) 05Open→03Resolved The motherboard was replaced, and after fixing some bios settings the server is now back online. Thanks for your patience. 
[17:47:42] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10Cmjohnson) [17:49:06] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:11] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.16 (duration: 32m 52s) [17:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:07] !log brennen@deploy1002 Pruned MediaWiki: 1.39.0-wmf.14 (duration: 01m 53s) [17:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:40] RECOVERY - Host mc1053 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [17:58:46] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:05] brennen and jeena: #bothumor I ♥ Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220614T1800). [18:00:16] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1053.eqiad.wmnet [18:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:29] o/ - train's still blocked, holding at test wikis until unblocked.
[18:03:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:03:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:06] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:08] RECOVERY - Check systemd state on maps2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:05:19] (03CR) 10Milimetric: "spdx headers added, I think this is ready to merge" [puppet] - 10https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) (owner: 10Milimetric) [18:06:04] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:08:48] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:08:50] PROBLEM - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:09:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:10] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [18:12:10] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:14:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:56] !log ayounsi@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=imagescaler-ro,name=codfw [18:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:31] er, I didn't mean to do that [18:16:52] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:20:22] nevermind, it didn't do anything [18:21:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:21:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:48] PROBLEM - Host ms-be1059 is DOWN: PING CRITICAL - Packet loss = 100% [18:27:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:56] RECOVERY - Host ms-be1059 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [18:28:02] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:28] !log cmjohnson@cumin1001 START 
- Cookbook sre.hosts.reimage for host backup1009.eqiad.wmnet with OS bullseye [18:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye [18:30:35] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1009.eqiad.wmnet with OS bullseye [18:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye execut... 
[18:31:40] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:58] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:39:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1016.eqiad.wmnet with OS buster [18:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:31] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1016.eqiad.wmnet with OS buster [18:41:02] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:41:46] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:47:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1017.eqiad.wmnet with OS buster [18:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:54] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1017.eqiad.wmnet with OS buster [18:47:58] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[18:49:27] ^ wuh oh, if we end up having to depool another maps host that might be a problem [18:49:29] cc hnowlan [18:51:08] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1018.eqiad.wmnet with OS buster [18:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:36] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1018.eqiad.wmnet with OS buster [18:52:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1019.eqiad.wmnet with OS buster [18:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:09] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1019.eqiad.wmnet with OS buster [18:52:10] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1019.eqiad.wmnet with OS buster [18:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:16] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host aqs1019.eqiad.wmnet with OS buster executed with errors: - aqs1019... 
[18:52:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1020.eqiad.wmnet with OS buster [18:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:44] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1020.eqiad.wmnet with OS buster [18:53:08] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1021.eqiad.wmnet with OS buster [18:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:13] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1021.eqiad.wmnet with OS buster [18:57:04] RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:48] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:02] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:03:41] PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:06:41] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[19:08:04] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:42] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1054.eqiad.wmnet [19:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:06] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:17] 10SRE, 10SRE-Access-Requests: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10hashar) [19:13:10] (03CR) 10Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: 10Legoktm) [19:14:44] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:34] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1054.eqiad.wmnet [19:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:44] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:16] (03PS1) 10Ahmon Dancy: Allow deployers to sudo -u mwpresync [puppet] - 10https://gerrit.wikimedia.org/r/805464 (https://phabricator.wikimedia.org/T303857) [19:24:40] (03PS8) 10Legoktm: Add profile::mediawiki::sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/804800 [19:24:42] (03PS7) 10Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - 
10https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) [19:24:44] (03PS1) 10Legoktm: mediawiki: Fix updatequerypages jobs for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/805465 [19:26:09] (03PS8) 10Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - 10https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) [19:26:11] (03PS2) 10Legoktm: mediawiki: Fix updatequerypages jobs for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/805465 [19:27:25] (03CR) 10Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: 10Legoktm) [19:28:02] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:08] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:38] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: New Kernel [19:32:39] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: New Kernel [19:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:42] (03PS2) 10BCornwall: Traffic: add varnishkafka delivery error alarms [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) [19:34:03] (03CR) 10BCornwall: Traffic: add varnishkafka delivery error alarms (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [19:34:40] PROBLEM - Check systemd state on maps1007 is 
CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:16] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel [19:36:18] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel [19:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:44] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:39:06] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:40:23] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mirror1001.wikimedia.org with reason: New Kernel [19:40:24] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mirror1001.wikimedia.org with reason: New Kernel [19:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:10] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:45:44] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:48] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:48] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1016.eqiad.wmnet with OS buster [19:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:53] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host aqs1016.eqiad.wmnet with OS buster executed with errors: - aqs1016... [19:56:10] RECOVERY - Check systemd state on maps1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:05] RoanKattouw, Urbanecm, and cjming: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220614T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:00:18] indeed, nothing to do! [20:01:13] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1017.eqiad.wmnet with OS buster [20:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:19] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host aqs1017.eqiad.wmnet with OS buster executed with errors: - aqs1017... 
[20:02:06] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:50] PROBLEM - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:40] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1018.eqiad.wmnet with OS buster [20:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:45] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host aqs1018.eqiad.wmnet with OS buster executed with errors: - aqs1018... [20:05:11] (03PS1) 10Ayounsi: Netbox: only run CSV dumps on active server [puppet] - 10https://gerrit.wikimedia.org/r/805468 (https://phabricator.wikimedia.org/T296452) [20:06:27] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1020.eqiad.wmnet with OS buster [20:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:33] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host aqs1020.eqiad.wmnet with OS buster executed with errors: - aqs1020... 
[20:06:53] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1021.eqiad.wmnet with OS buster [20:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:58] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host aqs1021.eqiad.wmnet with OS buster executed with errors: - aqs1021... [20:08:41] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:51] (03CR) 10Brennen Bearnes: [C: 03+1] Allow deployers to sudo -u mwpresync [puppet] - 10https://gerrit.wikimedia.org/r/805464 (https://phabricator.wikimedia.org/T303857) (owner: 10Ahmon Dancy) [20:11:53] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/35851/" [puppet] - 10https://gerrit.wikimedia.org/r/805468 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [20:16:44] (03CR) 10Dzahn: "This is an access request / a change to an admin group. If it's just adding a new user to an existing group and has approval from the grou" [puppet] - 10https://gerrit.wikimedia.org/r/805464 (https://phabricator.wikimedia.org/T303857) (owner: 10Ahmon Dancy) [20:16:45] 10SRE, 10ops-eqiad, 10DC-Ops: Q4: rack/setup/install dse-k8s-worker100[5-8] - https://phabricator.wikimedia.org/T307400 (10Jclark-ctr) @BTullis Hey Ben as of right now i do not have any space in 10g racks for 2u in A-D can all 4 of these be in row E & F not sharing racks ? dse-k8s-worker1005 -> row e1 dse-... 
[20:16:57] (03CR) 10BCornwall: "Considering this seems to be part of analytics, would it make sense to append this to team-data-engineering/varnishkafka.yaml and changing" [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [20:18:22] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:18:24] (03CR) 10Daniel Kinzler: [C: 04-2] "Blocking this because we probably just want to remove rpc/RunJobs.php. Want to make a patch for that?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793837 (owner: 10D3r1ck01) [20:18:44] (03CR) 10BCornwall: Traffic: add varnishkafka delivery error alarms (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [20:18:47] (03CR) 10Ahmon Dancy: Allow deployers to sudo -u mwpresync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805464 (https://phabricator.wikimedia.org/T303857) (owner: 10Ahmon Dancy) [20:18:48] urbanecm are you still available to deploy some patches? [20:19:04] DannyS712: if they're simple ones :)) [20:19:10] phpcs cleanup, so yes [20:19:21] can you add them to the calendar? 
[20:19:25] doing now [20:19:36] (03CR) 10Eevans: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey) [20:19:52] (03CR) 10DannyS712: phpcs: move AssignmentInControlStructures exclusion inline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/796360 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:19:55] (03PS7) 10DannyS712: phpcs: move AssignmentInControlStructures exclusion inline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/796360 (https://phabricator.wikimedia.org/T171115) [20:20:02] (03PS4) 10DannyS712: phpcs: move Misleading$wgDebugLogFile exclusion inline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802840 (https://phabricator.wikimedia.org/T171115) [20:20:03] ping me once it's there :) [20:20:39] urbanecm added all 6 [20:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:21:00] (03PS5) 10DannyS712: phpcs: enable and fix FunctionComment.WrongStyle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802841 (https://phabricator.wikimedia.org/T171115) [20:21:11] (03PS8) 10DannyS712: phpcs: enable and configure ValidGlobalName.allowedPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802842 (https://phabricator.wikimedia.org/T171115) [20:21:14] DannyS712: while I'm deploying them, by any chance, do you know why would TextInputWidget behaves like in https://ctrlv.tv/DAE9? Patch's https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/805470. 
[20:21:54] will take a look [20:21:57] (03CR) 10DannyS712: phpcs: enable and configure PrefixedGlobalFunctions.allowedPrefix (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802946 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:22:00] (03PS5) 10DannyS712: phpcs: enable and configure PrefixedGlobalFunctions.allowedPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802946 (https://phabricator.wikimedia.org/T171115) [20:22:04] (03CR) 10Urbanecm: [C: 03+2] phpcs: move AssignmentInControlStructures exclusion inline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/796360 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:22:06] (03PS5) 10DannyS712: phpcs: enable and fix MisleadingGlobalNames.Misleading$wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802947 (https://phabricator.wikimedia.org/T171115) [20:22:12] (03CR) 10Urbanecm: [C: 03+2] phpcs: move Misleading$wgDebugLogFile exclusion inline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802840 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:22:28] (03CR) 10Urbanecm: [C: 03+2] phpcs: enable and fix FunctionComment.WrongStyle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802841 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:22:35] (03CR) 10Urbanecm: [C: 03+2] phpcs: enable and configure ValidGlobalName.allowedPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802842 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:23:16] what specifically is the issue with the text input widget? 
[20:23:27] (03Merged) 10jenkins-bot: phpcs: move AssignmentInControlStructures exclusion inline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/796360 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:23:39] (03Merged) 10jenkins-bot: phpcs: move Misleading$wgDebugLogFile exclusion inline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802840 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:23:42] (03Merged) 10jenkins-bot: phpcs: enable and fix FunctionComment.WrongStyle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802841 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:23:49] DannyS712: if you focus on the label (the one telling you how many characters you have left), it covers the field content, until i added the space. [20:23:49] (03Merged) 10jenkins-bot: phpcs: enable and configure ValidGlobalName.allowedPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802842 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:24:01] looks like this at the begining https://usercontent.irccloud-cdn.com/file/5hmyKnfO/image.png [20:24:17] ah, got it [20:24:31] (03CR) 10Urbanecm: [C: 03+2] phpcs: enable and configure PrefixedGlobalFunctions.allowedPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802946 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:24:37] (03CR) 10Urbanecm: [C: 03+2] phpcs: enable and fix MisleadingGlobalNames.Misleading$wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802947 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:25:26] (03Merged) 10jenkins-bot: phpcs: enable and configure PrefixedGlobalFunctions.allowedPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802946 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:25:29] (03Merged) 10jenkins-bot: phpcs: enable and fix MisleadingGlobalNames.Misleading$wgConf [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/802947 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:27:53] syncing the changes [20:29:03] 10SRE-Access-Requests, 10Infrastructure-Foundations, 10Release-Engineering-Team (Deployment Autopilot 🛩️): Allow deployers to sudo -u mwpresync - https://phabricator.wikimedia.org/T310654 (10dancy) [20:29:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:34] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10Release-Engineering-Team (Deployment Autopilot 🛩️): Allow deployers to sudo -u mwpresync - https://phabricator.wikimedia.org/T310654 (10dancy) [20:30:27] (03PS2) 10Ahmon Dancy: Allow deployers to sudo -u mwpresync [puppet] - 10https://gerrit.wikimedia.org/r/805464 (https://phabricator.wikimedia.org/T310654) [20:31:04] (03CR) 10Ahmon Dancy: Allow deployers to sudo -u mwpresync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805464 (https://phabricator.wikimedia.org/T310654) (owner: 10Ahmon Dancy) [20:31:10] !log urbanecm@deploy1002 Synchronized wmf-config/: phpcs cleanups (T171115; no-op for production) (duration: 03m 38s) [20:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:15] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [20:32:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1016.eqiad.wmnet with OS buster [20:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:56] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1016.eqiad.wmnet with OS buster [20:33:01] !log cmjohnson@cumin1001 END (FAIL) - 
Cookbook sre.hosts.reimage (exit_code=99) for host aqs1016.eqiad.wmnet with OS buster [20:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:06] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:06] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host aqs1016.eqiad.wmnet with OS buster executed with errors: - aqs1016... [20:33:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:33:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:44] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:05] 10SRE, 10ops-eqiad, 10DC-Ops: Q4: rack/setup/install dse-k8s-worker100[5-8] - https://phabricator.wikimedia.org/T307400 (10Jclark-ctr) [20:34:06] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:39] !log urbanecm@deploy1002 Synchronized multiversion/: phpcs cleanups (T171115; no-op for production) (duration: 03m 28s) [20:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:48] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Cmjohnson) @Eevans we can change these to Buster. 
I do not know why they failed installation, I monitored the entire time, the OS install on the hard drives but it... [20:36:12] (03CR) 10Thcipriani: [C: 03+1] Allow deployers to sudo -u mwpresync [puppet] - 10https://gerrit.wikimedia.org/r/805464 (https://phabricator.wikimedia.org/T310654) (owner: 10Ahmon Dancy) [20:36:36] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:36:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:38] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:37:55] !log urbanecm@deploy1002 Synchronized w/: phpcs cleanups (T171115; no-op for production) (duration: 03m 15s) [20:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:58] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [20:40:04] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:15] urbanecm I tried out the live preview at https://doc.wikimedia.org/oojs-ui/master/js/#!/api/OO.ui.TextInputWidget and for me the label always remained the way you had it at the start (over the text and basically unreadable) until I added a setTimeout() call to change the label, that fixed it. So my guess is that it only displays correctly once the [20:41:15] label renders a second time (since if the setTimeout() sets the label to the same as the existing text it doesn't work). 
I suggest adding `setTimeout( this.updateRemainingMessageLength.bind( this ) )` instead of calling `this.updateRemainingMessageLength()` directly. It wasn't because you changed the text, but rather because the label updated due [20:41:15] to the change in the number of characters [20:41:36] !log urbanecm@deploy1002 Synchronized docroot/: phpcs cleanups (T171115; no-op for production) (duration: 03m 41s) [20:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:43:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:47] DannyS712: thanks for looking at it! Can you clarify what does `this.updateRemainingMessageLength.bind( this )` do and why is it necessary to wrap it in setTimeout () with no delay (AFAIK, that means "execute immediately"?)? It's not clear to me. [20:44:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:21] DannyS712: fyi, all patches should be deployed. [20:44:48] still need to sync tests/ and phpcs.xml ? [20:45:40] i routinely ignore those two folders (they're not loaded by anything in prod) [20:46:34] `this.updateRemainingMessageLength.bind( this )` means "the function this.updateRemainingMessageLength with the *current* `this` as the value of `this` when the function is executed". 
Its a fancier way of writing `const that = this; setTimeout( function () { that.updateRemainingMessageLength(); } );`. See [20:46:35] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_objects/Function/bind. As for why the setTimeout() makes it work, no idea? Maybe because it means that it gets executed after any existing code in the stack is executed, which means that it renders the element, then updates the label, which fixes it? [20:46:49] re syncing, okay, just wanted to make sure [20:47:00] no problem, thanks for double checking what i did [20:48:00] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:48:55] DannyS712: thanks for the clarification, that does make sense. I'll try it and see. [20:50:56] (03CR) 10Ayounsi: [V: 03+1] "If I understand correctly, until this is deployed, we're running CSV dumps from 4 different servers against netbox.wikimedia.org (and thus" [puppet] - 10https://gerrit.wikimedia.org/r/805468 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [20:51:22] (03PS1) 10Andrew Bogott: galera nodecheck.sh: don't wipe out errlog on each run [puppet] - 10https://gerrit.wikimedia.org/r/805475 [20:54:16] (03PS1) 10TheDJ: Use the PDF cropbox for rendering [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/805476 (https://phabricator.wikimedia.org/T167420) [20:54:31] (03CR) 10Andrew Bogott: [C: 03+2] galera nodecheck.sh: don't wipe out errlog on each run [puppet] - 10https://gerrit.wikimedia.org/r/805475 (owner: 10Andrew Bogott) [20:56:57] DannyS712: hmm, if i put 10s delay to setTimeout, open the dialog and wait, it works. if i just wrap it in setTimeout, it doesn't :/. putting it to getSetupProcess helps though, looks to be late enough for the label update to work properly. 
thanks again :) [21:03:39] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2001.codfw.wmnet [21:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:07] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10Dzahn) The other day I have deleted cpt-leads@ (after Tim told me it's ok and not used anymore since a while) and techcom@ (after asking ITS to create it on... [21:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:09:58] (03PS1) 10Andrew Bogott: OpenStack haproxy: make Galera checks half as often [puppet] - 10https://gerrit.wikimedia.org/r/805478 [21:10:01] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2001.codfw.wmnet [21:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:37] > hmm, if i put 10s delay to setTimeout, open the dialog and wait, it works. if i just wrap it in setTimeout, it doesn't :/. putting it to getSetupProcess helps though, looks to be late enough for the label update to work properly. thanks again :) [21:10:38] 10s feels like a lot, not sure why the delay is needed, but surely 1 second should be enough? [21:10:46] (03CR) 10DannyS712: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/805430 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [21:11:25] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack haproxy: make Galera checks half as often [puppet] - 10https://gerrit.wikimedia.org/r/805478 (owner: 10Andrew Bogott) [21:11:34] urbanecm also I know the deployment window is over, but any chance you want to merge a few more phpcs cleanups? Deployment calendar is clear for the next 10 hours [21:12:29] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2002.codfw.wmnet [21:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:51] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2002.codfw.wmnet [21:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:42] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:20:23] (03PS1) 10MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) [21:21:21] 10SRE, 10Deployments, 10bacula, 10Parsoid (Tracking), 10Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10ssastry) [21:23:30] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2003.codfw.wmnet [21:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:53] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2003.codfw.wmnet [21:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:18] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1001.eqiad.wmnet [21:32:20] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:06] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:33:16] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:35:03] DannyS712: sorry, was afk. which ones? :) [21:35:15] 10s was only so i manage to open the dialog in time [21:37:32] even when it's 10s, but in initialize, it doesn't work. if i add it to getSetupProcess, it does (even when called immediately). not sure why though. [21:38:06] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:38:40] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1001.eqiad.wmnet [21:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:58] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:39:04] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:17] urbanecm only https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805430 so far, will send a follow up to that shortly [21:40:58] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1002.eqiad.wmnet [21:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:41] DannyS712: we might want to just remove the //sanity comment per https://www.mediawiki.org/wiki/Inclusive_language, what do you think? 
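Editor's note on the `bind` / `setTimeout` exchange above (20:41 to 21:37): a minimal sketch, not the GrowthExperiments patch. `Dialog` and its fields are hypothetical stand-ins; only `updateRemainingMessageLength` is a name taken from the chat. It illustrates two points: `fn.bind( obj )` behaves like the `const that = this` closure DannyS712 described, and `setTimeout( fn )` with no delay does not run `fn` immediately but queues it to run after the currently executing code has finished.

```javascript
// Hypothetical stand-in for the dialog discussed in the chat.
function Dialog() {
	this.remaining = 42;
	this.log = [];
}

Dialog.prototype.updateRemainingMessageLength = function () {
	// Relies on `this` being the dialog instance.
	this.log.push( 'label updated, remaining=' + this.remaining );
};

const dialog = new Dialog();

// 1. Detaching the method loses `this`: inside the call, `this` is no
//    longer the dialog, so `this.log.push` throws a TypeError.
const unbound = dialog.updateRemainingMessageLength;
let unboundFailed = false;
try {
	unbound();
} catch ( e ) {
	unboundFailed = true;
}

// 2. Two equivalent ways to keep the right `this`.
const bound = dialog.updateRemainingMessageLength.bind( dialog );
const that = dialog;
const wrapped = function () {
	that.updateRemainingMessageLength();
};
bound();
wrapped();

// 3. setTimeout with no delay defers the callback until the current
//    synchronous code has finished; it does not run it on the spot.
const order = [];
setTimeout( function () {
	order.push( 'deferred' );
} );
order.push( 'synchronous' );
// Once the timer has fired, order is [ 'synchronous', 'deferred' ].
```

This only shows the `this` binding and the ordering. It does not explain why deferring from `initialize` was not enough in the real dialog while `getSetupProcess` was, which the chat leaves open.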
[21:43:26] (03PS3) 10DannyS712: phpcs: start to fix SpaceBeforeSingleLineComment.NewLineComment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805430 (https://phabricator.wikimedia.org/T171115) [21:43:27] done [21:44:16] (03CR) 10Urbanecm: [C: 03+2] phpcs: start to fix SpaceBeforeSingleLineComment.NewLineComment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805430 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [21:44:31] thanks, looks good, merging & deploying [21:44:40] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:07] (03Merged) 10jenkins-bot: phpcs: start to fix SpaceBeforeSingleLineComment.NewLineComment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805430 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [21:45:42] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:17] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1002.eqiad.wmnet [21:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:23] the other patch will be at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805431 once I make it [21:49:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:54] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1003.eqiad.wmnet [21:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:27] !log urbanecm@deploy1002 Synchronized docroot/: ca3b94f2d9bc755d92839e5e69072615ea9008df: phpcs: start to fix 
SpaceBeforeSingleLineComment.NewLineComment (T171115) (duration: 03m 38s) [21:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:31] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [21:53:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:53:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:55] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1003.eqiad.wmnet [21:54:56] !log urbanecm@deploy1002 Synchronized multiversion/: ca3b94f2d9bc755d92839e5e69072615ea9008df: phpcs: start to fix SpaceBeforeSingleLineComment.NewLineComment (T171115) (duration: 03m 29s) [21:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:02] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:17] scap says `21:55:58 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807 (ran as mwdeploy@wtp1026.eqiad.wmnet) returned [2]: Restarting php7.2-fpm: free opcache 720 MB, 2022-06-14 21:55:58,576 [ERROR] Error running command with poolcounter: timed out`, can someone check whether wtp1026 is up properly? 
[21:58:27] !log urbanecm@deploy1002 Synchronized rpc/: ca3b94f2d9bc755d92839e5e69072615ea9008df: phpcs: start to fix SpaceBeforeSingleLineComment.NewLineComment (T171115) (duration: 03m 31s) [21:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:32] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [21:58:51] complete output https://www.irccloud.com/pastebin/eqim5W9r/ [21:59:58] (actually, since this is from a scap for loop i started, let's see if it repeats now?) [22:00:07] !log wtp1026 - manually running '/usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807' [22:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:18] 2022-06-14 22:00:02,006 [INFO] Restarting the service [22:00:18] 2022-06-14 22:00:02,220 [INFO] Repooling previously pooled services [22:00:21] done [22:00:37] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:40] thanks mutante. 
so, hopefully just a temporary issue [22:00:57] * urbanecm just realized he can ssh into the host and re-run the command himself [22:00:57] given that it was one random server, yea [22:01:14] I had to add that long number [22:01:20] Restarting php7.2-fpm: free opcache 717 MB [22:01:59] "LB lvs1019:9090 reports pool parsoid-php_443/wtp1026.eqiad.wmnet as enabled/up/pooled, should be disabled/*/not pooled" [22:02:00] !log urbanecm@deploy1002 Synchronized src/: ca3b94f2d9bc755d92839e5e69072615ea9008df: phpcs: start to fix SpaceBeforeSingleLineComment.NewLineComment (T171115) (duration: 03m 32s) [22:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:01] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:03:02] that LB report looks suspicious to me [22:03:24] server is pooled [22:03:52] (03PS1) 10Sergio Gimeno: MentorDashboard: enable the Vue version of the dashboard in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) [22:04:00] it's just a warning that is a reply to the "Depooling currently pooled services" i think [22:04:07] ah [22:05:19] !log urbanecm@deploy1002 Synchronized w/: ca3b94f2d9bc755d92839e5e69072615ea9008df: phpcs: start to fix SpaceBeforeSingleLineComment.NewLineComment (T171115) (duration: 03m 18s) [22:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:24] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [22:05:36] DannyS712: synced :) [22:06:51] urbanecm thanks. second patch at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805431 is ready [22:07:01] it's WIP for me [22:07:06] (03CR) 10DannyS712: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/805431 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [22:07:12] fixed [22:07:14] thanks [22:07:56] DannyS712: fyi generally, similar changes are easier to do in one big batch (with the previous six ones, i deployed them all at once). so, just for the future :) [22:08:39] yeah, but that makes it harder for me to make the patch and increases the chances of a merge conflict, so I split them up but have them scheduled for deployment together so that you don't need a sync for each [22:09:19] i didn't mean squashing it into one patch. i meant requesting deployment together (like you did with the six ones during the window). [22:09:58] (03CR) 10Urbanecm: [C: 03+2] phpcs: fix more SpaceBeforeSingleLineComment.NewLineComment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805431 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [22:10:02] oh. well these I am writing in real time since you were available and the deployment calendar was open. If you want to wait on merging the current patch, I plan to send another [22:10:32] or if you want to wait on syncing :) [22:10:33] but also this patch and the one before it only overlap on phpcs.xml so it would be the same syncs anyway [22:10:34] DannyS712: actually, I'd prefer if this is the last cleanup for today. it's getting late, and I need to get up early tomorrow :)). can we continue tomorrow? [22:10:40] sure [22:10:48] (03Merged) 10jenkins-bot: phpcs: fix more SpaceBeforeSingleLineComment.NewLineComment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805431 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [22:10:51] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:55] thanks [22:11:09] I'll still make the patches to prepare. 
Should I add the two that you already merged to the deployment calendar? [22:11:40] not needed IMO, we're outside of the window anyway [22:12:10] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [22:12:10] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:15:03] !log urbanecm@deploy1002 Synchronized wmf-config/: e3fe6c04c95717f0f914bbfa366f5f827f392b6b: phpcs: fix more SpaceBeforeSingleLineComment.NewLineComment (T171115) (duration: 03m 39s) [22:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:08] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [22:15:15] DannyS712: and, live. 
see you later :)
[22:16:22] thanks
[22:16:30] and have a great night
[22:17:30] thanks :)
[22:17:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:18:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:55] (CR) Urbanecm: [C: +1] Structured task: enable free text for "other" rejection reason (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) (owner: MewOphaswongse)
[22:19:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:03] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:25:15] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:28:03] RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:31:03] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:32:05] PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:35:27] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:41:18] (CR) DannyS712: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[22:44:50] SRE, Infrastructure-Foundations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (Dzahn) there is always moar:) - deleted mobile@wikimedia.org - forwarded to inactive mailman list - deleted engineering@wikimedia.org - forwarded to non-exi...
[22:48:10] RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:49:03] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:54:47] PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:55:51] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:05:39] SRE, MediaWiki-Shell, WMF-General-or-Unknown, Security, Sustainability (Incident Followup): Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584 (TheDJ)
[23:08:01] (CR) Dzahn: [C: -1] "I looked at this and indeed it seems Jelto's concerns have been addressed and given he said he'd be ok using the docker network if it's for" [puppet] - 
https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:08:09] !log disabling puppet in gitlab-runners (via cumin /disable-puppet) before deploying gerrit:791655 to provide gitlab-runners with buildkit and new docker network - T308271
[23:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:13] T308271: Deploy buildkitd to trusted GitLab runners - https://phabricator.wikimedia.org/T308271
[23:09:15] (PS2) MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099)
[23:09:29] (CR) MewOphaswongse: Structured task: enable free text for "other" rejection reason (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) (owner: MewOphaswongse)
[23:09:49] (PS12) Dzahn: Provide buildkitd to GitLab runners [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:10:14] (CR) Dzahn: "PS12: removed quotes from port number in modules/buildkitd/manifests/init.pp" [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:11:36] (CR) Dzahn: [V: +1] "compiles now: https://puppet-compiler.wmflabs.org/pcc-worker1001/35853/" [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:12:02] (PS13) Dduvall: Provide buildkitd to GitLab runners [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271)
[23:12:05] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:12:41] (CR) Dduvall: "Ah, you beat me to it! Thanks, Daniel." 
[puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:13:38] (CR) Dzahn: [C: +2] "disabled puppet on gitlab-runners*. merging carefully only on first runner only" [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:14:59] (CR) Dzahn: [V: +2 C: +2] Provide buildkitd to GitLab runners [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:16:27] (CR) Dzahn: [V: +2 C: +2] "failed: '/usr/bin/docker network create --driver='bridge' --subnet='172.20.0.0/16' 'gitlab-runner'' returned 1 instead of one of [0]" [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:16:48] (CR) Dzahn: [V: +2 C: +2] "This was kind of my concern and it turned out to be true. creating the docker network fails" [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:17:21] (CR) DannyS712: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712)
[23:17:57] (CR) Dzahn: [V: +2 C: +2] "/usr/bin/docker network create --driver='bridge' --subnet='172.20.0.0/16' 'gitlab-runner'" [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:18:03] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:57] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:20:12] (CR) Dduvall: "Thanks for attempting to deploy this, and sorry for the failure. 
It's not clear to me how we could have tested this prior to deployment. I" [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:21:29] (CR) Dzahn: [V: +2 C: +2] Provide buildkitd to GitLab runners (2 comments) [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:28:47] (CR) Dzahn: [V: +2 C: +2] Provide buildkitd to GitLab runners (1 comment) [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:30:53] (PS1) Dzahn: Revert "Provide buildkitd to GitLab runners" [puppet] - https://gerrit.wikimedia.org/r/805434
[23:32:21] !log gitlab-runner1001 - restarting docker
[23:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:59] dduvall: lol, i found a fix though
[23:33:37] https://github.com/docker/compose/issues/4873 said "same issue on Ubuntu and was able to fix by restarting Docker" so I tried that and ..it's true
[23:33:56] (CR) CI reject: [V: -1] Revert "Provide buildkitd to GitLab runners" [puppet] - https://gerrit.wikimedia.org/r/805434 (owner: Dzahn)
[23:34:33] (CR) Dzahn: [V: +2 C: +2] "I found https://github.com/docker/compose/issues/4873 which said ""same issue on Ubuntu and was able to fix by restarting Docker" so I tri" [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:35:03] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:38:47] PROBLEM - Check systemd state on gitlab-runner1001 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:38:55] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: 
The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:31] !log gitlab-runner1001 - systemctl start buildkitd
[23:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:40:27] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:40:57] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:42:13] (PS2) Dzahn: Revert "Provide buildkitd to GitLab runners" [puppet] - https://gerrit.wikimedia.org/r/805434
[23:43:34] (CR) Dzahn: [V: +2 C: +2] "docker: Error response from daemon: Conflict. The container name "/buildkitd" is already in use by container "d482bdd745ac6b65f65c82f4d0fa" [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[23:45:42] (PS3) Dzahn: Revert "Provide buildkitd to GitLab runners" [puppet] - https://gerrit.wikimedia.org/r/805434
[23:46:46] (CR) Dzahn: [C: +2] Revert "Provide buildkitd to GitLab runners" [puppet] - https://gerrit.wikimedia.org/r/805434 (owner: Dzahn)
[23:48:53] (PS4) Dzahn: Revert "Provide buildkitd to GitLab runners" [puppet] - https://gerrit.wikimedia.org/r/805434 (https://phabricator.wikimedia.org/T308271)
[23:48:55] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:49:03] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:49:26] !log gitlab-runner1002 - systemctl restart docker; run-puppet-agent ; systemctl start buildkitd - 
fails though T308271
[23:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:31] T308271: Deploy buildkitd to trusted GitLab runners - https://phabricator.wikimedia.org/T308271
[23:49:46] (CR) Dzahn: [V: +2] Revert "Provide buildkitd to GitLab runners" [puppet] - https://gerrit.wikimedia.org/r/805434 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:52:05] !log gitlab-runner1001/1002 - clean revert not possible, icinga alerting about failed buildkitd service, manually deleting systemd unit and trying to clean up T308271
[23:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:52:41] RECOVERY - Check systemd state on gitlab-runner1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:55:59] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state