[00:00:01] (03CR) 10Dzahn: [C: 04-1] "I see an "outdir" variable changing.. not good" [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn) [00:00:04] RoanKattouw and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211123T0000). [00:00:04] nray and zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:19] hello o/ [00:00:33] hey o/ [00:01:42] 10SRE, 10ops-eqiad, 10serviceops-radar: mw1448.mgmt alert - https://phabricator.wikimedia.org/T296041 (10Dzahn) @Jclark-ctr thanks! confirmed working :) [00:02:52] I can deploy today, since no one else is here [00:03:01] urbanecm: thank you! [00:03:12] hi nray: + zabe: -- i'm hoping one of the seasoned deployers will show up but if not, i'm happy to do it [00:03:31] 10SRE, 10ops-codfw, 10serviceops: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10Dzahn) @akosiaris @Papaul thanks! ACK, we are done here then :)) [00:03:45] thanks cjming , it sounds like urbanecm is willing [00:03:59] thanks urbanecm \o/ [00:05:02] cjming: honestly, if you can deploy, I'd prefer to. It's 1am for me now, and while I'm still capable of handling it, i wouldn't mind someone else doing it :) [00:06:46] zabe: can you quickly explain why is a i18n patch backport needed? [00:07:20] urbanecm: ok - np! sorry i didn't realize it was so late for you [00:08:29] cjming: yeah, part of the reasons why I'm negotiating a new schedule with releng. 
[00:08:37] happy to stay up for a bit if that makes you more comfortable deploying [00:09:42] urbanecm: I introduced Special:Delete and Special:Protect in a patch and because i didn't think completely about what i was doing, i used the 'delete' and the 'protect' key which actually are already used. Diligent translators then went and translated the new messages, so now the 'delete' and 'protect' message is wrong in about 30 languages. [00:09:59] not good :/ [00:10:17] and why is the revert only in wmf.9? [00:10:23] or am i missing a master commit? [00:10:55] (03CR) 10Urbanecm: [C: 03+2] Restore ReadingDepth instrument [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740613 (https://phabricator.wikimedia.org/T294777) (owner: 10Nray) [00:11:01] (03CR) 10Urbanecm: [C: 03+2] Update access_method value in reading depth instrument [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740690 (https://phabricator.wikimedia.org/T294777) (owner: 10Nray) [00:11:07] +2'ing the backports to give CI time [00:11:13] we fixed that in master by renaming the special pages and start using different message keys. But backporting that is not really scap friendly. [00:11:55] urbanecm: i think i can handle it from here - if we run into a snag that idk how to resolve, i'll ping releng [00:12:14] cjming: sounds good. Not sure if you ever had to sync i18n though [00:12:24] (a full scap is needed for that) [00:12:40] (full scap = scap sync-world, ie. "sync everything") [00:13:19] hmm -- so urbanecm: are the deploy cmds not complete for that? 
[00:13:47] yeah, the traditional scap sync-file something commands don't work with i18n, because you need to rebuild i18n cache for it to take effect [00:13:51] (03Merged) 10jenkins-bot: Restore ReadingDepth instrument [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740613 (https://phabricator.wikimedia.org/T294777) (owner: 10Nray) [00:14:01] (03Merged) 10jenkins-bot: Update access_method value in reading depth instrument [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740690 (https://phabricator.wikimedia.org/T294777) (owner: 10Nray) [00:14:03] rebuilding i18n cache takes significant amount of time, and is only a part of scap sync-world [00:15:18] Normally it's not done (because of time, etc.), but in this case, the explanation given by zabe makes complete sense [00:15:59] urbanecm: i've never sync'd i18n [00:17:31] cjming: i see. Well, in that case, i feel it's better if i take on the window -- unless you want to try that out, of course. [00:18:20] (https://wikitech.wikimedia.org/w/index.php?title=How_to_deploy_code&mobileaction=toggle_view_desktop#More_complex_changes:_sync_everything would be docs) [00:18:32] (got fixed in master with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/738555, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/740244 and multiple L10n-bot commits) [00:18:40] urbanecm: it seems like you'll have to stay up either way and you might get to bed earlier if you're not having to train a noob [00:20:50] (03PS1) 10Gergő Tisza: Cherry-picked small fixes [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740694 [00:20:53] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/c/integration/config/+/739908 [00:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[00:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:47] (03CR) 10Urbanecm: [C: 03+2] Revert "Localisation updates from https://translatewiki.net." [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740688 (https://phabricator.wikimedia.org/T296203) (owner: 10Zabe) [00:21:51] (03CR) 10Urbanecm: [C: 03+2] Revert "Create redirect Special Pages for delete and protect action" [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740689 (https://phabricator.wikimedia.org/T295611) (owner: 10Zabe) [00:21:57] cjming: i guess you're right [00:22:00] okay, let's do it [00:22:49] nray: your patches are at mwdebug1001, can you test? [00:22:56] yes, thank you! [00:25:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:36] urbanecm: things look good, you can proceed [00:25:41] thanks, syncing both [00:25:47] thank you [00:28:35] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/WikimediaEvents/extension.json: 3f860c72bca817c40486b90f0d8e0ffca72b2690: Restore ReadingDepth instrument (1/2) (duration: 00m 56s) [00:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:25] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/WikimediaEvents: 3f860c7: fa9fbf1: WikimediaEvents bbackports (2/2; T294777) (duration: 00m 55s) [00:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:29] T294777: Restore reading depth schema - https://phabricator.wikimedia.org/T294777 [00:30:33] nray: should be live [00:30:40] (03PS5) 10Urbanecm: Enable reading depth instrumentation at low sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740667 (https://phabricator.wikimedia.org/T294777) (owner: 10Nray) [00:30:53] (03CR) 10Urbanecm: [C: 03+2] Enable reading depth 
instrumentation at low sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740667 (https://phabricator.wikimedia.org/T294777) (owner: 10Nray) [00:31:02] thank you! [00:31:39] (03Merged) 10jenkins-bot: Enable reading depth instrumentation at low sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740667 (https://phabricator.wikimedia.org/T294777) (owner: 10Nray) [00:32:44] nray: your config patch is now at mwdebug1001, can you test? [00:32:52] yes, testing now, thanks [00:35:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:24] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.14% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [00:39:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:11] urbanecm: things look good, you can proceed! [00:40:18] nray: thanks, syncing [00:41:38] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b9209433dfc8b1f81a165ec75867337800db24b1: Enable reading depth instrumentation at low sampling rate (T294777) (duration: 00m 56s) [00:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:42] T294777: Restore reading depth schema - https://phabricator.wikimedia.org/T294777 [00:41:44] nray: should be live [00:41:47] anything else? [00:41:56] that's it thanks so much for your time! [00:42:06] np! [00:42:08] sorry it is so late there :( [00:42:20] thanks [00:42:27] (03Merged) 10jenkins-bot: Revert "Localisation updates from https://translatewiki.net." 
[core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740688 (https://phabricator.wikimedia.org/T296203) (owner: 10Zabe) [00:42:33] (03Merged) 10jenkins-bot: Revert "Create redirect Special Pages for delete and protect action" [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740689 (https://phabricator.wikimedia.org/T295611) (owner: 10Zabe) [00:42:56] zabe: ad https://gerrit.wikimedia.org/r/c/mediawiki/core/+/740689, the correct sync order would be SpecialPageFactory => includes/specials => autoload (and then i18n)? [00:43:26] pulled to mwdebug1001 [00:43:45] yes (to the sync order) [00:43:58] ack [00:44:03] let me know how it look [00:44:10] note that i18n may not work well at debug srv [00:44:15] (cache likely wasn't rebuilded) [00:46:07] urbanecm: yeah, i18n messages are not updated on mwdebug. But it looks good, special pages disappeared and nothing seems to be breaking. [00:46:14] (03CR) 10jerkins-bot: [V: 04-1] Cherry-picked small fixes [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740694 (owner: 10Gergő Tisza) [00:46:18] sounds good [00:46:22] so, let's do that [00:48:02] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:48:20] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.9/includes/specialpage/SpecialPageFactory.php: 7c0e074: Revert "Create redirect Special Pages for delete and protect action" (T295611; T296203; 1/4) (duration: 00m 56s) [00:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:26] T295611: Special pages for delete and protect - https://phabricator.wikimedia.org/T295611 [00:48:26] T296203: monobook-action-delete and monobook-action-protect need to be changed back - https://phabricator.wikimedia.org/T296203 [00:49:33] !log urbanecm@deploy1002 Synchronized 
php-1.38.0-wmf.9/includes/specials/: 7c0e074: Revert "Create redirect Special Pages for delete and protect action" (T295611; T296203; 2/4) (duration: 00m 55s) [00:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:53] doing autoload now [00:49:56] hoping nothing breaks [00:50:04] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:50:43] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.9/autoload.php: 7c0e074: Revert "Create redirect Special Pages for delete and protect action" (T295611; T296203; 3/4) (duration: 00m 55s) [00:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:51:34] !log urbanecm@deploy1002 Started scap: 69aa4a7: 7c0e074: Revert "Create redirect Special Pages for delete and protect action" (T295611; T296203; 4/4) [00:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:38] and full scap goes now [00:51:56] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:53:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:03:54] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:05:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [01:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [01:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:24] !log urbanecm@deploy1002 Finished scap: 69aa4a7: 7c0e074: Revert "Create redirect Special Pages for delete and protect action" (T295611; T296203; 4/4) (duration: 25m 50s) [01:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:29] T295611: Special pages for delete and protect - https://phabricator.wikimedia.org/T295611 [01:17:30] T296203: monobook-action-delete and monobook-action-protect need to be changed back - https://phabricator.wikimedia.org/T296203 [01:17:31] zabe: finished [01:17:34] so, we're done [01:17:42] !log UTC late window done [01:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:49] thanks for your help :) [01:18:18] np [01:20:00] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [01:26:08] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [02:07:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.38.0-wmf.10 [core] (wmf/1.38.0-wmf.10) - 10https://gerrit.wikimedia.org/r/740707 [02:07:05] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.38.0-wmf.10 [core] (wmf/1.38.0-wmf.10) - 10https://gerrit.wikimedia.org/r/740707 (owner: 
10TrainBranchBot) [02:08:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:47] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: In Mailman3, users cannot change their display name from the web - https://phabricator.wikimedia.org/T283128 (10Legoktm) For reference, the query to fix someone's display name manually is `UPDATE address set display_name="" where email="" limit 1;` [02:29:03] (03Merged) 10jenkins-bot: Branch commit for wmf/1.38.0-wmf.10 [core] (wmf/1.38.0-wmf.10) - 10https://gerrit.wikimedia.org/r/740707 (owner: 10TrainBranchBot) [02:37:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:10] Going to wait ~15 minutes or so for the current backlog of saneitizer-attributable cirrussearch work to drain off the queue as it gets completed, then going to start another round of rolling restarts on codfw elasticsearch [02:50:26] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [02:52:28] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [02:54:19] Backlog chewed through: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&from=1637632978750&to=1637636004782 [02:54:54] Okay, we're at significantly lower query load on elasticsearch than at peak. Going to kick off another round of codfw rolling restarts. Starting out with trying 2 nodes at a time (down from the default of 3); if that's still too much we'll try 1 [02:55:06] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705 [02:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:11] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [02:55:13] !log T295705 `ryankemper@cumin1001:~$ sudo cookbook sre.elasticsearch.rolling-operation codfw "codfw plugin upgrade + restart" --upgrade --nodes-per-run 2 --start-datetime 2021-11-18T18:55:54 --task-id T295705` on tmux `rolling_restarts_codfw` [02:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:57:26] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart with plugin upgrade (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705 [02:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:58:03] !log T295705 `elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='search.svc.codfw.wmnet', port=9243): Read timed out. 
(read timeout=60))` Probably transient failure; will wait 10 mins and try again [02:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:41] (Trying again) [03:06:01] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705 [03:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:05] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [03:08:23] (03PS1) 10Ryan Kemper: cirrussearch: fix grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/740708 [03:09:06] (03CR) 10Ryan Kemper: "Pretty simple patch. Basically just need a sanity check that the grafana link works as intended (for example that the codfw one shows codf" [puppet] - 10https://gerrit.wikimedia.org/r/740708 (owner: 10Ryan Kemper) [03:17:20] PROBLEM - Host asw1-b12-drmrs.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [03:17:20] PROBLEM - Host asw1-b13-drmrs.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [03:18:04] PROBLEM - Host ganeti6003 is DOWN: PING CRITICAL - Packet loss = 100% [03:18:04] PROBLEM - Host ganeti6004 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:14] PROBLEM - Host ganeti6001 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:26] PROBLEM - Host ganeti6002 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:38] PROBLEM - Host asw1-b12-drmrs.wikimedia.org IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:42] PROBLEM - Host asw1-b13-drmrs.wikimedia.org IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [03:20:44] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:21:06] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:21:26] PROBLEM - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [03:21:40] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:30:28] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs01_sync.service,netbox_ganeti_drmrs02_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:33:52] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:37:42] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:37:57] !log rebuilding metadata of all djvu files outside of commons (T296001) [03:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:38:01] T296001: DjVuHandler: getDimensionInfoFromMetaTree: PHP Notice: Undefined index: pages - https://phabricator.wikimedia.org/T296001 [03:39:14] hmmm looks like we lost connectivity to drmrs, but it's on a temporary link and not in user-facing service, so NBD [03:39:22] I'll ack stuff up [03:40:59] ACKNOWLEDGEMENT - Juniper alarms on asw1-b12-drmrs.wikimedia.org is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 185.15.58.131 Brandon Black drmrs network link failure - no user-facing impacts! https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [03:40:59] ACKNOWLEDGEMENT - Host asw1-b12-drmrs.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black drmrs network link failure - no user-facing impacts! 
[03:40:59] ACKNOWLEDGEMENT - Host asw1-b12-drmrs.wikimedia.org IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black drmrs network link failure - no user-facing impacts! [03:40:59] ACKNOWLEDGEMENT - Juniper alarms on asw1-b13-drmrs.wikimedia.org is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 185.15.58.132 Brandon Black drmrs network link failure - no user-facing impacts! https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [03:40:59] ACKNOWLEDGEMENT - Host asw1-b13-drmrs.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black drmrs network link failure - no user-facing impacts! [03:40:59] ACKNOWLEDGEMENT - Host asw1-b13-drmrs.wikimedia.org IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black drmrs network link failure - no user-facing impacts! [03:40:59] ACKNOWLEDGEMENT - SSH on ganeti6001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black drmrs network link failure - no user-facing impacts! https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:41:00] ACKNOWLEDGEMENT - Host ganeti6001 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black drmrs network link failure - no user-facing impacts! [03:41:00] ACKNOWLEDGEMENT - SSH on ganeti6002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black drmrs network link failure - no user-facing impacts! https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:41:01] ACKNOWLEDGEMENT - Host ganeti6002 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black drmrs network link failure - no user-facing impacts! [03:41:01] ACKNOWLEDGEMENT - configured eth on ganeti6003 is CRITICAL: public reporting no carrier. Brandon Black drmrs network link failure - no user-facing impacts! https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [03:41:02] ACKNOWLEDGEMENT - SSH on ganeti6003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black drmrs network link failure - no user-facing impacts! 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:41:02] ACKNOWLEDGEMENT - Host ganeti6003 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black drmrs network link failure - no user-facing impacts! [03:41:03] ACKNOWLEDGEMENT - SSH on ganeti6004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black drmrs network link failure - no user-facing impacts! https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:41:03] ACKNOWLEDGEMENT - Host ganeti6004 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black drmrs network link failure - no user-facing impacts! [03:41:04] ACKNOWLEDGEMENT - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black drmrs network link failure - no user-facing impacts! [03:41:59] !log ladsgroup@mwmaint1002:~$ cat broken_imgs | xargs -I {} mwscript refreshImageMetadata.php --wiki=commonswiki --mediatype=OFFICE --verbose --mime 'image/*' --force --batch-size 1 --sleep 1 --start={} --end={} (T296001) [03:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:49] ACKNOWLEDGEMENT - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs01_sync.service Brandon Black drmrs network link failure - no user-facing impacts! https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:42:49] ACKNOWLEDGEMENT - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync Brandon Black drmrs network link failure - no user-facing impacts! https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:42:49] ACKNOWLEDGEMENT - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync Brandon Black drmrs network link failure - no user-facing impacts! 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:56:10] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:07:30] (03PS1) 10Ryan Kemper: cirrussearch: temporarily disable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/740710 (https://phabricator.wikimedia.org/T295705) [04:08:56] (03PS2) 10Ryan Kemper: cirrussearch: temporarily disable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/740710 (https://phabricator.wikimedia.org/T295705) [04:09:28] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:10:04] (03PS3) 10Ryan Kemper: cirrussearch: temporarily disable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/740710 (https://phabricator.wikimedia.org/T295705) [04:10:20] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/740710 (https://phabricator.wikimedia.org/T295705) (owner: 10Ryan Kemper) [04:13:38] (03CR) 10BryanDavis: [C: 03+1] wikimania_scholarships: delete module and profile, remove from miscweb [puppet] - 10https://gerrit.wikimedia.org/r/739658 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [04:15:34] (03CR) 10Ryan Kemper: [C: 03+2] cirrussearch: temporarily disable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/740710 (https://phabricator.wikimedia.org/T295705) (owner: 10Ryan Kemper) [04:17:38] !log T295705 Properly disabled the sane-itizer; we don't want it running until after we (a) complete rolling restarts and (b) restore the missing `commonswikI_file` index (which is blocked on the restarts) [04:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:43] T295705: Cleanup 
missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [04:22:54] (03PS1) 10Ryan Kemper: cirrussearch: s/sanitizer/saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/740711 (https://phabricator.wikimedia.org/T295705) [04:23:24] PROBLEM - Check for large files in client bucket on mwmaint1002 is CRITICAL: WARNING: large files in client bucket https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [04:24:39] (03CR) 10Ryan Kemper: "Had to disable the sane-itizer anyway, so this is a good time to fix the name." [puppet] - 10https://gerrit.wikimedia.org/r/740711 (https://phabricator.wikimedia.org/T295705) (owner: 10Ryan Kemper) [04:24:56] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/740711 (https://phabricator.wikimedia.org/T295705) (owner: 10Ryan Kemper) [04:38:41] (03CR) 10Jdrewniak: [C: 03+1] "LGTM. I'll schedule a backport window to deploy these changes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091) (owner: 10Clare Ming) [04:42:45] (03PS1) 10Razzi: superset: set webserver timeout to 180 seconds [puppet] - 10https://gerrit.wikimedia.org/r/740712 (https://phabricator.wikimedia.org/T294771) [04:43:16] (03PS5) 10Jdrewniak: Add new icons, wordmarks & taglines for several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091) (owner: 10Clare Ming) [04:50:19] (03Abandoned) 10Razzi: superset: make webserver timeout 3 minutes [puppet] - 10https://gerrit.wikimedia.org/r/740683 (https://phabricator.wikimedia.org/T294771) (owner: 10Razzi) [04:51:06] RECOVERY - Host ganeti6002 is UP: PING OK - Packet loss = 0%, RTA = 85.24 ms [04:51:06] RECOVERY - Host ganeti6003 is UP: PING OK - Packet loss = 0%, RTA = 85.21 ms [04:51:06] RECOVERY - Host asw1-b13-drmrs.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 85.42 ms [04:51:06] RECOVERY - Host 
asw1-b12-drmrs.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 85.38 ms [04:51:06] RECOVERY - Host ganeti6004 is UP: PING OK - Packet loss = 0%, RTA = 85.19 ms [04:52:26] RECOVERY - Host ganeti6001 is UP: PING OK - Packet loss = 0%, RTA = 85.19 ms [04:53:46] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:54:04] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:54:34] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:54:44] RECOVERY - Host asw1-b12-drmrs.wikimedia.org IPv6 is UP: PING OK - Packet loss = 0%, RTA = 89.29 ms [04:54:48] RECOVERY - Host asw1-b13-drmrs.wikimedia.org IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.44 ms [04:55:50] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:56:24] RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.60 ms [04:57:08] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:58:04] (03PS2) 10Razzi: superset: set webserver timeout to 180 seconds [puppet] - 10https://gerrit.wikimedia.org/r/740712 (https://phabricator.wikimedia.org/T294771) [05:01:42] (03CR) 10Razzi: "A small change on top of https://gerrit.wikimedia.org/r/c/operations/puppet/+/737738 - the setting we set only affects the sql lab feature" [puppet] - 10https://gerrit.wikimedia.org/r/740712 (https://phabricator.wikimedia.org/T294771) (owner: 10Razzi) [05:05:18] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the 
systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:10:48] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart with plugin upgrade (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705 [05:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:52] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [05:26:33] !log T295705 Rolling restart of `codfw` complete. `elastic2044` was manually restarted earlier today so the cookbook didn't restart it (b/c we pass in a datetime cutoff threshold) so I'm manually upgrading and restarting that host [05:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:38] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [05:28:59] !log T295705 Downtimed `elastic2044` for one hour and doing a full reboot for good measure. 
Already ran the plugin upgrade: `DEBIAN_FRONTEND=noninteractive sudo apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install elasticsearch-oss wmf-elasticsearch-search-plugins` [05:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:55] (03PS1) 10Marostegui: mariadb: Promote db1132 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/740714 (https://phabricator.wikimedia.org/T288720) [06:16:18] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover time" [puppet] - 10https://gerrit.wikimedia.org/r/740714 (https://phabricator.wikimedia.org/T288720) (owner: 10Marostegui) [06:16:48] (03PS2) 10Marostegui: mariadb: Promote db1132 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/740714 (https://phabricator.wikimedia.org/T288720) [06:37:08] PROBLEM - Host ms-be2058 is DOWN: PING CRITICAL - Packet loss = 100% [06:41:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1125.eqiad.wmnet with OS bullseye [06:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1125.eqiad.wmnet with OS bullseye [07:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:05] (03PS1) 10Giuseppe Lavagetto: trafficserver: rule for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/740763 (https://phabricator.wikimedia.org/T289224) [07:33:38] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740765 (owner: 10Awight) [07:34:13] (03PS2) 10Giuseppe Lavagetto: trafficserver: rule for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/740763 (https://phabricator.wikimedia.org/T289224) [07:45:13] (03CR) 10Awight: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/740766 (https://phabricator.wikimedia.org/T284203) (owner: 10Awight) [07:52:20] !log Adjusting BGP on cr1-eqiad and cr2-eqiad to keep MED unchanged in iBGP. [07:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:40] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Sure, makes sense!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740765 (owner: 10Awight) [08:04:13] (03PS6) 10Elukey: profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) [08:05:21] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32561/console" [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [08:06:41] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:07:31] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [08:07:57] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32562/console" [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [08:10:27] (03CR) 10Elukey: [V: 03+1] profile::base::certificates: deploy wmf-certificates only in prod (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [08:10:52] ^^^ Juniper alarms on cr2-eqiad is likely just me running "show route | no-more" on them making the CPU run hot. [08:10:56] Nothing to worry about. 
[08:12:22] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10Jcross) @Jelto You have my approval as Manfredi's manager. [08:12:45] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [08:14:09] ^^ this one also [08:14:14] nearly complete now [08:17:21] (03PS2) 10Muehlenhoff: Add ownership annotations for additional Traffic services [puppet] - 10https://gerrit.wikimedia.org/r/738262 (https://phabricator.wikimedia.org/T216088) [08:17:44] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [08:18:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:19:51] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:55] 10SRE, 10ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (10Peachey88) [08:21:46] (03CR) 10Muehlenhoff: [C: 03+2] Add ownership annotations for additional Traffic services [puppet] - 10https://gerrit.wikimedia.org/r/738262 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [08:22:11] (03CR) 10Awight: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/740772 (owner: 10Awight) [08:22:45] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [08:23:01] (03PS9) 10Muehlenhoff: New cookbook to reboot a VM on the Ganeti level [cookbooks] - 10https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) [08:31:35] 10SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (10elukey) I am wondering what is best to do for use cases like: * https://gerrit.wikimedia.org/r/c/operations/puppet/+/739463 (not merged yet) * https://gerrit.wikimedia.org/r/c/operations/puppet/+/739806 (merged,... [08:41:27] (03PS10) 10Muehlenhoff: New cookbook to reboot a VM on the Ganeti level [cookbooks] - 10https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) [08:41:31] (03CR) 10Muehlenhoff: New cookbook to reboot a VM on the Ganeti level (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [08:41:49] (03CR) 10Jcrespo: [C: 03+1] mariadb: Promote db1132 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/740714 (https://phabricator.wikimedia.org/T288720) (owner: 10Marostegui) [08:43:10] (03PS2) 10Gergő Tisza: Cherry-picked small fixes [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740694 [08:43:14] (03PS1) 10Gergő Tisza: Structured task caching/filtering cherry-picks, step 1 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740774 [08:43:21] (03PS1) 10Gergő Tisza: Structured task caching/filtering cherry-picks step 2 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740775 [08:43:27] (03PS1) 10Gergő Tisza: Structured task caching/filtering cherry-picks step 3 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740776 [08:43:33] (03PS1) 10Gergő 
Tisza: Add Image: Validate GEInfoboxTemplates size [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740777 (https://phabricator.wikimedia.org/T294518) [08:44:00] (03PS1) 10Inductiveload: OSD: Add a ready hook for scripts [extensions/ProofreadPage] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740778 (https://phabricator.wikimedia.org/T180569) [08:44:21] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:45:54] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10KSiebert) @MMandere Is there any more action required from my side? [08:46:29] 10SRE, 10Community-Tech, 10LDAP-Access-Requests: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10KSiebert) This must be that kind of request as well then: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Dashboards_in_Superset_/_Hive_interfaces_(like_Hue)... [08:49:48] (03PS2) 10Muehlenhoff: Add ownership annotations for more Service SRE services [puppet] - 10https://gerrit.wikimedia.org/r/738426 (https://phabricator.wikimedia.org/T216088) [08:51:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [08:52:05] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Sure, works!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/740766 (https://phabricator.wikimedia.org/T284203) (owner: 10Awight) [08:56:53] (03CR) 10Jobo: [V: 03+2] admin: let parsoid-test-admins see parsoid logs and restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/739851 (https://phabricator.wikimedia.org/T295900) (owner: 10Jelto) [08:57:47] (03CR) 10Awight: [C: 04-1] "I realize now that the patch needs to be split up for safe deployment, since CommonSettings depends on the InitialiseSettings files. If I" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740772 (owner: 10Awight) [08:59:59] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "This is a valid issue and should be fixed, even if it is only in a PHPDoc comment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737193 (owner: 10Awight) [09:05:48] !log fixing incorrect grants of wikiadmin on localhost in s6 master in codfw with replication [09:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:09] (03CR) 10jerkins-bot: [V: 04-1] Add Image: Validate GEInfoboxTemplates size [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740777 (https://phabricator.wikimedia.org/T294518) (owner: 10Gergő Tisza) [09:09:27] (03CR) 10jerkins-bot: [V: 04-1] Structured task caching/filtering cherry-picks step 2 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740775 (owner: 10Gergő Tisza) [09:09:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1124.eqiad.wmnet with OS bullseye [09:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:51] (03CR) 10Jobo: [V: 03+2] admin: let parsoid-test-admins run 'sudo mysql..' 
on test servers [puppet] - 10https://gerrit.wikimedia.org/r/739647 (https://phabricator.wikimedia.org/T295900) (owner: 10Dzahn) [09:10:51] (03PS1) 10Vgutierrez: prometheus::ops: Gather varnish mtail metrics on text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/740781 (https://phabricator.wikimedia.org/T290005) [09:12:37] (03PS2) 10Muehlenhoff: admin: let parsoid-test-admins see parsoid logs and restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/739851 (https://phabricator.wikimedia.org/T295900) (owner: 10Jelto) [09:13:12] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32563/console" [puppet] - 10https://gerrit.wikimedia.org/r/740781 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:16:44] !log dropping useless GRANTs on s6 eqiad master without replication (T296274) [09:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:48] T296274: Clean up wikiadmin GRANTs mess - https://phabricator.wikimedia.org/T296274 [09:23:18] (03PS11) 10Muehlenhoff: New cookbook to reboot a VM on the Ganeti level [cookbooks] - 10https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) [09:23:22] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:49] (03CR) 10jerkins-bot: [V: 04-1] New cookbook to reboot a VM on the Ganeti level [cookbooks] - 10https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [09:26:11] (03CR) 10Volans: [C: 03+1] "LGTM, last spaces nit inline. 
No need for re-review" [cookbooks] - 10https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [09:26:37] (03PS12) 10Muehlenhoff: New cookbook to reboot a VM on the Ganeti level [cookbooks] - 10https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) [09:26:44] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: turn off grafana db sync ahead of 8.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/740682 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [09:27:27] (03CR) 10Ema: [C: 03+1] prometheus::ops: Gather varnish mtail metrics on text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/740781 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:27:39] !log dropping useless GRANTs on s6 eqiad replicas without replication (T296274) [09:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:43] T296274: Clean up wikiadmin GRANTs mess - https://phabricator.wikimedia.org/T296274 [09:27:48] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [09:28:18] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] prometheus::ops: Gather varnish mtail metrics on text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/740781 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:30:55] (03PS1) 10Vgutierrez: cumin: Add cache::text_haproxy nodes on A:cp-text [puppet] - 10https://gerrit.wikimedia.org/r/740783 (https://phabricator.wikimedia.org/T290005) [09:32:10] (03CR) 10ArielGlenn: snapshot: replace the word cron everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn) [09:34:22] (03CR) 10Muehlenhoff: [C: 03+2] New cookbook to reboot a VM on the Ganeti level [cookbooks] - 10https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [09:35:22] (03PS1) 
10Inductiveload: OSD: Handle cases where the image srcset attr is not set [extensions/ProofreadPage] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740784 (https://phabricator.wikimedia.org/T296260) [09:37:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1124.eqiad.wmnet with OS bullseye [09:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:51] PROBLEM - Host text-lb.eqsin.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:00] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [09:40:09] this paged [09:40:11] what's up [09:40:15] PROBLEM - Host text-lb.ulsfo.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:15] PROBLEM - Host ncredir-lb.eqsin.wikimedia.org is DOWN: CRITICAL - Time to live exceeded (103.102.166.226) [09:40:21] PROBLEM - Host upload-lb.eqsin.wikimedia.org is DOWN: CRITICAL - Time to live exceeded (103.102.166.240) [09:40:23] * volans acking the page [09:40:25] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:27] PROBLEM - Host text-lb.eqsin.wikimedia.org is DOWN: CRITICAL - Time to live exceeded (103.102.166.224) [09:40:30] here [09:40:34] around [09:40:35] PROBLEM - Host ncredir-lb.ulsfo.wikimedia.org is DOWN: CRITICAL - Time to live exceeded (198.35.26.98) [09:40:37] <_joe_> ok, let's check the problem is not just monitoring [09:40:37] em, what's going on? 
[09:40:43] PROBLEM - Host upload-lb.ulsfo.wikimedia.org is DOWN: CRITICAL - Time to live exceeded (198.35.26.112) [09:40:44] PROBLEM - Host pfw3-codfw is DOWN: CRITICAL - Time to live exceeded (208.80.153.197) [09:40:45] PROBLEM - Host ncredir-lb.ulsfo.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:50] PROBLEM - Host text-lb.ulsfo.wikimedia.org is DOWN: CRITICAL - Time to live exceeded (198.35.26.96) [09:40:58] <_joe_> I can ping text-lb.eqsin.wikimedia.org with no problem [09:40:58] loop? [09:41:08] PROBLEM - Host ncredir-lb.eqsin.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:10] here [09:41:18] around if I can help [09:41:41] topranks: cr1-cr2 routing loop [09:41:54] ok [09:41:54] at least to ulsfo prefixes [09:41:56] need an IC? I can be one [09:41:58] in eqiad [09:42:07] <_joe_> I can browse the wikis via eqsin [09:42:11] (03PS1) 10MMandere: admin: Add ksiebert to analytic privatedata group [puppet] - 10https://gerrit.wikimedia.org/r/740787 (https://phabricator.wikimedia.org/T295777) [09:42:16] confirmed loop [09:42:21] ae0.cr1-eqiad.wikimedia.org [09:42:22] ae0.cr2-eqiad.wikimedia.org [09:42:38] with traceroute ncredir-lb.eqsin.wikimedia.org from alert1001 for example [09:42:41] from eqiad to ulsfo and eqsin only for now [09:43:02] PROBLEM - Host upload-lb.eqsin.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:43:08] I have reverted the change that I assume caused this [09:43:12] (03CR) 10jerkins-bot: [V: 04-1] admin: Add ksiebert to analytic privatedata group [puppet] - 10https://gerrit.wikimedia.org/r/740787 (https://phabricator.wikimedia.org/T295777) (owner: 10MMandere) [09:43:23] XioNoX: does this mean also return traffic? or is just alerting affected? 
[09:43:36] * volans keeping acking alerts [09:43:41] I expect it all [09:43:51] PROBLEM - BGP status on pfw3-eqiad is CRITICAL: BGP CRITICAL - AS64701/IPv4: Idle - frack-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:44:02] so basically what's the impact for real users [09:44:15] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/740788 (https://phabricator.wikimedia.org/T295628) [09:44:15] [also here if necessary] [09:44:55] topranks: how long should take to recover? [09:45:03] still seeing the loop AFAICT [09:45:03] (03PS1) 10Jforrester: [BETA CLUSTER] Create wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740789 (https://phabricator.wikimedia.org/T284162) [09:45:05] (03PS1) 10Jforrester: [BETA CLUSTER] Configure wikifunctionswiki in wikiversions-labs.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740790 (https://phabricator.wikimedia.org/T284162) [09:45:07] (03PS1) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: I - IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740791 (https://phabricator.wikimedia.org/T289315) [09:45:09] (03PS1) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: II - Services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740792 (https://phabricator.wikimedia.org/T289315) [09:45:11] (03PS1) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: III - CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740793 (https://phabricator.wikimedia.org/T289315) [09:45:13] (03PS1) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: IV - IS-Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740794 (https://phabricator.wikimedia.org/T289315) [09:45:15] (03PS1) 10Jforrester: [DNM] Miscellaneous config things for new wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [09:45:31] it's also not all hosts [09:45:31] volans: sill 
trying to scope it out [09:45:37] a minute or two, if it's not recovered now there is another problem [09:45:43] I think it's only to reach BGP prefixes [09:45:45] <_joe_> I can still see the wikis from eqsin [09:45:49] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:51] <_joe_> but maybe we should depool? [09:45:56] (03PS2) 10Jforrester: [WIP] deployment-prep: Add wikifunctions.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/714068 (https://phabricator.wikimedia.org/T284162) [09:45:57] so it shouldn't have much user impact [09:46:04] (03PS3) 10Jforrester: deployment-prep: Add wikifunctions.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/714068 (https://phabricator.wikimedia.org/T284162) [09:46:22] (03CR) 10jerkins-bot: [V: 04-1] [BETA CLUSTER] Create wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740789 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [09:46:27] eg. it's to reach LVS VIPs between eqiad and ulsfo/eqsin [09:46:29] (03CR) 10jerkins-bot: [V: 04-1] Initial Beta Cluster deployment of Wikifunctions: II - Services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740792 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [09:46:39] (03CR) 10jerkins-bot: [V: 04-1] Initial Beta Cluster deployment of Wikifunctions: III - CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740793 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [09:46:41] can you paste? [09:46:56] <_joe_> can I suggest moving the discussion to #sre? 
[09:47:02] (03CR) 10jerkins-bot: [V: 04-1] Initial Beta Cluster deployment of Wikifunctions: IV - IS-Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740794 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [09:47:07] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Miscellaneous config things for new wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 (owner: 10Jforrester) [09:47:15] topranks: paste? [09:47:17] _joe_: sure [09:47:54] cmooney@cr2-eqsin> traceroute 208.80.154.224 source 103.102.166.130 no-resolve wait 1 [09:47:54] traceroute to 208.80.154.224 (208.80.154.224) from 103.102.166.130, 30 hops max, 52 byte packets [09:47:54] 1 103.102.166.140 0.486 ms 0.535 ms 0.555 ms [09:47:54] 2 103.102.166.139 217.334 ms 270.922 ms 217.526 ms [09:47:54] 3 208.80.153.220 247.585 ms 248.114 ms 247.134 ms [09:47:55] 4 208.80.154.224 247.499 ms 247.148 ms 247.307 ms [09:47:58] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/740788 (https://phabricator.wikimedia.org/T295628) (owner: 10Kosta Harlan) [09:48:03] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:48:38] topranks: use a pastebin, otherwise the network is going to kick you off if you paste so many lines at once :P [09:50:37] (03CR) 10Gergő Tisza: "recheck" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740775 (owner: 10Gergő Tisza) [09:51:29] RECOVERY - Host text-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 246.98 ms [09:51:31] RECOVERY - Host ncredir-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 250.81 ms [09:51:37] RECOVERY - Host upload-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 253.87 ms [09:51:52] (03CR) 10Gergő Tisza: "recheck" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740777 (https://phabricator.wikimedia.org/T294518) (owner: 
10Gergő Tisza) [09:51:53] RECOVERY - Host ncredir-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 68.37 ms [09:51:57] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/740788 (https://phabricator.wikimedia.org/T295628) (owner: 10Kosta Harlan) [09:52:05] RECOVERY - Host upload-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 68.18 ms [09:52:07] RECOVERY - Host pfw3-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.07 ms [09:52:17] RECOVERY - Host text-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 68.22 ms [09:52:31] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [09:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:34] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:45] (03PS2) 10Jforrester: [BETA CLUSTER] Create wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740789 (https://phabricator.wikimedia.org/T284162) [09:53:47] (03PS2) 10Jforrester: [BETA CLUSTER] Configure wikifunctionswiki in wikiversions-labs.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740790 (https://phabricator.wikimedia.org/T284162) [09:53:49] (03PS2) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: I - IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740791 (https://phabricator.wikimedia.org/T289315) [09:53:51] (03PS2) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: II - Services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740792 (https://phabricator.wikimedia.org/T289315) [09:53:53] (03PS2) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: III - CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740793 
(https://phabricator.wikimedia.org/T289315) [09:53:55] (03PS2) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: IV - IS-Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740794 (https://phabricator.wikimedia.org/T289315) [09:53:57] (03PS2) 10Jforrester: [DNM] Miscellaneous config things for new wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [09:54:33] (03PS1) 10Vgutierrez: site: Reimage cp5012 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/740796 (https://phabricator.wikimedia.org/T290005) [09:54:56] (03CR) 10jerkins-bot: [V: 04-1] [BETA CLUSTER] Create wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740789 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [09:54:59] (03CR) 10jerkins-bot: [V: 04-1] Initial Beta Cluster deployment of Wikifunctions: II - Services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740792 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [09:55:07] (03CR) 10jerkins-bot: [V: 04-1] Initial Beta Cluster deployment of Wikifunctions: III - CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740793 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [09:55:15] (03CR) 10jerkins-bot: [V: 04-1] Initial Beta Cluster deployment of Wikifunctions: IV - IS-Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740794 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [09:55:35] godog: FYI the ms-fe2009 alert above is for ImportError: No module named monotonic [09:55:53] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Miscellaneous config things for new wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 (owner: 10Jforrester) [09:56:12] volans: thank you, freshly provisioned host [09:57:34] !log cordoned kubestage1001.eqiad.wmnet kubestage1002.eqiad.wmnet - T293729 [09:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:38] 
T293729: setup/install kubestage100[34] - https://phabricator.wikimedia.org/T293729 [09:58:51] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) Ok to try to get more clarity on the situation I briefly re-enabled the cr1-eqiad to cr2-eqord BGP session. But despite this I am not really seeing... [09:58:53] (03PS3) 10Jforrester: [BETA CLUSTER] Create wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740789 (https://phabricator.wikimedia.org/T284162) [09:58:55] (03PS3) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: I - IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740791 (https://phabricator.wikimedia.org/T289315) [09:58:57] (03PS3) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: II - Services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740792 (https://phabricator.wikimedia.org/T289315) [09:58:59] (03PS3) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: III - CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740793 (https://phabricator.wikimedia.org/T289315) [09:59:01] (03PS3) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: IV - IS-Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740794 (https://phabricator.wikimedia.org/T289315) [09:59:03] (03PS3) 10Jforrester: [DNM] Miscellaneous config things for new wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [09:59:08] (03CR) 10Vgutierrez: [C: 03+2] cumin: Add cache::text_haproxy nodes on A:cp-text [puppet] - 10https://gerrit.wikimedia.org/r/740783 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:00:53] (03PS4) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: III - CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740793 (https://phabricator.wikimedia.org/T289315) [10:00:55] (03PS4) 10Jforrester: Initial Beta Cluster 
deployment of Wikifunctions: IV - IS-Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740794 (https://phabricator.wikimedia.org/T289315) [10:00:57] (03PS4) 10Jforrester: [DNM] Miscellaneous config things for new wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [10:01:04] !log depool cp5012 to be reimaged as cache::text_haproxy - T290005 [10:01:04] (03PS1) 10Urbanecm: Backport localisation updates [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740797 [10:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:09] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:02:24] (03PS3) 10ArielGlenn: [WIP] add enterprise html dumps downloader settings file and systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/737854 (https://phabricator.wikimedia.org/T273585) [10:02:36] RECOVERY - BGP status on pfw3-eqiad is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:02:54] jouncebot: nowandnext [10:02:54] No deployments scheduled for the next 1 hour(s) and 57 minute(s) [10:02:54] In 1 hour(s) and 57 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211123T1200) [10:03:11] (03CR) 10Urbanecm: [C: 03+2] "deploying" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740797 (owner: 10Urbanecm) [10:03:16] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [10:03:17] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10MMandere) @KSiebert not at the moment but I'll let you know if additional information is needed from you. 
[10:03:38] (CR) Vgutierrez: [C: +1] "vgutierrez@cp3050:~$ nc -zv apple-search.discovery.wmnet 4013" [puppet] - https://gerrit.wikimedia.org/r/740763 (https://phabricator.wikimedia.org/T289224) (owner: Giuseppe Lavagetto)
[10:04:49] (CR) Vgutierrez: [C: +2] site: Reimage cp5012 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/740796 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[10:05:49] (CR) Awight: VisualEditor template dialog: new sidebar and inline descriptions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/740766 (https://phabricator.wikimedia.org/T284203) (owner: Awight)
[10:05:55] (CR) Urbanecm: [V: +2 C: +2] "broken CI in wmf.9" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740797 (owner: Urbanecm)
[10:08:29] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp5012.eqsin.wmnet with OS buster
[10:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:52] !log urbanecm@deploy1002 Started scap: c98acaa2ab27e630c0a1b55a64fb81b131c087f9: Backport localisation updates
[10:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:58] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5012.eqsin.wmnet with OS buster
[10:10:14] (PS1) Elukey: knative-serving: add stricter network policies for the activator pod [deployment-charts] - https://gerrit.wikimedia.org/r/740804 (https://phabricator.wikimedia.org/T289834)
[10:15:03] (CR) Elukey: [C: +2] knative-serving: add stricter network policies for the activator pod [deployment-charts] - https://gerrit.wikimedia.org/r/740804 (https://phabricator.wikimedia.org/T289834)
(owner: Elukey)
[10:17:02] (PS1) Muehlenhoff: ganeti.reboot-vm: Add missing argument for Ganeti cluster selection [cookbooks] - https://gerrit.wikimedia.org/r/740805
[10:18:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:09] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:58] !log urbanecm@deploy1002 Finished scap: c98acaa2ab27e630c0a1b55a64fb81b131c087f9: Backport localisation updates (duration: 11m 06s)
[10:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1121.eqiad.wmnet with reason: Maintenance T296143
[10:22:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1121.eqiad.wmnet with reason: Maintenance T296143
[10:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:23] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[10:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T296143)', diff saved to https://phabricator.wikimedia.org/P17800 and previous config saved to /var/cache/conftool/dbconfig/20211123-102234-ladsgroup.json
[10:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:02] Amir1: for db1121 you can run it with replication, so it replicates to clouddb hosts
[10:23:19] noted, thanks
[10:23:22] make sure to downtime also the replicas from db1121
[10:23:25] as they will get lagged
[10:23:30] (CR) Jbond: [C: +1] "lgtm thx" [puppet] - https://gerrit.wikimedia.org/r/740389
(https://phabricator.wikimedia.org/T296127) (owner: Elukey)
[10:23:37] okay
[10:25:14] (PS2) Thiemo Kreuz (WMDE): Make use of the ?? operator in some more situations [mediawiki-config] - https://gerrit.wikimedia.org/r/740305
[10:28:14] marostegui: do you know why cookbook sre.hosts.downtime --hours 4 -r "Maintenance T296143" "db1155.eqiad.wmnet and clouddb[1015,1019,1021].eqiad.wmnet" gives no hosts provided?
[10:28:15] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[10:28:24] what did I messed up
[10:28:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maintenance T296143
[10:29:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maintenance T296143
[10:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1155.eqiad.wmnet with reason: Maintenance T296143
[10:29:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1155.eqiad.wmnet with reason: Maintenance T296143
[10:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance T296143
[10:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance T296143
[10:30:15] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:17] Amir1: worked for me
[10:30:36] I think I found out, it needed "or"
[10:30:49] aha, it needed the comma
[10:30:51] okay
[10:31:24] Amir1: either the comma or P{db1155*} or P{clouddb[1015,1019,1021]*}
[10:31:36] quicker is db1155*,clouddb[1015,1019,1021]*
[10:31:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[10:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:48] RECOVERY - Host ncredir-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 68.21 ms
[10:31:54] RECOVERY - Host upload-lb.eqsin.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 253.88 ms
[10:32:00] RECOVERY - Host text-lb.eqsin.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 246.99 ms
[10:32:10] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 68.22 ms
[10:32:19] topranks: interesting that we're getting those recoveries just now, anything changed in the last few minutes?
[10:32:25] volans: oh thanks.
This is fancy
[10:32:30] RECOVERY - Host ncredir-lb.eqsin.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 246.91 ms
[10:32:34] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 68.18 ms
[10:32:56] (CR) Volans: [C: +1] "LGTM, lol, missed this one" [cookbooks] - https://gerrit.wikimedia.org/r/740805 (owner: Muehlenhoff)
[10:33:25] (PS3) Jelto: admin: let parsoid-test-admins see parsoid logs and restart php-fpm, mysql [puppet] - https://gerrit.wikimedia.org/r/739851 (https://phabricator.wikimedia.org/T295900)
[10:34:25] (CR) Jelto: admin: let parsoid-test-admins see parsoid logs and restart php-fpm, mysql (1 comment) [puppet] - https://gerrit.wikimedia.org/r/739851 (https://phabricator.wikimedia.org/T295900) (owner: Jelto)
[10:34:30] volans: to add insult to injury I failed to do all the steps for both v6 and v4, did it for v6 about 2 mins ago.
[10:35:27] ack, got it :)
[10:38:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[10:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=cache_haproxy_tls site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:40:10] <_joe_> vgutierrez: ^^ I guess expected?
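The host-selection exchange above (10:28 to 10:31) comes down to set semantics: a query joined with `and` asks for hosts matching both terms, while the comma or `or` asks for hosts matching either. A minimal Python sketch of that difference, using plain sets rather than cumin itself; the `expand` helper is hypothetical and only handles a single bracketed list, as in the hostnames quoted in the log:

```python
import re


def expand(pattern):
    """Expand one bracketed list, e.g. 'clouddb[1015,1019].x' -> two hostnames.

    Hypothetical helper for illustration only; this is not cumin's parser.
    """
    m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if m is None:
        # No bracket expression: the pattern names a single host.
        return {pattern}
    prefix, items, suffix = m.groups()
    return {prefix + item + suffix for item in items.split(",")}


db = expand("db1155.eqiad.wmnet")
clouddb = expand("clouddb[1015,1019,1021].eqiad.wmnet")

# "A and B": hosts that are in both groups (intersection).
hosts_and = db & clouddb
# "A or B" / "A,B": hosts that are in either group (union).
hosts_or = db | clouddb

print(sorted(hosts_and))
print(sorted(hosts_or))
```

No hostname belongs to both groups, so the intersection is empty, which would be consistent with the "no hosts provided" result reported above; the union holds the four hosts that the comma form selected.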
[10:40:56] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/739851 (https://phabricator.wikimedia.org/T295900) (owner: Jelto)
[10:41:11] (CR) Muehlenhoff: [C: +2] ganeti.reboot-vm: Add missing argument for Ganeti cluster selection [cookbooks] - https://gerrit.wikimedia.org/r/740805 (owner: Muehlenhoff)
[10:42:37] (PS2) Gergő Tisza: Structured task caching/filtering cherry-picks step 2 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740775
[10:42:45] (PS2) Gergő Tisza: Structured task caching/filtering cherry-picks step 3 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740776
[10:42:51] (PS2) Gergő Tisza: Add Image: Validate GEInfoboxTemplates size [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740777 (https://phabricator.wikimedia.org/T294518)
[10:43:37] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:44:58] jouncebot: nowandnext
[10:44:58] No deployments scheduled for the next 1 hour(s) and 15 minute(s)
[10:44:58] In 1 hour(s) and 15 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211123T1200)
[10:46:01] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:48:51] _joe_: yeah sorry, only one host under that role..
and reimages in eqsin are quite slow so prometheus tries to fetch the metrics before the node's been properly reimaged
[10:49:33] cp5012 is still running its first puppet run
[10:49:42] I'm not able to login yet using my ssh key :)
[10:49:45] <_joe_> yeah np
[10:49:54] <_joe_> it's a typical race condition
[10:50:34] <_joe_> so what happens is that puppetdb already has the catalog
[10:50:49] <_joe_> because it gets submitted at compile time, not at the end of the agent run
[10:51:08] <_joe_> so in the interval, prometheus will see the target but services are still not actually up
[10:52:56] (CR) Gergő Tisza: "recheck" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740777 (https://phabricator.wikimedia.org/T294518) (owner: Gergő Tisza)
[10:53:34] (PS1) Ladsgroup: Set test wikis to write both for actor temp table migration [mediawiki-config] - https://gerrit.wikimedia.org/r/740807 (https://phabricator.wikimedia.org/T275246)
[10:56:50] (CR) Majavah: "Does this need entries in cache::alternate_domains (hieradata/role/common/cache/text), like I did in I335ee474e3667b9288bf41927b959f0936ba" [puppet] - https://gerrit.wikimedia.org/r/740763 (https://phabricator.wikimedia.org/T289224) (owner: Giuseppe Lavagetto)
[10:57:03] (CR) Ladsgroup: [C: +2] Set test wikis to write both for actor temp table migration [mediawiki-config] - https://gerrit.wikimedia.org/r/740807 (https://phabricator.wikimedia.org/T275246) (owner: Ladsgroup)
[10:57:19] Puppet, DBA, Data-Engineering, Infrastructure-Foundations: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (jcrespo)
[10:57:38] is there a way to "downtime" dynamically prometheus targets?
[10:57:43] Puppet, DBA, Data-Engineering, Infrastructure-Foundations: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (jcrespo)
[10:57:55] (Merged) jenkins-bot: Set test wikis to write both for actor temp table migration [mediawiki-config] - https://gerrit.wikimedia.org/r/740807 (https://phabricator.wikimedia.org/T275246) (owner: Ladsgroup)
[10:58:53] Puppet, DBA, Data-Engineering, Infrastructure-Foundations: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (jcrespo) ^What do you think #data-engineering people?
[11:00:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:01:12] (PS2) Giuseppe Lavagetto: image-suggestion: fix the private files position [deployment-charts] - https://gerrit.wikimedia.org/r/740523
[11:01:15] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:01:22] (PS2) MMandere: admin: Add ksiebert to analytic privatedata group [puppet] - https://gerrit.wikimedia.org/r/740787 (https://phabricator.wikimedia.org/T295777)
[11:01:32] everything looks good, syncing
[11:01:59] SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (jbond) Like most things cloud related i think its worth splitting this a bit further and say that we have three scenarios to work with production, cloud and deployment-prep. And it is worth noting that for the ma...
[11:02:21] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:740807|Set test wikis to write both for actor temp table migration (T275246)]] (duration: 00m 56s)
[11:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:26] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246
[11:03:21] (CR) Giuseppe Lavagetto: [C: +2] image-suggestion: fix the private files position [deployment-charts] - https://gerrit.wikimedia.org/r/740523 (owner: Giuseppe Lavagetto)
[11:03:37] Puppet, DBA, Data-Engineering, Infrastructure-Foundations: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (BTullis) Thanks @jcrespo - I'm happy with that proposed change and with the naming convention. > ...pro...
[11:04:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:42] !log uncordoned kubestage1001.eqiad.wmnet kubestage1002.eqiad.wmnet (we have issues with POD IP prefix allocation) - T293729
[11:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:46] T293729: setup/install kubestage100[34] - https://phabricator.wikimedia.org/T293729
[11:05:55] !log cordoned kubestage1003.eqiad.wmnet kubestage1004.eqiad.wmnet (we have issues with POD IP prefix allocation) - T293729
[11:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:18] Puppet, DBA, Data-Engineering, Infrastructure-Foundations: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (jcrespo) > it will give us greater flexibility if and when we want the dbstore* and db* configurations...
[11:07:29] (Merged) jenkins-bot: image-suggestion: fix the private files position [deployment-charts] - https://gerrit.wikimedia.org/r/740523 (owner: Giuseppe Lavagetto)
[11:08:14] !log start of mwscript migrateRevisionActorTemp.php --wiki=testwiki --sleep=5 (T275246)
[11:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:18] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246
[11:08:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:40] Puppet, DBA, Data-Engineering, Infrastructure-Foundations: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (BTullis) Understood, thanks. Well I'm on-board with it.
[11:12:07] (CR) Muehlenhoff: [C: +1] "Looks good!"
[puppet] - https://gerrit.wikimedia.org/r/740787 (https://phabricator.wikimedia.org/T295777) (owner: MMandere)
[11:15:30] !log pool cp5012 (text) using HAProxy as TLS terminator - T290005
[11:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:35] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[11:16:09] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:30] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5012.eqsin.wmnet with OS buster
[11:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:38] SRE-swift-storage: Swift-recon -d overstates disk capacity and usage - https://phabricator.wikimedia.org/T294016 (MatthewVernon) https://review.opendev.org/c/openstack/swift/+/818881 is this patch sent upstream.
[11:17:44] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5012.eqsin.wmnet with OS buster c...
[11:17:45] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[11:18:30] (CR) Jelto: [C: +2] admin: let parsoid-test-admins see parsoid logs and restart php-fpm, mysql [puppet] - https://gerrit.wikimedia.org/r/739851 (https://phabricator.wikimedia.org/T295900) (owner: Jelto)
[11:18:47] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez)
[11:19:57] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[11:21:57] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:37] !log powercycle ms-be2058 - down and nothign on console
[11:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:56] (CR) Btullis: [C: +1] "LGTM2 :-)" [puppet] - https://gerrit.wikimedia.org/r/740233 (owner: Mforns)
[11:27:14] (PS1) Giuseppe Lavagetto: shellbox: allow scraping of the monitoring port [deployment-charts] - https://gerrit.wikimedia.org/r/740810
[11:29:21] RECOVERY - Host ms-be2058 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms
[11:30:23] PROBLEM - Host ms-be2058.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[11:33:25] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:33] (PS2) Giuseppe Lavagetto: php apps: allow scraping of the monitoring port [deployment-charts] -
https://gerrit.wikimedia.org/r/740810
[11:35:01] (PS1) Majavah: apple-search: use image 2021-11-15-220540-production [deployment-charts] - https://gerrit.wikimedia.org/r/740812
[11:35:49] RECOVERY - Host ms-be2058.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.17 ms
[11:36:05] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (fgiunchedi) Hi @papaul, the regex should cover all ms-fe2* hosts (i.e. ms-fe2 plus three digits), though you are right that it could be another spot where we...
[11:37:50] SRE-swift-storage: swift-proxy not starting on ms-fe2009 due to missing python-monotonic - https://phabricator.wikimedia.org/T296289 (fgiunchedi)
[11:38:57] (CR) Filippo Giunchedi: [C: +2] install_server: tweak reserved space for prometheus [puppet] - https://gerrit.wikimedia.org/r/740633 (https://phabricator.wikimedia.org/T294302) (owner: Filippo Giunchedi)
[11:39:52] (PS1) Jcrespo: mariadb: Split the dbstore_multiinstance role into two others [puppet] - https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285)
[11:41:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM pybal-test2002.codfw.wmnet
[11:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM pybal-test2002.codfw.wmnet
[11:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:30] SRE-swift-storage: swift-proxy not starting on ms-fe2009 due to missing python-monotonic - https://phabricator.wikimedia.org/T296289 (fgiunchedi)
[11:41:54] moritzm: that was too quick :D
[11:42:29] if this is the usual time I'm afraid the icinga logic would not work
[11:42:32] in the cookbook
[11:42:38] (CR) Btullis: Add more alerts to the data-engineering team (1 comment) [alerts] -
https://gerrit.wikimedia.org/r/735669 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[11:44:18] volans: yeah, indeed :-) poking at the cookbook
[11:45:17] moritzm: ah, it might miss a remote's wait_reboot_since() and puppet's wait_since()
[11:46:17] (PS4) Btullis: Add more alerts to the data-engineering team [alerts] - https://gerrit.wikimedia.org/r/735669 (https://phabricator.wikimedia.org/T293399)
[11:46:25] (CR) Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1003/32564/" [puppet] - https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: Jcrespo)
[11:47:15] indeed, adding that now
[11:49:56] (PS1) Muehlenhoff: ganeti.reboot-vm: Correctly wait for recovery [cookbooks] - https://gerrit.wikimedia.org/r/740817
[11:50:17] _joe_: I noticed that we aren't using the latest apple-search image, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/740812/
[11:51:04] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/740817 (owner: Muehlenhoff)
[11:51:20] (PS2) Jcrespo: mariadb: Split the dbstore_multiinstance role into two others [puppet] - https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285)
[11:51:39] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[11:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:49] !log btullis@cumin1001 END (ERROR) - Cookbook sre.aqs.roll-restart (exit_code=97) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[11:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:31] (CR) jerkins-bot: [V: -1] ganeti.reboot-vm: Correctly wait for recovery [cookbooks] - https://gerrit.wikimedia.org/r/740817 (owner: Muehlenhoff)
[11:53:39] (PS2) Muehlenhoff: ganeti.reboot-vm: Correctly wait for recovery [cookbooks] - https://gerrit.wikimedia.org/r/740817
[11:54:45] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-aarora-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:56] !log btullis@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching O:aqs: restarting to pick up new JRE - btullis@cumin1001
[11:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:56] (CR) Muehlenhoff: [C: +2] ganeti.reboot-vm: Correctly wait for recovery [cookbooks] - https://gerrit.wikimedia.org/r/740817 (owner: Muehlenhoff)
[11:58:05] Looking at that `jupyter-aarora-singleuser.service` - Arguably it shouldn't create an alert.
[11:58:09] <_joe_> majavah: yeah i also need to release a fix to the chart
[11:58:53] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211123T1200).
[12:00:04] eigyan and inductiveload: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:15] o/
[12:00:16] \o
[12:00:51] let’s start with inductiveload’s backports then
[12:01:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM pybal-test2003.codfw.wmnet
[12:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:50] (CR) Lucas Werkmeister (WMDE): [C: +2] OSD: Add a ready hook for scripts [extensions/ProofreadPage] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740778 (https://phabricator.wikimedia.org/T180569) (owner: Inductiveload)
[12:03:11] (CR) Giuseppe Lavagetto: [C: +2] php apps: allow scraping of the monitoring port [deployment-charts] - https://gerrit.wikimedia.org/r/740810 (owner: Giuseppe Lavagetto)
[12:04:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM pybal-test2003.codfw.wmnet
[12:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:40] (CR) Lucas Werkmeister (WMDE): [C: +2] OSD: Handle cases where the image srcset attr is not set (1 comment) [extensions/ProofreadPage] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740784 (https://phabricator.wikimedia.org/T296260) (owner: Inductiveload)
[12:06:59] (Merged) jenkins-bot: php apps: allow scraping of the monitoring port [deployment-charts] - https://gerrit.wikimedia.org/r/740810 (owner: Giuseppe Lavagetto)
[12:08:50] SRE, Wikimedia-Mailing-lists: Request to create new mailing lists for ZHAFC Project - https://phabricator.wikimedia.org/T294676 (LClightcat) >>! 在T294676#7522173中,@Legoktm写道: > Both lists have been created, please update the description and any other settings as necessary (or re-open if something is tota...
[12:09:45] (PS3) Arturo Borrero Gonzalez: openstack: codfw1dev: deploy general cinder keyring in cinder-backups nodes [puppet] - https://gerrit.wikimedia.org/r/740579 (https://phabricator.wikimedia.org/T292546)
[12:09:46] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apple-search' for release 'main' .
[12:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:17] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'apple-search' for release 'main' .
[12:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:54] (CR) Giuseppe Lavagetto: [C: +2] apple-search: use image 2021-11-15-220540-production [deployment-charts] - https://gerrit.wikimedia.org/r/740812 (owner: Majavah)
[12:17:11] (Merged) jenkins-bot: apple-search: use image 2021-11-15-220540-production [deployment-charts] - https://gerrit.wikimedia.org/r/740812 (owner: Majavah)
[12:17:13] (PS7) Jbond: public_cloud: Add public_clouds_shutdown to global config [puppet] - https://gerrit.wikimedia.org/r/740545
[12:17:15] (PS1) Jbond: R:varnish:instance: Add genral public cloud rate limiting [puppet] - https://gerrit.wikimedia.org/r/740818
[12:17:33] (CR) Jbond: "updated thanks" [puppet] - https://gerrit.wikimedia.org/r/740545 (owner: Jbond)
[12:17:42] (CR) David Caro: [C: +1] "This is because: https://wiki.openstack.org/wiki/OSSN/OSSN-0085" [puppet] - https://gerrit.wikimedia.org/r/740579 (https://phabricator.wikimedia.org/T292546) (owner: Arturo Borrero Gonzalez)
[12:18:58] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: codfw1dev: deploy general cinder keyring in cinder-backups nodes [puppet] - https://gerrit.wikimedia.org/r/740579 (https://phabricator.wikimedia.org/T292546) (owner: Arturo Borrero Gonzalez)
[12:20:19] (PS1) Muehlenhoff: sre.ganeti.reboot-vm: Only select from list of clusters [cookbooks] -
https://gerrit.wikimedia.org/r/740819
[12:21:15] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apple-search' for release 'main' .
[12:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:48] (Merged) jenkins-bot: OSD: Add a ready hook for scripts [extensions/ProofreadPage] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740778 (https://phabricator.wikimedia.org/T180569) (owner: Inductiveload)
[12:22:16] Lucas_WMDE: ^
[12:22:25] thanks, I got distracted ^^
[12:22:28] sorry
[12:22:48] (CR) jerkins-bot: [V: -1] sre.ganeti.reboot-vm: Only select from list of clusters [cookbooks] - https://gerrit.wikimedia.org/r/740819 (owner: Muehlenhoff)
[12:23:22] inductiveload: the first backport (ready hook for scripts) should be on mwdebug1001, can you test it?
[12:24:18] that is working
[12:24:24] (Merged) jenkins-bot: OSD: Handle cases where the image srcset attr is not set [extensions/ProofreadPage] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740784 (https://phabricator.wikimedia.org/T296260) (owner: Inductiveload)
[12:24:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[12:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:36] ok, then I can sync that one
[12:25:56] (PS2) Muehlenhoff: sre.ganeti.reboot-vm: Only select from list of clusters [cookbooks] - https://gerrit.wikimedia.org/r/740819
[12:26:27] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/ProofreadPage/modules/page/ext.proofreadpage.page.edit.js: Backport: [[gerrit:740778|OSD: Add a ready hook for scripts (T180569)]] (duration: 00m 56s)
[12:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:31] T180569: Fire mw.hook() event as soon as edit page rearrangement has been completed - https://phabricator.wikimedia.org/T180569
[12:27:02] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'apple-search' for release 'main' .
[12:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:08] inductiveload: and now the second change (no srcset fix) is also on mwdebug1001, please test again :)
[12:27:31] that is also working :-)
[12:27:37] ok :)
[12:28:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[12:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:10] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/ProofreadPage/modules/page/ext.proofreadpage.page.edit.js: Backport: [[gerrit:740784|OSD: Handle cases where the image srcset attr is not set (T296260)]] (duration: 00m 56s)
[12:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:14] T296260: Zooming page images not working on some proofread pages on de.wikisource (lacks srcset attribute) - https://phabricator.wikimedia.org/T296260
[12:29:17] * Lucas_WMDE keeps an eye on the client errors dashboards
[12:29:44] eigyan: hi! are you ready for the config change deployment?
[12:29:55] yes I am [12:30:02] dewikisource has errors anyway, AFAIK that's local scripts [12:31:20] (03CR) 10Ssingh: [C: 03+1] admin: Add ksiebert to analytic privatedata group [puppet] - 10https://gerrit.wikimedia.org/r/740787 (https://phabricator.wikimedia.org/T295777) (owner: 10MMandere) [12:31:26] (03PS20) 10Lucas Werkmeister (WMDE): Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [12:31:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [12:31:54] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apple-search' for release 'main' . [12:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:15] (03Merged) 10jenkins-bot: Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [12:32:19] (03CR) 10MMandere: [C: 03+2] admin: Add ksiebert to analytic privatedata group [puppet] - 10https://gerrit.wikimedia.org/r/740787 (https://phabricator.wikimedia.org/T295777) (owner: 10MMandere) [12:32:55] hm, I guess this can’t actually be tested on mwdebug, since it’s beta-only [12:33:01] I’ll just sync it [12:33:11] and then you should be able to test it on beta in 10-20 minutes, eigyan [12:33:35] Lucas_WMDE sweet thanks [12:33:46] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Jelto) All team members should have access now and should be able to execute the needed commands. I'm closing this task. Feel free to... 
[12:34:02] (03PS3) 10Ssingh: admin: Add ksiebert to analytic privatedata group [puppet] - 10https://gerrit.wikimedia.org/r/740787 (https://phabricator.wikimedia.org/T295777) (owner: 10MMandere) [12:34:37] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:737503|Set up beta test environment for QuickSurveys (T293798)]] (beta only) (duration: 00m 55s) [12:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:42] T293798: QA needs a Quick Survey test environment on beta - https://phabricator.wikimedia.org/T293798 [12:35:03] ok, hook docs updated in the MW.org extensions docs, I'm all done: thank you very much [12:35:25] \o/ [12:35:31] (03CR) 10MMandere: [V: 03+2 C: 03+2] admin: Add ksiebert to analytic privatedata group [puppet] - 10https://gerrit.wikimedia.org/r/740787 (https://phabricator.wikimedia.org/T295777) (owner: 10MMandere) [12:36:51] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:37:47] speaking of client errors, I'm getting spammed with "Referrer Policy: Less restricted policies, including ‘no-referrer-when-downgrade’, ‘origin-when-cross-origin’ and ‘unsafe-url’, will be ignored soon for the cross-site request:" in the firefox console [12:38:07] they're only info level, though [12:38:59] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:39:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:10] !log aborrero@cumin1001 START - Cookbook sre.ganeti.makevm for new host cloudbackup1002-dev.eqiad.wmnet [12:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:48] <_joe_> majavah: https://grafana.wikimedia.org/d/SaQD7Dp7k/apple-search :) [12:45:41] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32566/console" [puppet] - 10https://gerrit.wikimedia.org/r/740545 (owner: 10Jbond) [12:46:07] <_joe_> some tuning to do I think (like, adding some apache workers, reduce the memory footprint...) but I think we're all set to switch the traffic tomorrow [12:46:12] !log UTC morning backport+config window done [12:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:15] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10MMandere) 05Open→03Resolved a:03MMandere @KSiebert you now should be able to see private data on Superset. Please feel free to reach o... [12:46:20] (03PS1) 10Arturo Borrero Gonzalez: cloud: introduce cloudbackup1002-dev.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/740820 (https://phabricator.wikimedia.org/T295584) [12:50:06] 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10Majavah) [12:50:17] (03CR) 10Ema: [V: 03+1 C: 03+1] "One more thing we could add for testing is temporarily setting the value for one node in this CR and looking at PCC output. 
Right now, wit" [puppet] - 10https://gerrit.wikimedia.org/r/740545 (owner: 10Jbond) [12:52:08] !log aborrero@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host cloudbackup1002-dev.eqiad.wmnet [12:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:23] Lucas_WMDE I have verified my survey is now deployed in beta...thank you! [12:53:36] (03PS2) 10Arturo Borrero Gonzalez: cloud: introduce cloudbackup1002-dev.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/740820 (https://phabricator.wikimedia.org/T295584) [12:54:00] great! sorry for delaying it a bit, I’m glad a solution to limit it to enwiki was found :) [12:54:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: introduce cloudbackup1002-dev.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/740820 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [12:58:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching O:aqs: restarting to pick up new JRE - btullis@cumin1001 [12:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:09] (03PS1) 10MMandere: admin: Add samwilson to analytic privatedata group [puppet] - 10https://gerrit.wikimedia.org/r/740826 (https://phabricator.wikimedia.org/T296161) [13:00:14] (03PS1) 10Arturo Borrero Gonzalez: cloud: codfw1dev: fix keyring owner/group for cinder-backups [puppet] - 10https://gerrit.wikimedia.org/r/740827 (https://phabricator.wikimedia.org/T292546) [13:01:09] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host prometheus2005.codfw.wmnet with OS bullseye [13:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:31] (03CR) 10jerkins-bot: [V: 04-1] cloud: codfw1dev: fix keyring owner/group for cinder-backups [puppet] - 10https://gerrit.wikimedia.org/r/740827 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) 
[13:02:13] (03PS1) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 [13:02:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:02:32] 10SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (10elukey) Adding some notes: 1) A big use case for `profile::base::certificates` is to create jks truststore for java, that require an entry for every trusted certificate. The wmf-certificates package is currently... [13:03:19] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:03:35] (03PS2) 10Jbond: R:varnish:instance: Add general public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740818 [13:04:26] (03PS2) 10Arturo Borrero Gonzalez: cloud: codfw1dev: fix keyring owner/group for cinder-backups [puppet] - 10https://gerrit.wikimedia.org/r/740827 (https://phabricator.wikimedia.org/T292546) [13:04:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:05:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32570/console" [puppet] - 10https://gerrit.wikimedia.org/r/740818 (owner: 10Jbond) [13:06:27] (03PS3) 10Muehlenhoff: sre.ganeti.reboot-vm: Query the cluster name from Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/740819 [13:07:16] 10SRE, 10vm-requests, 10Patch-For-Review, 10cloud-services-team (Kanban): eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (10aborrero) created second VM: `lang=shell-session aborrero@cumin1001:~ $ sudo cookbook sre.ganeti.makevm eqiad_B cloudbackup1002-dev --vcpus 2 --memory 4... [13:10:16] (03CR) 10Ssingh: [C: 03+1] admin: Add samwilson to analytic privatedata group [puppet] - 10https://gerrit.wikimedia.org/r/740826 (https://phabricator.wikimedia.org/T296161) (owner: 10MMandere) [13:11:56] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloud: refresh keyname for codfw1dev cinder backups [labs/private] - 10https://gerrit.wikimedia.org/r/740829 (https://phabricator.wikimedia.org/T292546) [13:12:27] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hiera: cloud: refresh keyname for codfw1dev cinder backups [labs/private] - 10https://gerrit.wikimedia.org/r/740829 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [13:12:44] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1132 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/740714 (https://phabricator.wikimedia.org/T288720) (owner: 10Marostegui) [13:12:50] (03PS3) 10Marostegui: mariadb: Promote db1132 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/740714 (https://phabricator.wikimedia.org/T288720) [13:14:13] (03PS8) 10Jbond: public_cloud: Add public_clouds_shutdown to global config [puppet] - 10https://gerrit.wikimedia.org/r/740545 [13:14:15] 
(03PS3) 10Jbond: R:varnish:instance: Add genral public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740818 [13:14:17] (03PS2) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 [13:14:34] (03CR) 10Filippo Giunchedi: [C: 03+1] Add more alerts to the data-engineering team (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/735669 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [13:15:08] (03CR) 10Filippo Giunchedi: [C: 03+1] upgrade ecs to 1.11.0 [software/ecs] - 10https://gerrit.wikimedia.org/r/735417 (https://phabricator.wikimedia.org/T294581) (owner: 10Cwhite) [13:15:19] (03PS3) 10Arturo Borrero Gonzalez: cloud: codfw1dev: fix keyring owner/group for cinder-backups [puppet] - 10https://gerrit.wikimedia.org/r/740827 (https://phabricator.wikimedia.org/T292546) [13:16:45] PROBLEM - Host ms-be2058 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:41] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/32575/cloudbackup1001-dev.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/740827 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [13:18:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: codfw1dev: fix keyring owner/group for cinder-backups [puppet] - 10https://gerrit.wikimedia.org/r/740827 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [13:22:38] (03CR) 10Volans: [C: 03+1] "LGTM small optional nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/740819 (owner: 10Muehlenhoff) [13:29:55] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus2005.codfw.wmnet with OS bullseye [13:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:13] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host prometheus2006.codfw.wmnet with OS bullseye [13:31:15] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:14] In 20 minutes we will switch m5 master [13:43:31] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: logstash: add production logstash profile [puppet] - 10https://gerrit.wikimedia.org/r/732438 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [13:44:22] (03CR) 10Filippo Giunchedi: [C: 03+1] "+Cole as a FYI, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/739662 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [13:45:16] 10SRE, 10Analytics-Radar, 10SRE Observability, 10Wikimedia-Logstash, and 2 others: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10fgiunchedi) [13:45:44] 10SRE, 10SRE Observability, 10Wikimedia-Logstash, 10observability: Move wikimania-scholarships from udp2log to syslog - https://phabricator.wikimedia.org/T215499 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Closing since it looks like scholarships is being replaced [13:47:15] 10SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (10jbond) >In the "only wmf-certificate" option, IIUC we should create profile::base::certificates::trusted_ca::path, but what values would it be for say deployment-prep? it would be `/etc/ssl/localcerts/wmf_trusted_... 
[13:48:53] !log add 80G to prometheus global in eqiad [13:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:30] !log powercycle (again) ms-be2058 [13:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:03] (03PS7) 10Jbond: profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [13:54:48] (03CR) 10jerkins-bot: [V: 04-1] profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [13:55:05] (03CR) 10Ema: [C: 03+1] R:varnish:instance: Add genral public cloud rate limiting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740818 (owner: 10Jbond) [13:55:59] RECOVERY - Host ms-be2058 is UP: PING OK - Packet loss = 0%, RTA = 33.07 ms [13:56:48] (03PS8) 10Jbond: profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [13:57:27] (03PS3) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 [13:58:25] (03CR) 10jerkins-bot: [V: 04-1] profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [13:58:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32577/console" [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [13:59:12] andrewbogott, bd808, Amir1 around? [13:59:19] o/ [13:59:23] \o [13:59:38] o/ [14:00:08] let's go for it? 
[14:00:20] sure [14:00:25] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus2006.codfw.wmnet with OS bullseye [14:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:33] !log Failover m5 from db1128 to db1132 - T288720 [14:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:37] T288720: Failover m5 master (db1128) to db1132 to upgrade its kernel - https://phabricator.wikimedia.org/T288720 [14:01:22] all done [14:01:34] let's check services [14:01:47] striker seems ok [14:02:03] (03CR) 10Jbond: [V: 03+1] profile::base::certificates: deploy wmf-certificates only in prod (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [14:02:12] mailman looks okay [14:02:16] toolhub? [14:02:17] (03CR) 10Muehlenhoff: sre.ganeti.reboot-vm: Query the cluster name from Netbox (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/740819 (owner: 10Muehlenhoff) [14:02:35] andrewbogott: striker just gave my test account a duplicate "welcome" message https://wikitech.wikimedia.org/w/index.php?title=User_talk:Majavah_test&action=history&curid=448691 [14:02:36] orchestrator shows lag, but that's expected and i will clean it up once we are happy about the switch [14:02:55] majavah: that's me editing your test account :) [14:02:57] no, two duplicate welcome messages [14:02:59] (03PS9) 10Jbond: profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [14:03:04] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [14:03:04] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . 
[14:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:17] majavah: I think it's normal behavior, I just wanted to make sure Striker can still write to the db [14:03:29] * bd808 finally realizes which window was pinging [14:04:12] (03PS4) 10Jbond: R:varnish:instance: Add genral public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) [14:04:26] (03PS4) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) [14:04:34] I'm testing writes in mailman and it's slow like the last time but it just needs time [14:04:35] (03PS5) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) [14:04:36] toolhub seems good to me too? [14:04:42] I can search stuff [14:04:58] if we can generate a write, that'd be a good test [14:05:02] marostegui: yeah, toolhub looks good to me [14:05:10] excellent thanks bd808 [14:05:10] mailman write is also good https://lists.wikimedia.org/hyperkitty/list/test@lists.wikimedia.org/thread/45NOME7GHJPZSVXUXH2DIDBOG4ZRE2N2/ [14:05:24] so I think we are done then? :) [14:05:36] yep! [14:05:46] thank you all very much!!! [14:05:57] (03PS6) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 [14:06:01] if it breaks for whatever reason, just restart mailman and mailman-web services in lists1001.wikimedia.org [14:07:29] Great! Thanks again - I am going to continue with the clean up steps [14:07:35] RO time was 17 seconds [14:09:12] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . 
[14:09:13] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [14:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:21] 10SRE-swift-storage, 10ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (10fgiunchedi) [14:10:39] (03PS7) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 [14:14:13] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Daimona) Not as much as before. As for the email, it can be changed if necessary; otherwise, I'd rather leave my volunteer one. [14:14:42] (03PS8) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) [14:16:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32580/console" [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [14:17:46] (03PS1) 10Marostegui: dbproxy10{17,21}: Change m5 standby host [puppet] - 10https://gerrit.wikimedia.org/r/740839 (https://phabricator.wikimedia.org/T288720) [14:18:33] (03PS9) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) [14:18:54] (03PS4) 10Muehlenhoff: sre.ganeti.reboot-vm: Query the cluster name from Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/740819 [14:18:58] (03CR) 10Jbond: [V: 03+1] "pcc should look cleaner once the earlier changes have been merged" [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [14:19:49] 
(03CR) 10Marostegui: [C: 04-2] "Do not push it yet until db1132 is considered ok" [puppet] - 10https://gerrit.wikimedia.org/r/740839 (https://phabricator.wikimedia.org/T288720) (owner: 10Marostegui) [14:20:01] (03CR) 10Ladsgroup: [C: 03+1] dbproxy10{17,21}: Change m5 standby host [puppet] - 10https://gerrit.wikimedia.org/r/740839 (https://phabricator.wikimedia.org/T288720) (owner: 10Marostegui) [14:21:53] (03PS1) 10Jbond: WIP: do not merge - CR to test varnish changes [puppet] - 10https://gerrit.wikimedia.org/r/740842 [14:25:36] (03CR) 10Jcrespo: [C: 03+1] dbproxy10{17,21}: Change m5 standby host [puppet] - 10https://gerrit.wikimedia.org/r/740839 (https://phabricator.wikimedia.org/T288720) (owner: 10Marostegui) [14:29:20] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32582/console" [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [14:30:10] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. 
[14:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:26] !log jbond@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=puppetboard [14:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:09] (03CR) 10Volans: [C: 04-1] "Missing one call" [cookbooks] - 10https://gerrit.wikimedia.org/r/740819 (owner: 10Muehlenhoff) [14:39:51] (03PS5) 10Muehlenhoff: sre.ganeti.reboot-vm: Query the cluster name from Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/740819 [14:40:07] (03CR) 10Muehlenhoff: sre.ganeti.reboot-vm: Query the cluster name from Netbox (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/740819 (owner: 10Muehlenhoff) [14:40:19] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. [14:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:25] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/740819 (owner: 10Muehlenhoff) [14:42:09] (03CR) 10Elukey: [V: 03+1] "The issue is still here:" [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [14:42:35] (03PS10) 10Elukey: profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) [14:43:49] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32583/console" [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [14:44:44] 10Puppet, 10Infrastructure-Foundations: Add check for puppetboard - https://phabricator.wikimedia.org/T296304 (10jbond) p:05Triage→03Medium [14:47:02] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 
1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32584/console" [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [14:48:12] (03CR) 10Elukey: [V: 03+1] "John (and others) Let me know if the change is ok, from my point of view it does what we need (no op in prod, clean up wmf-certificates in" [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [15:01:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2010.codfw.wmnet with OS stretch [15:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:04] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe2010.codfw.wmnet with OS stretch [15:02:56] (03CR) 10Jbond: [C: 03+1] profile::base::certificates: deploy wmf-certificates only in prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [15:14:34] (03CR) 10MVernon: [C: 03+2] profile::thanos::swift: add account for research datasets poc [puppet] - 10https://gerrit.wikimedia.org/r/737913 (https://phabricator.wikimedia.org/T294380) (owner: 10MVernon) [15:16:56] (03CR) 10MVernon: [V: 03+2 C: 03+2] profile::thanos::swift: fake creds for research_poc [labs/private] - 10https://gerrit.wikimedia.org/r/737915 (https://phabricator.wikimedia.org/T294380) (owner: 10MVernon) [15:19:26] (03PS1) 10Giuseppe Lavagetto: apple-search: fine-tune limits for production usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/740849 [15:20:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] apple-search: fine-tune limits for production usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/740849 (owner: 10Giuseppe 
Lavagetto) [15:24:43] (03Merged) 10jenkins-bot: apple-search: fine-tune limits for production usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/740849 (owner: 10Giuseppe Lavagetto) [15:26:45] (03CR) 10Btullis: [C: 03+2] analytics:refinery:job:druid_load: Reduce shard size for netflow_sanitized [puppet] - 10https://gerrit.wikimedia.org/r/740233 (owner: 10Mforns) [15:27:01] !log rolling restart of thanos frontends T294380 [15:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:06] T294380: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 [15:27:42] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks: Detect duplicate ptr records [puppet] - 10https://gerrit.wikimedia.org/r/740850 (https://phabricator.wikimedia.org/T296144) [15:29:30] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-dnsleaks: Detect duplicate ptr records [puppet] - 10https://gerrit.wikimedia.org/r/740850 (https://phabricator.wikimedia.org/T296144) (owner: 10Andrew Bogott) [15:31:45] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [15:31:51] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apple-search' for release 'main' . [15:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:48] Emperor: merged your labs-private changes [15:33:08] elukey: oh, sorry, do those need puppet-merging (on regular puppetmaster) too? [15:34:06] Emperor: np! It was only as FYI, puppet-merge checks for diffs in there as first step (so I merged) [15:41:53] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apple-search' for release 'main' . 
[15:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:29] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740857 (owner: 10Awight) [15:43:40] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740856 (owner: 10Awight) [15:43:55] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740855 (owner: 10Awight) [15:44:33] (03Abandoned) 10Awight: Simplify many $wmg- configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740772 (owner: 10Awight) [15:45:18] (03PS1) 10JMeybohm: calico: Allow to configure the IPAM module [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) [15:46:15] (03PS2) 10JMeybohm: calico: Allow to configure the IPAM module [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T29630) [15:46:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2010.codfw.wmnet with OS stretch [15:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:54] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe2010.codfw.wmnet with OS stretch completed: - ms-fe2010 (*... [15:46:56] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'apple-search' for release 'main' . 
[15:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:11] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10Papaul) @fgiunchedi don't know what happen this time but the second re-image works. maybe i missed something yesterday. But all good on ms-fe2010 now. Thanks. [15:54:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2011.codfw.wmnet with OS stretch [15:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:44] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe2011.codfw.wmnet with OS stretch [15:55:51] (03PS1) 10Giuseppe Lavagetto: thanos::frontend: correct the realserver pools services [puppet] - 10https://gerrit.wikimedia.org/r/740859 [15:57:00] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32585/console" [puppet] - 10https://gerrit.wikimedia.org/r/740859 (owner: 10Giuseppe Lavagetto) [15:59:22] (03PS2) 10Giuseppe Lavagetto: thanos::frontend: correct the realserver pools services [puppet] - 10https://gerrit.wikimedia.org/r/740859 [15:59:52] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks, but I'd like Filippo to look too, since he's still the Swift expert." 
[puppet] - 10https://gerrit.wikimedia.org/r/740859 (owner: 10Giuseppe Lavagetto) [16:00:21] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32586/console" [puppet] - 10https://gerrit.wikimedia.org/r/740859 (owner: 10Giuseppe Lavagetto) [16:00:52] (03CR) 10MVernon: [C: 03+1] thanos::frontend: correct the realserver pools services [puppet] - 10https://gerrit.wikimedia.org/r/740859 (owner: 10Giuseppe Lavagetto) [16:01:17] (03CR) 10Reedy: [C: 04-1] "Wrong task number 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T29630) (owner: 10JMeybohm) [16:01:39] 10SRE, 10ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (10Cmjohnson) I am confused, msw1-eqiad in A8 is already an EX-4300 48T. Do we want to replace with the same switch? [16:02:17] (03PS3) 10JMeybohm: calico: Allow to configure the IPAM module [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) [16:02:34] (03CR) 10Filippo Giunchedi: [C: 03+1] thanos::frontend: correct the realserver pools services [puppet] - 10https://gerrit.wikimedia.org/r/740859 (owner: 10Giuseppe Lavagetto) [16:02:51] (03CR) 10JMeybohm: calico: Allow to configure the IPAM module (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) (owner: 10JMeybohm) [16:05:55] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] thanos::frontend: correct the realserver pools services [puppet] - 10https://gerrit.wikimedia.org/r/740859 (owner: 10Giuseppe Lavagetto) [16:06:57] PROBLEM - Host ms-be2058 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:43] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.reboot-vm: Query the cluster name from Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/740819 (owner: 10Muehlenhoff) [16:13:21] 10SRE, 10ops-eqiad, 10DC-Ops: (Need 
By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [16:13:28] !log updating mgmt switches in row C, racks C2-C8 eqiad T259758 [16:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:32] T259758: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 [16:14:12] (03PS2) 10Awight: VisualEditor template dialog: new sidebar and inline descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740766 (https://phabricator.wikimedia.org/T284203) [16:14:31] (03CR) 10Awight: VisualEditor template dialog: new sidebar and inline descriptions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740766 (https://phabricator.wikimedia.org/T284203) (owner: 10Awight) [16:16:04] PROBLEM - Host an-tool1010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:08] PROBLEM - Host cloudcephosd1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:09] PROBLEM - Host cloudcephosd1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:09] PROBLEM - Host cloudcephosd1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:10] PROBLEM - Host cloudcephosd1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:11] PROBLEM - Host cloudcephosd1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:12] PROBLEM - Host cloudcephosd1009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:13] PROBLEM - Host cloudcephosd1016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:14] PROBLEM - Host cloudcephosd1017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:15] PROBLEM - Host cloudcephosd1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:16] PROBLEM - Host cloudcephosd1021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:17] PROBLEM - Host cloudcephosd1022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:19] PROBLEM - Host cloudgw1001.mgmt is DOWN: PING 
CRITICAL - Packet loss = 100% [16:16:26] PROBLEM - Host cloudvirt1025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:26] PROBLEM - Host cloudvirt1026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:30] PROBLEM - Host cloudvirt1031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:30] PROBLEM - Host cloudvirt1033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:16:57] RECOVERY - Host cloudcephosd1017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms [16:17:25] PROBLEM - Host mw1413.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:17:37] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "^" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740766 (https://phabricator.wikimedia.org/T284203) (owner: 10Awight) [16:17:49] RECOVERY - Host cloudvirt1026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [16:18:35] RECOVERY - Host cloudvirt1033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.92 ms [16:18:39] PROBLEM - Host db1131.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:18:42] (03PS1) 10Muehlenhoff: sre.ganeti.reboot-vm: Pass as a string [cookbooks] - 10https://gerrit.wikimedia.org/r/740860 [16:19:39] RECOVERY - Host an-tool1010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.62 ms [16:19:39] RECOVERY - Host cloudgw1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [16:20:01] sorry, I thought down timing msw1 would stop the messages [16:21:31] RECOVERY - Host cloudcephosd1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [16:21:31] RECOVERY - Host cloudcephosd1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [16:21:32] RECOVERY - Host cloudcephosd1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.33 ms [16:21:32] RECOVERY - Host cloudcephosd1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [16:21:33] RECOVERY - Host cloudcephosd1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [16:21:34] RECOVERY - Host cloudcephosd1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.87 ms [16:21:35] 
RECOVERY - Host cloudcephosd1016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [16:21:36] RECOVERY - Host cloudcephosd1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [16:21:37] RECOVERY - Host cloudcephosd1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [16:21:38] RECOVERY - Host cloudcephosd1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [16:21:51] RECOVERY - Host cloudvirt1025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [16:21:55] RECOVERY - Host cloudvirt1031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [16:22:54] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.reboot-vm: Pass as a string [cookbooks] - 10https://gerrit.wikimedia.org/r/740860 (owner: 10Muehlenhoff) [16:22:55] RECOVERY - Host mw1413.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [16:24:05] RECOVERY - Host db1131.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.91 ms [16:28:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2011.codfw.wmnet with OS stretch [16:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:05] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe2011.codfw.wmnet with OS stretch completed: - ms-fe2011 (*... 
[16:29:29] PROBLEM - Host wdqs1013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:30:13] PROBLEM - Host elastic1043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:30:17] PROBLEM - Host db1146.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:30:33] PROBLEM - Host cloudmetrics1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:33:47] PROBLEM - Host an-worker1108.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:34:39] PROBLEM - Host mc1049.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:34:55] PROBLEM - Host ms-fe1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:35:21] RECOVERY - Host wdqs1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.87 ms [16:35:25] PROBLEM - Host ores1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:36:05] RECOVERY - Host elastic1043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [16:36:11] RECOVERY - Host db1146.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms [16:36:25] RECOVERY - Host ores1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.45 ms [16:36:28] RECOVERY - Host cloudmetrics1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [16:36:33] (03PS1) 10Jcrespo: mediabackups: Backup testcommonswiki [puppet] - 10https://gerrit.wikimedia.org/r/740862 (https://phabricator.wikimedia.org/T262668) [16:36:45] (03PS2) 10Jcrespo: mediabackups: Backup testcommonswiki [puppet] - 10https://gerrit.wikimedia.org/r/740862 (https://phabricator.wikimedia.org/T262668) [16:37:51] (03PS1) 10Muehlenhoff: sre.ganeti.reboot-vm: Only pass the hostname to netbox_server, not the entire FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/740863 [16:38:57] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup testcommonswiki [puppet] - 10https://gerrit.wikimedia.org/r/740862 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [16:39:21] PROBLEM - Host an-druid1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:39:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM 
pybal-test2003.codfw.wmnet [16:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:27] PROBLEM - Host analytics1066.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:39:29] PROBLEM - Host analytics1074.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:39:47] RECOVERY - Host an-worker1108.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.28 ms [16:40:43] RECOVERY - Host mc1049.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [16:40:59] RECOVERY - Host ms-fe1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [16:41:09] RECOVERY - Host analytics1066.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.84 ms [16:41:25] (03CR) 10Jbond: "PCC with option turned on" [puppet] - 10https://gerrit.wikimedia.org/r/740545 (owner: 10Jbond) [16:41:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM pybal-test2003.codfw.wmnet [16:41:34] (03CR) 10jerkins-bot: [V: 04-1] sre.ganeti.reboot-vm: Only pass the hostname to netbox_server, not the entire FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/740863 (owner: 10Muehlenhoff) [16:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:45] (03CR) 10Jbond: R:varnish:instance: Add genral public cloud rate limiting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [16:44:55] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:45:35] RECOVERY - Host an-druid1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [16:45:41] RECOVERY - Host analytics1074.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.04 ms [16:46:57] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:46:59] PROBLEM - 
Host an-worker1099.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:47:07] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:47:07] PROBLEM - Host logstash1034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:47:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM pybal-test2001.codfw.wmnet [16:47:09] PROBLEM - Host mc1045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:47:09] PROBLEM - Host mc1046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:47:09] PROBLEM - Host mc1047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:47:09] PROBLEM - Host mc1048.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:25] PROBLEM - Host ms-be1062.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:47:25] PROBLEM - Host ps1-c2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:47:25] PROBLEM - Host ms-be1050.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:47:25] PROBLEM - Host ms-be1049.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:47:27] PROBLEM - Host ms-be1066.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:47:51] PROBLEM - Host thanos-fe1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:03] (03PS2) 10Muehlenhoff: sre.ganeti.reboot-vm: Only pass the hostname to netbox_server, not the entire FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/740863 [16:48:13] PROBLEM - Host an-worker1088.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:19] PROBLEM - Host an-worker1104.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:19] PROBLEM - Host an-worker1111.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:27] PROBLEM - Host an-worker1131.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:27] PROBLEM - Host an-worker1132.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:49:13] RECOVERY - Host 
ps1-c2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.45 ms [16:49:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM pybal-test2001.codfw.wmnet [16:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:53] (03PS1) 10Muehlenhoff: Point irc.wikimedia.org to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/740864 (https://phabricator.wikimedia.org/T294119) [16:51:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2012.codfw.wmnet with OS stretch [16:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:04] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe2012.codfw.wmnet with OS stretch [16:52:05] RECOVERY - Host asw2-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [16:52:09] RECOVERY - Host logstash1034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [16:52:10] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.reboot-vm: Only pass the hostname to netbox_server, not the entire FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/740863 (owner: 10Muehlenhoff) [16:52:11] RECOVERY - Host mc1045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.14 ms [16:52:11] RECOVERY - Host mc1046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.79 ms [16:52:11] RECOVERY - Host mc1047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms [16:52:11] RECOVERY - Host mc1048.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.68 ms [16:52:17] RECOVERY - Host ms-be1062.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.92 ms [16:52:29] RECOVERY - Host ms-be1049.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [16:52:29] RECOVERY - Host ms-be1050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [16:52:29] RECOVERY - Host ms-be1066.mgmt is UP: PING OK - 
Packet loss = 0%, RTA = 1.85 ms [16:52:55] RECOVERY - Host thanos-fe1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [16:53:16] RECOVERY - Host an-worker1088.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [16:53:23] RECOVERY - Host an-worker1099.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.03 ms [16:53:23] RECOVERY - Host an-worker1104.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [16:53:23] RECOVERY - Host an-worker1111.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [16:53:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doc2001.codfw.wmnet [16:53:26] (03PS1) 10Majavah: rsyslogd: send cinder-backup to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/740865 [16:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:31] RECOVERY - Host an-worker1131.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [16:53:31] RECOVERY - Host an-worker1132.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms [16:53:56] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [16:55:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doc2001.codfw.wmnet [16:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM miscweb2002.codfw.wmnet [16:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:42] (03PS1) 10MSantos: wikifeeds: bump to 2021-11-17-200630-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/740888 [16:59:35] (03CR) 10Samuel (WMF): "Any updates on this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) (owner: 10Samuel (WMF)) [16:59:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM miscweb2002.codfw.wmnet [16:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211123T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:11] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:00:14] ✅ [17:00:26] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (10MatthewVernon) Account is created; I gather the usual approach is to instruct puppet to write a configuration file with the relevant details in it (taken from profile::thanos::swift::accoun... [17:02:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM mwdebug2001.codfw.wmnet [17:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:35] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) @Daimona Alright! sooo.. a consistency check on our side complains if the email in LDAP (wikitech) does not match the one we use in the admins module in puppet repo. Since I asked you to change... [17:04:09] (03CR) 10Andrew Bogott: [C: 03+2] rsyslogd: send cinder-backup to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/740865 (owner: 10Majavah) [17:04:52] 10SRE, 10SRE Observability, 10Wikimedia-Logstash, 10observability: Move wikimania-scholarships from udp2log to syslog - https://phabricator.wikimedia.org/T215499 (10Dzahn) There is no replacing but yea, it has been shut down and is in the process of being deleted. 
[17:05:29] (03PS3) 10Dzahn: logstash: remove scholarships type from udp2log filters [puppet] - 10https://gerrit.wikimedia.org/r/739662 (https://phabricator.wikimedia.org/T243037) [17:05:41] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.87 ms [17:05:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM mwdebug2001.codfw.wmnet [17:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:01] (03PS4) 10Dzahn: logstash: remove scholarships type from udp2log filters [puppet] - 10https://gerrit.wikimedia.org/r/739662 (https://phabricator.wikimedia.org/T243037) [17:09:10] (03PS1) 10Nray: Increase reading depth sampling rate to .1% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740892 (https://phabricator.wikimedia.org/T294777) [17:09:16] (03CR) 10Cwhite: [C: 03+1] logstash: remove scholarships type from udp2log filters [puppet] - 10https://gerrit.wikimedia.org/r/739662 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [17:09:19] (03PS1) 10Vgutierrez: haproxy:tls_teminator: Allow configuring http-use [puppet] - 10https://gerrit.wikimedia.org/r/740893 (https://phabricator.wikimedia.org/T290005) [17:10:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM mwdebug2002.codfw.wmnet [17:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:51] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [17:11:18] !log sukhe@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM doh2001.wikimedia.org [17:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:55] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32590/console" [puppet] - 10https://gerrit.wikimedia.org/r/740893 
(https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [17:13:18] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] haproxy:tls_teminator: Allow configuring http-use [puppet] - 10https://gerrit.wikimedia.org/r/740893 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [17:13:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=wikidough site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:13:24] PROBLEM - Bird Internet Routing Daemon on doh2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:13:36] ^ expected [17:13:48] PROBLEM - Bird Internet Routing Daemon on durum2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:14:02] PROBLEM - Bird Internet Routing Daemon on durum2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:14:09] ^ likewise, doh2* and durum2* [17:14:20] !log sukhe@cumin1001 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM doh2001.wikimedia.org [17:14:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:23] !log sukhe@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM doh2001.wikimedia.org [17:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM mwdebug2002.codfw.wmnet [17:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:01] !log sukhe@cumin1001 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM doh2001.wikimedia.org [17:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:27] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10RhinosF1) You cant change wikitech to volunteer or you'll get an alert that WMF access is being used by a non wikimedia email. 
[17:15:47] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on doh2001.wikimedia.org with reason: apply new KVM machine settings [17:15:48] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on doh2001.wikimedia.org with reason: apply new KVM machine settings [17:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:53] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on doh2002.wikimedia.org with reason: apply new KVM machine settings [17:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:55] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on doh2002.wikimedia.org with reason: apply new KVM machine settings [17:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:14] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on durum2001.codfw.wmnet with reason: apply new KVM machine settings [17:16:16] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on durum2001.codfw.wmnet with reason: apply new KVM machine settings [17:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:19] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on durum2002.codfw.wmnet with reason: apply new KVM machine settings [17:16:21] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on durum2002.codfw.wmnet with reason: apply new KVM machine settings [17:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:44] (03PS1) 10Majavah: puppet_alert: 
Condider zero resources a failure [puppet] - 10https://gerrit.wikimedia.org/r/740897 [17:17:49] (03CR) 10Dzahn: [C: 03+2] logstash: remove scholarships type from udp2log filters [puppet] - 10https://gerrit.wikimedia.org/r/739662 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [17:17:51] (03PS1) 10Vgutierrez: cache:haproxy: Set http-reuse to always [puppet] - 10https://gerrit.wikimedia.org/r/740898 (https://phabricator.wikimedia.org/T290005) [17:17:52] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:17:54] (03CR) 10Dzahn: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/739662 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [17:18:36] (03CR) 10Clare Ming: [C: 03+1] Increase reading depth sampling rate to .1% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740892 (https://phabricator.wikimedia.org/T294777) (owner: 10Nray) [17:18:53] (03PS2) 10Majavah: puppet_alert: Condider zero resources a failure [puppet] - 10https://gerrit.wikimedia.org/r/740897 [17:18:55] (03CR) 10Dzahn: [C: 03+2] "waited over night to make sure cloud VPS instances had time to remove this as well, merging now" [puppet] - 10https://gerrit.wikimedia.org/r/740693 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:19:36] RECOVERY - Bird Internet Routing Daemon on doh2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:20:12] RECOVERY - Bird Internet Routing Daemon on durum2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:21:14] (03PS1) 10Majavah: Replace wmflabs-project with wmcs-project in various scripts [puppet] - 10https://gerrit.wikimedia.org/r/740900 [17:21:23] (03CR) 10Dzahn: [C: 03+1] "this seems like 
we can deploy it, minus the nit comment I guess" [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm) [17:21:58] RECOVERY - Bird Internet Routing Daemon on durum2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:22:22] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:24:34] (03CR) 10Dzahn: [C: 03+1] "lgtm, at least on my instance in my project the content of /etc/wmcs-project and /etc/wmflabs-project is identical" [puppet] - 10https://gerrit.wikimedia.org/r/740900 (owner: 10Majavah) [17:25:10] PROBLEM - Check systemd state on mwdebug2002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2012.codfw.wmnet with OS stretch [17:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:30] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe2012.codfw.wmnet with OS stretch completed: - ms-fe2012 (*... [17:28:54] !log sukhe@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM doh2002.wikimedia.org [17:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:12] (03PS1) 10Dzahn: rename base/files/labs to base/files/cloud [puppet] - 10https://gerrit.wikimedia.org/r/740903 [17:30:07] (03CR) 10Dzahn: "I did not rename the Hiera keys in this change, that seemed too tricky given how Hiera can be in many different places." 
[puppet] - 10https://gerrit.wikimedia.org/r/740903 (owner: 10Dzahn) [17:31:48] !log upgrading msw's in row D eqiad T259758 [17:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:52] T259758: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 [17:33:06] PROBLEM - Check systemd state on ms-fe2010 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:36] !log sukhe@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh2002.wikimedia.org [17:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:17] !log sukhe@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM durum2001.codfw.wmnet [17:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:22] PROBLEM - Host mw1360.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:34:22] PROBLEM - Host mw1362.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:34:32] PROBLEM - Host mw1352.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:35:00] PROBLEM - Host rdb1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:35:10] PROBLEM - Host elastic1061.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:35:41] !log T295478 start snapshot of commonswiki_file from cirrus codfw -> swift eqiad [17:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:45] T295478: Searching on Special:Search and MediaSearch on Commons returns error - https://phabricator.wikimedia.org/T295478 [17:39:19] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) Well, then please escalate to infra foundations at this point please. 
[17:40:11] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32591/console" [puppet] - 10https://gerrit.wikimedia.org/r/740898 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [17:40:28] RECOVERY - Host mw1360.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [17:40:28] RECOVERY - Host mw1362.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms [17:41:08] RECOVERY - Host rdb1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [17:41:16] RECOVERY - Host elastic1061.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [17:42:06] !log sukhe@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum2001.codfw.wmnet [17:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:12] RECOVERY - Host mw1352.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.48 ms [17:43:56] PROBLEM - Check systemd state on durum2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:29] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache:haproxy: Set http-reuse to always [puppet] - 10https://gerrit.wikimedia.org/r/740898 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [17:44:30] PROBLEM - Host elastic1062.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:12] (03CR) 10Legoktm: [C: 03+1] Point irc.wikimedia.org to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/740864 (https://phabricator.wikimedia.org/T294119) (owner: 10Muehlenhoff) [17:45:22] PROBLEM - Host dbstore1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:22] PROBLEM - Host kubernetes1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:22] PROBLEM - Host kubernetes1013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:22] PROBLEM - Host pki-root1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:23] (03PS3) 10Dzahn: 
wikimania_scholarships: delete module and profile, remove from miscweb [puppet] - 10https://gerrit.wikimedia.org/r/739658 (https://phabricator.wikimedia.org/T243037) [17:45:26] PROBLEM - Host snapshot1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:49] (03CR) 10Dzahn: [C: 03+2] wikimania_scholarships: delete module and profile, remove from miscweb [puppet] - 10https://gerrit.wikimedia.org/r/739658 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [17:46:00] RECOVERY - Host elastic1062.mgmt is UP: PING OK - Packet loss = 0%, RTA = 10.89 ms [17:46:16] PROBLEM - Host rdb1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:46:46] PROBLEM - Host an-druid1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:47:26] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10ssingh) [17:47:38] (03CR) 10Dzahn: "noop on miscweb1002/2002" [puppet] - 10https://gerrit.wikimedia.org/r/739658 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [17:47:47] !log sukhe@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM durum2002.codfw.wmnet [17:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:14] RECOVERY - Host kubernetes1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.93 ms [17:48:22] RECOVERY - Host an-druid1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms [17:48:52] RECOVERY - Host snapshot1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.02 ms [17:49:17] !log miscweb1002 - rm -rf /srv/deployments/scholarships (T243037) [17:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:21] T243037: Shutdown scholarships.wikimedia.org and archive project - https://phabricator.wikimedia.org/T243037 [17:49:39] (03CR) 10Dzahn: "< mutante> !log miscweb1002 - rm -rf /srv/deployments/scholarships (T243037)" [puppet] - 10https://gerrit.wikimedia.org/r/739658 
(https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [17:51:12] RECOVERY - Host dbstore1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms [17:51:22] PROBLEM - Host db1114.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:24] RECOVERY - Host kubernetes1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [17:51:44] RECOVERY - Host rdb1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [17:51:46] RECOVERY - Host pki-root1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [17:53:18] PROBLEM - Host cloudgw1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:18] PROBLEM - Host cloudvirt1028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:18] PROBLEM - Host cloudvirt1029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:19] PROBLEM - Host cloudvirt1036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:19] PROBLEM - Host cloudvirt1037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:20] PROBLEM - Host cloudvirt1046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:24] PROBLEM - Host ganeti1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:54:13] (03CR) 10MSantos: [C: 03+2] wikifeeds: bump to 2021-11-17-200630-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/740888 (owner: 10MSantos) [17:54:52] (03PS1) 10Dzahn: cache/text_haproxy: remove scholarships.wikimedia.org config [puppet] - 10https://gerrit.wikimedia.org/r/740907 (https://phabricator.wikimedia.org/T243037) [17:55:30] PROBLEM - Host cloudvirt1042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:55:36] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "Host scholarships.wikimedia.org not found: 3(NXDOMAIN)" [puppet] - 10https://gerrit.wikimedia.org/r/740907 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [17:55:37] !log sukhe@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum2002.codfw.wmnet [17:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:56:00] PROBLEM - Host cloudvirt1045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:56:12] RECOVERY - Host cloudvirt1036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [17:56:27] RECOVERY - Host db1114.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [17:56:54] RECOVERY - Host cloudvirt1037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [17:57:48] (03Merged) 10jenkins-bot: wikifeeds: bump to 2021-11-17-200630-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/740888 (owner: 10MSantos) [17:58:04] PROBLEM - Host an-test-worker1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:14] PROBLEM - Host mw1367.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:14] PROBLEM - Host mw1366.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:14] PROBLEM - Host mw1368.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:14] PROBLEM - Host mw1369.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:14] PROBLEM - Host mw1370.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:14] PROBLEM - Host mw1371.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:14] PROBLEM - Host mw1373.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:15] PROBLEM - Host mw1372.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:15] PROBLEM - Host mw1374.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:16] PROBLEM - Host mw1375.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:16] PROBLEM - Host mw1377.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:17] PROBLEM - Host mw1376.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:17] PROBLEM - Host mw1378.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:18] PROBLEM - Host mw1380.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:18] PROBLEM - Host mw1379.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:19] PROBLEM - Host mw1381.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:19] PROBLEM - Host mw1382.mgmt is DOWN: PING CRITICAL 
- Packet loss = 100% [17:58:20] PROBLEM - Host aqs1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:30] RECOVERY - Host cloudgw1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms [17:58:30] RECOVERY - Host cloudvirt1029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms [17:58:31] RECOVERY - Host cloudvirt1028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms [17:58:31] RECOVERY - Host cloudvirt1046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.24 ms [17:58:36] RECOVERY - Host ganeti1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [17:59:00] PROBLEM - Host es1023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:05] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [17:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:08] RECOVERY - Host mw1371.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.33 ms [18:00:04] chrisalbon and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211123T1800). 
[18:00:42] RECOVERY - Check systemd state on durum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:43] !log systemctl reset-failed ifup@ens5.service on durum2001 T273026 [18:00:44] RECOVERY - Host cloudvirt1042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [18:00:44] RECOVERY - Host mw1375.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.04 ms [18:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:50] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [18:01:00] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:01:14] RECOVERY - Host cloudvirt1045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [18:01:41] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [18:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:30] RECOVERY - Host an-test-worker1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.24 ms [18:03:40] RECOVERY - Host mw1367.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [18:03:40] RECOVERY - Host mw1366.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [18:03:40] RECOVERY - Host mw1368.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [18:03:40] RECOVERY - Host mw1369.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [18:03:40] RECOVERY - Host mw1370.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [18:03:40] RECOVERY - Host mw1373.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms [18:03:40] RECOVERY - Host mw1372.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [18:03:41] RECOVERY - Host mw1374.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [18:03:41] RECOVERY - Host mw1377.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [18:03:42] RECOVERY - Host mw1378.mgmt is UP: PING OK - Packet loss = 
0%, RTA = 1.00 ms [18:03:42] RECOVERY - Host mw1380.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [18:03:43] RECOVERY - Host mw1379.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [18:03:43] RECOVERY - Host mw1381.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [18:03:44] RECOVERY - Host mw1382.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [18:03:44] RECOVERY - Host aqs1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.39 ms [18:04:09] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [18:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:24] RECOVERY - Host es1023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [18:07:58] (03PS1) 10MSantos: push-notifications: bump to 2021-11-17-193343-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/740909 [18:10:02] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.70 ms [18:14:31] (03CR) 10MSantos: [C: 03+2] push-notifications: bump to 2021-11-17-193343-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/740909 (owner: 10MSantos) [18:14:32] PROBLEM - Check systemd state on ms-fe2011 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:53] (03Merged) 10jenkins-bot: push-notifications: bump to 2021-11-17-193343-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/740909 (owner: 10MSantos) [18:18:14] !log upgrading msw-c1-eqiad T259758 [18:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:18] T259758: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 [18:18:29] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[18:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:48] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: ms-fe2011, ms-fe2010, cloudcephmon1002, cloudcephmon1003, cloudcephmon1001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [18:20:46] (03PS1) 10Majavah: Delete roles for bare metal WMCS puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/740913 [18:22:34] RECOVERY - Host mw1376.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms [18:28:44] (03PS1) 10Majavah: openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) [18:29:12] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [18:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:31] (03CR) 10jerkins-bot: [V: 04-1] openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [18:32:04] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10Papaul) [18:32:33] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10Papaul) [18:34:41] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10Papaul) 05Open→03Resolved complete [18:36:24] (03CR) 10Majavah: openstack: refactor puppetmaster access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [18:37:44] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: 
Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10ssingh) [18:45:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:49:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:53:16] (03PS2) 10Majavah: openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) [18:55:03] (03PS3) 10Majavah: openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) [18:56:54] (03CR) 10jerkins-bot: [V: 04-1] openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [19:00:04] RoanKattouw and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211123T1900). [19:00:04] jdrewniak, tgr, and nray: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:12] hello o/ [19:00:33] * urbanecm waves [19:01:13] o/ [19:01:20] i can deploy today [19:01:52] thank you urbanecm [19:01:57] jan_drewniak: am i missing something, or are some of the SVGs not optimized? 
[19:02:01] like static/images/mobile/copyright/pl-wordmark.svg [19:02:05] looks to have a lot of whitespace [19:02:18] (03CR) 10Urbanecm: [C: 03+2] Increase reading depth sampling rate to .1% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740892 (https://phabricator.wikimedia.org/T294777) (owner: 10Nray) [19:02:25] tgr: around? [19:03:08] (03Merged) 10jenkins-bot: Increase reading depth sampling rate to .1% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740892 (https://phabricator.wikimedia.org/T294777) (owner: 10Nray) [19:03:37] urbanecm: thanks for pointing that out. For some reason I thought Volker uploaded a new patch with the optimized logos. [19:04:09] nray: pulled to mwdebug1001, can you have a look? [19:04:15] jan_drewniak: can you fix it? :-) [19:04:15] yes, thank you [19:04:24] thanks nray, let me know how it goes [19:05:01] urbanecm: not in time for this backport. That's fine I'll schedule it for tomorrow [19:05:09] okay, sounds good [19:05:10] thanks for pointing that out! [19:05:13] any time! [19:05:20] do you want me to put the note on the gerrit patch too? [19:06:20] ah, wait wait, patchset 3 was Volker approved. [19:06:45] (03PS4) 10Majavah: openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) [19:06:53] The SVGs have some whitespace but it's not significant. I think it's there for legibility [19:07:06] urbanecm: things look good, you can proceed! [19:07:13] nray: thanks, syncing [19:08:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:36] (03CR) 10jerkins-bot: [V: 04-1] openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [19:08:49] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 3993aacbfdbbfb6cdcc198ce369bf08b32ace865: Increase reading depth sampling rate to .1% (T294777) (duration: 00m 57s) [19:08:51] (03PS1) 10Legoktm: Revert "Add echo-cross-wiki-notifications to DefaultUserOptions" [extensions/Echo] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740872 (https://phabricator.wikimedia.org/T296270) [19:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:52] T294777: Restore reading depth schema - https://phabricator.wikimedia.org/T294777 [19:08:56] nray: should be live [19:09:06] thanks urbanecm I appreciate it! [19:09:09] np [19:10:09] urbanecm: can I add the above Echo patch into your queue? Or whenever you're done I can do it myself [19:10:18] urbanecm: here [19:10:26] legoktm: feel free to +2 it, and I'll ping you :) [19:10:39] (03CR) 10Legoktm: [C: 03+2] Revert "Add echo-cross-wiki-notifications to DefaultUserOptions" [extensions/Echo] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740872 (https://phabricator.wikimedia.org/T296270) (owner: 10Legoktm) [19:10:43] urbanecm: can we do the logos today in that case? [19:11:09] jan_drewniak: i was reviewing it, sorry. 
i checked some other SVGs, and the little whitespace sounds to be a norm, so let's do it [19:11:11] 10Puppet, 10Infrastructure-Foundations, 10Testing-Roadblocks: Allow using WMCS hiera lookup order in Puppet rspec tests - https://phabricator.wikimedia.org/T296327 (10Majavah) [19:11:14] (03PS6) 10Urbanecm: Add new icons, wordmarks & taglines for several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091) (owner: 10Clare Ming) [19:11:19] (03CR) 10Urbanecm: [C: 03+2] Add new icons, wordmarks & taglines for several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091) (owner: 10Clare Ming) [19:11:27] 10Puppet, 10Infrastructure-Foundations, 10Testing-Roadblocks: Allow using WMCS hiera lookup order in Puppet rspec tests - https://phabricator.wikimedia.org/T296327 (10Majavah) [19:11:31] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10User-dcaro, and 2 others: Add more rspec test to the puppet code - https://phabricator.wikimedia.org/T289668 (10Majavah) [19:11:39] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10Majavah) [19:11:43] (03CR) 10Urbanecm: [C: 03+2] Cherry-picked small fixes [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740694 (owner: 10Gergő Tisza) [19:11:49] (03CR) 10Urbanecm: [C: 03+2] Structured task caching/filtering cherry-picks, step 1 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740774 (owner: 10Gergő Tisza) [19:11:56] hey tgr, at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/740775, wouldn't adding the param to extension.json cause errors in the API? [19:12:01] or is passing unrecognized services fine? 
[19:12:06] PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:08] (03Merged) 10jenkins-bot: Add new icons, wordmarks & taglines for several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091) (owner: 10Clare Ming) [19:12:26] 10Puppet, 10Infrastructure-Foundations, 10Testing-Roadblocks, 10User-jbond: Allow using WMCS hiera lookup order in Puppet rspec tests - https://phabricator.wikimedia.org/T296327 (10jbond) p:05Triage→03High [19:12:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:35] jan_drewniak: mwdebug1001 has the patch. can you check? [19:13:13] urbanecm: yup, checking now [19:13:18] thanks [19:13:25] urbanecm: passing an *unrecognized* service is definitely bad. But unless I messed up step 1 includes everything needed for the service to be recognized. 
[19:13:43] 10Puppet, 10Infrastructure-Foundations, 10Testing-Roadblocks, 10User-jbond: Allow using WMCS hiera lookup order in Puppet rspec tests - https://phabricator.wikimedia.org/T296327 (10jbond) p:05High→03Medium [19:13:43] and passing an unexpected extra parameter is generally fine in PHP [19:13:44] sorry, i meant unexpected by the API [19:13:55] (03PS5) 10Majavah: openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) [19:14:32] if you say that's fine, let's d oit [19:14:34] *do it [19:14:39] (03CR) 10Urbanecm: [C: 03+2] Structured task caching/filtering cherry-picks step 2 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740775 (owner: 10Gergő Tisza) [19:14:45] (03CR) 10Urbanecm: [C: 03+2] Structured task caching/filtering cherry-picks step 3 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740776 (owner: 10Gergő Tisza) [19:15:01] (03CR) 10Urbanecm: [C: 03+2] Add Image: Validate GEInfoboxTemplates size [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740777 (https://phabricator.wikimedia.org/T294518) (owner: 10Gergő Tisza) [19:15:13] AFAIK passing more arguments than the function signature is always OK. Before the splat/spread operator that was the mechanism for vararg functions and it's still kept for B/C. 
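The point tgr makes about extra arguments can be illustrated with a minimal PHP sketch. It is not code from GrowthExperiments: the function names (describeTask, sumLegacy, sumVariadic) and the sample values are hypothetical, chosen only to show the behaviour. It applies to user-defined functions; PHP's built-in functions are stricter about argument counts.

```php
<?php
// Hypothetical function, for illustration only (not GrowthExperiments code).
// It declares two parameters, but callers may pass more without an error.
function describeTask(string $type, string $topic): string {
    // func_num_args() reports how many arguments were actually passed.
    return sprintf('%s/%s (%d args received)', $type, $topic, func_num_args());
}

// Older vararg style: no declared parameters, everything is read
// through func_get_args().
function sumLegacy() {
    return array_sum(func_get_args());
}

// Variadic style using the splat operator (PHP 5.6 and later).
function sumVariadic(int ...$numbers): int {
    return array_sum($numbers);
}

echo describeTask('link-recommendation', 'art', 'unexpected-extra'), "\n";
echo sumLegacy(1, 2, 3), "\n";
echo sumVariadic(1, 2, 3), "\n";
```

The first call passes a third argument that describeTask never declares; PHP accepts it silently, which is why adding a parameter on the calling side before the callee knows about it does not break by itself.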
[19:15:39] legoktm: your revert will have a CI failure [19:15:42] tgr: ack [19:15:46] thanks [19:15:48] bleh [19:15:49] urbanecm: ok looks good [19:15:56] thanks jan_drewniak, syncing [19:16:50] (03CR) 10Majavah: openstack: refactor puppetmaster access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [19:17:52] !log urbanecm@deploy1002 Synchronized static/images/mobile/copyright/: bf82bfb3ddcaff04a1e90abc435ccb26f792780c: Add new icons, wordmarks & taglines for several wikis (T290091; 1/2) (duration: 00m 56s) [19:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:56] T290091: Create logos and prepare new set of pilot wikis for deployment - https://phabricator.wikimedia.org/T290091 [19:18:20] legoktm: I'm not sure what "bleh" means precisely, but i assume you're on it [19:18:48] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bf82bfb3ddcaff04a1e90abc435ccb26f792780c: Add new icons, wordmarks & taglines for several wikis (T290091; 2/2) (duration: 00m 56s) [19:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:56] jan_drewniak: should be live [19:19:04] yes, I think master itself is broken [19:19:39] great :) (that you're on it) [19:19:42] !bash legoktm: I'm not sure what "bleh" means precisely, but i assume you're on it [19:19:43] majavah: Stored quip at https://bash.toolforge.org/quip/MOc9Tn0Ba_6PSCT9WA03 [19:19:52] https://www.youtube.com/watch?v=cTmm-DEs0ko [19:20:42] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:20:56] PROBLEM - MariaDB Replica Lag: s8 on db1171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1244.71 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:21:04] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1254.68 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:22:54] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:23:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:40] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:27:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:47] (03CR) 10jerkins-bot: [V: 04-1] Revert "Add echo-cross-wiki-notifications to DefaultUserOptions" [extensions/Echo] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740872 (https://phabricator.wikimedia.org/T296270) (owner: 10Legoktm) [19:29:47] (03PS1) 10Legoktm: Suppress SecurityCheck-DoubleEscaped in DiscussionParser [extensions/Echo] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740873 [19:29:59] (03PS2) 10Legoktm: Revert "Add echo-cross-wiki-notifications to DefaultUserOptions" [extensions/Echo] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740872 (https://phabricator.wikimedia.org/T296270) [19:33:33] ok, phan is passing now [19:34:00] legoktm: so, we should give it a second +2 try? 
[19:34:13] yes [19:34:18] (03CR) 10Legoktm: [C: 03+2] Suppress SecurityCheck-DoubleEscaped in DiscussionParser [extensions/Echo] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740873 (owner: 10Legoktm) [19:35:42] (03Merged) 10jenkins-bot: Cherry-picked small fixes [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740694 (owner: 10Gergő Tisza) [19:36:31] tgr: first patch is at mwdebug1001 if you want to have a look [19:36:47] (I plan to sync files individually and then do one sync-world at the end for i18n bits) [19:37:23] sure [19:37:34] sounds like a good plan [19:37:45] great :) [19:38:45] and...more CI errors! [19:38:46] https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-selenium-docker/89298/console [19:38:49] i don't like em :/ [19:38:56] but since it's selenium, i guess try again? [19:40:18] mwdebug looks good at a glance [19:40:21] (03Merged) 10jenkins-bot: Structured task caching/filtering cherry-picks, step 1 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740774 (owner: 10Gergő Tisza) [19:40:22] :( that looks like a WikibaseLexeme error, you can probably try again yeah [19:40:25] (03CR) 10jerkins-bot: [V: 04-1] Structured task caching/filtering cherry-picks step 2 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740775 (owner: 10Gergő Tisza) [19:40:31] (03CR) 10jerkins-bot: [V: 04-1] Structured task caching/filtering cherry-picks step 3 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740776 (owner: 10Gergő Tisza) [19:40:37] (03CR) 10jerkins-bot: [V: 04-1] Add Image: Validate GEInfoboxTemplates size [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740777 (https://phabricator.wikimedia.org/T294518) (owner: 10Gergő Tisza) [19:40:39] tgr: thanks, syncing [19:41:24] the selenium error is something in lexeme [19:41:39] yup, per Lucas, I'll +2 it again 
[19:41:59] (03CR) 10Urbanecm: [C: 03+2] Structured task caching/filtering cherry-picks step 2 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740775 (owner: 10Gergő Tisza) [19:42:09] (03CR) 10Urbanecm: Structured task caching/filtering cherry-picks step 3 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740776 (owner: 10Gergő Tisza) [19:42:15] (03CR) 10Urbanecm: [C: 03+2] Structured task caching/filtering cherry-picks step 3 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740776 (owner: 10Gergő Tisza) [19:42:20] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/: c26e407118e1cd8e1e3fea6e2f4e3e43a609ea62: GrowthExperiments backports (duration: 01m 03s) [19:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:37] and first batch live [19:42:39] this is the simple patch, in any case [19:42:50] (03CR) 10Urbanecm: [C: 03+2] Add Image: Validate GEInfoboxTemplates size [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740777 (https://phabricator.wikimedia.org/T294518) (owner: 10Gergő Tisza) [19:42:50] the three-step one is slightly risky [19:43:18] yeah [19:44:13] step 1 is on mwdebug1001 right? seems to be working [19:44:38] that's weird, because i just pulled it there [19:44:44] i missed it got merged as we talked [19:44:56] oh, I just checked that nothing is broken [19:45:01] ah [19:45:04] can you check again please? [19:45:19] I don't think the functionality is meaningfully split between the three steps [19:47:01] so should i just sync? [19:47:13] or wait for all three to merge, test and sync as quickly as possible? 
[19:47:30] scap sync-quickly [19:47:37] if that only existed [19:47:54] still working [19:48:11] no, it should be synced separately [19:48:32] so sync now i guess [19:48:35] otherwise it might blow up if dependencies get synced in the wrong order [19:48:38] yes [19:49:18] yeah, if i was waiting for everything to merge, I'd need some git-fu to do the deployments [19:49:21] anyway, sycing [19:49:24] *syncing [19:49:30] I don't *think* the patches include any super-common hook like SpecialPage_initList that we got into trouble with the last time, but better safe than sorry [19:49:45] true [19:51:01] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/: 7d5f779a73594bb11f359bda055f2c7af8e92feb: Structured task caching/filtering cherry-picks, step 1 (duration: 00m 56s) [19:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:23] second part live [19:52:23] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:42] looks good [19:53:08] waiting for CI now [19:54:10] legoktm: according to https://integration.wikimedia.org/zuul/, your patch will be merged soon, and looks like I'll have some 20 minutes after that, as the other GE backports are queued now. [19:54:24] i think you should go ahead once echo merges, and lmk once you're done [19:54:30] ok [19:57:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q2:(Need By: TBD) rack/setup/install civi1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T292767 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [19:58:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:15] (Merged) jenkins-bot: Suppress SecurityCheck-DoubleEscaped in DiscussionParser [extensions/Echo] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740873 (owner: Legoktm) [20:01:23] (Merged) jenkins-bot: Revert "Add echo-cross-wiki-notifications to DefaultUserOptions" [extensions/Echo] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740872 (https://phabricator.wikimedia.org/T296270) (owner: Legoktm) [20:01:31] whee [20:02:42] verified on mwdebug1001 [20:04:21] !log legoktm@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/Echo/: re-enable cross-wiki notifications by default (T296270) (duration: 00m 57s) [20:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:25] T296270: Cross-wiki notifications don't appear to be working - https://phabricator.wikimedia.org/T296270 [20:04:37] notifications working again over here too \o/ [20:04:51] :D [20:07:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:49] urbanecm: I'm all done [20:08:01] great [20:08:08] * urbanecm is waiting for CI [20:09:10] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q2:(Need By: TBD) rack/setup/install civi1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T292767 (Cmjohnson) [20:10:14] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:11:18] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q2:(Need By: TBD) rack/setup/install civi1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T292767 (Cmjohnson) Open→Resolved @Jgreen this server is ready for you, BIOS setup, idrac setup (mgmt password is generic), ports have been ena...
[20:12:11] (PS1) Dduvall: mediawiki: Install yaml extension for use by SettingsBuilder [puppet] - https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) [20:16:26] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [20:20:12] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:21:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:05] (Merged) jenkins-bot: Structured task caching/filtering cherry-picks step 2 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740775 (owner: Gergő Tisza) [20:25:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:25:09] (Merged) jenkins-bot: Structured task caching/filtering cherry-picks step 3 [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740776 (owner: Gergő Tisza) [20:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:15] (Merged) jenkins-bot: Add Image: Validate GEInfoboxTemplates size [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/740777 (https://phabricator.wikimedia.org/T294518) (owner: Gergő Tisza) [20:41:59] SRE, SRE-swift-storage, ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (Papaul) a:Papaul [20:42:53] SRE, SRE-swift-storage, ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (Papaul) Dell Tech Support to me Hello Papaul, The support assist logs show that there are firmware updates that need to be installed.
Would you please install the dell system update utility then... [20:44:23] urbanecm: I think the remaining patches are ready to go [20:45:00] Oh, sorry, stopped paying attention tgr [20:45:19] I can take over if you prefer [20:45:37] That'd be great tgr [20:45:58] thanks for the deployments so far! [20:50:25] (PS1) Mforns: analytics:refinery:job:refine_sanitize: Fix refine_monitor offsets [puppet] - https://gerrit.wikimedia.org/r/740931 [20:50:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:30] SRE, SRE-swift-storage, ops-codfw, DC-Ops: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet - https://phabricator.wikimedia.org/T295563 (Papaul) Create Dispatch: Success You have successfully submitted request SR1076681445. [20:53:32] RECOVERY - Check systemd state on ms-fe2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:13] SRE, ops-codfw, DC-Ops, Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (Papaul) [20:59:48] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [21:00:02] PROBLEM - Check systemd state on ms-fe2011 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:32] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: ms-fe2011, cloudcephmon1003, cloudcephmon1001, ms-fe2010, ms-fe2012, cloudcephmon1002 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [21:01:02] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:02:00] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [21:04:35] SRE, ops-eqiad, DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (Cmjohnson) All of the mgmt cables have been moved to the new switches, netbox updates are still needed.
[21:12:44] PROBLEM - DNS on mw1376.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.2.135 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:17:12] (PS1) Papaul: Add wdqs200[9,10,11,12} to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/740936 (https://phabricator.wikimedia.org/T294297) [21:23:16] (CR) Papaul: [C: +2] Add wdqs200[9,10,11,12} to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/740936 (https://phabricator.wikimedia.org/T294297) (owner: Papaul) [21:26:54] RECOVERY - MariaDB Replica Lag: s8 on db1171 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:28:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2009.codfw.wmnet with OS buster [21:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:24] SRE, ops-codfw, DC-Ops, Wikidata, and 3 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2009.codfw.wmnet with OS buster [21:35:27] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments: Backport: [[gerrit:740775|Structured task caching/filtering cherry-picks step 2]] (duration: 00m 57s) [21:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:05] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/includes/Api/ApiQueryGrowthTasks.php: Backport: [[gerrit:740776|Structured task caching/filtering cherry-picks step 3]] (duration: 00m 55s) [21:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:20] (PS1) Andrew Bogott: cinder.conf: Increase rpc timeout to 5 minute.
[puppet] - https://gerrit.wikimedia.org/r/740938 [21:40:41] (CR) Andrew Bogott: [C: +2] cinder.conf: Increase rpc timeout to 5 minute. [puppet] - https://gerrit.wikimedia.org/r/740938 (owner: Andrew Bogott) [21:47:15] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments: Backport: [[gerrit:740777|Add Image: Validate GEInfoboxTemplates size (T294518)]] (duration: 00m 56s) [21:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:19] T294518: Add Image: Allow disabling feature via community configuration - https://phabricator.wikimedia.org/T294518 [21:47:44] !log tgr@deploy1002 Started scap: (no justification provided) [21:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:49] (PS1) Papaul: Fix partman config for new codfw wdqs nodes [puppet] - https://gerrit.wikimedia.org/r/740939 (https://phabricator.wikimedia.org/T294297) [21:52:07] (CR) Papaul: [C: +2] Fix partman config for new codfw wdqs nodes [puppet] - https://gerrit.wikimedia.org/r/740939 (https://phabricator.wikimedia.org/T294297) (owner: Papaul) [21:53:24] !log krinkle@deploy1002 Started deploy [integration/docroot@a3435a7]: (no justification provided) [21:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:32] !log krinkle@deploy1002 Finished deploy [integration/docroot@a3435a7]: (no justification provided) (duration: 00m 07s) [21:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:06] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2009.codfw.wmnet with OS buster [21:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:18] SRE, ops-codfw, DC-Ops, Wikidata, and 3 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (ops-monitoring-bot) Cookbook
cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2009.codfw.wmnet with OS buster executed wi... [21:57:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2009.codfw.wmnet with OS buster [21:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:48] !log tgr@deploy1002 Finished scap: (no justification provided) (duration: 10m 03s) [21:57:50] SRE, ops-codfw, DC-Ops, Wikidata, and 3 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2009.codfw.wmnet with OS buster [21:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:37] !log UTC evening deploys done [21:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:43] took a while [21:59:12] wow [21:59:44] SRE: Cronspam from acmechief-test1001 - https://phabricator.wikimedia.org/T295770 (Aklapper) [22:28:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2009.codfw.wmnet with OS buster [22:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:21] SRE, ops-codfw, DC-Ops, Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2009.codfw.wmnet with OS buster completed:...
[22:37:59] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (7) node(s) change every puppet run: cloudcephmon1002, cloudcephmon1001, ms-fe2011, ms-fe2012, ms-fe2010, wdqs2009, cloudcephmon1003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [22:40:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2010.codfw.wmnet with OS buster [22:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:02] SRE, ops-codfw, DC-Ops, Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2010.codfw.wmnet with OS buster [22:44:49] (CR) Urbanecm: [DNM] snapshot: Dump information about Growth mentorship (1 comment) [puppet] - https://gerrit.wikimedia.org/r/740371 (https://phabricator.wikimedia.org/T291966) (owner: Urbanecm) [23:11:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2010.codfw.wmnet with OS buster [23:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:30] SRE, ops-codfw, DC-Ops, Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2010.codfw.wmnet with OS buster completed:...
[23:12:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2011.codfw.wmnet with OS buster [23:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:09] SRE, ops-codfw, DC-Ops, Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2011.codfw.wmnet with OS buster [23:35:00] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [23:39:18] SRE, MediaWiki-extensions-CodeReview, Platform Engineering, serviceops-radar, Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (Dzahn) [23:40:16] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.90 ms [23:42:53] SRE, SRE Observability, Wikimedia-Logstash, observability: Log lines on flourine overflow at 8092 bytes. - https://phabricator.wikimedia.org/T114849 (Dzahn) We don't use hhvm anymore. fluorine does not exist anymore. It had been said years ago " In light of that perhaps this ticket is should be... [23:43:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2011.codfw.wmnet with OS buster [23:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:21] SRE, ops-codfw, DC-Ops, Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2011.codfw.wmnet with OS buster completed:...
[23:43:52] SRE, Wikidata, Wikidata-Query-Service, wdwb-tech, Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (EBernhardson) Started another round of imports today to see how it goes. If it doesn't fall over might as well call this done... [23:44:40] SRE, MediaWiki-Documentation, serviceops-radar, Documentation, and 2 others: Repair "svn.wikimedia.org/doc/" redirect for doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (Dzahn) [23:49:13] SRE, SRE Observability, Wikimedia-Logstash, observability: Log lines on flourine overflow at 8092 bytes. - https://phabricator.wikimedia.org/T114849 (EBernhardson) Open→Declined might as well close it, this should mostly be irrelevant infrastructure. [23:50:53] SRE, MW-on-K8s, Shellbox, serviceops, and 3 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling) [23:53:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2012.codfw.wmnet with OS buster [23:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:50] SRE, ops-codfw, DC-Ops, Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2012.codfw.wmnet with OS buster