[01:22:25] serviceops, Data-Persistence-Backup, serviceops-collab, GitLab (Infrastructure), and 2 others: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (Dzahn) ` 01:04 <+icinga-wm> PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed:...
[02:19:27] serviceops, Performance-Team, SRE, Traffic: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[02:19:48] serviceops, Performance-Team, SRE, Traffic: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I edited the task description with a proposed rollout plan, and I renamed the task to encompass the actual work, not just deciding on the work.
[07:20:05] (ProbeDown) firing: (8) Service gitlab1001:443 has failed probes (http_gitlab_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:20:10] serviceops, serviceops-collab: ProbeDown - https://phabricator.wikimedia.org/T312194 (phaultfinder)
[07:26:38] serviceops, serviceops-collab: ProbeDown - https://phabricator.wikimedia.org/T312194 (fgiunchedi) I recently fixed the `prometheus::blackbox::check::http` definition to do the right thing and honor the `team` label. This is the result (i.e. working as expected). At any rate, please check the alerts for...
[07:46:18] the above alert has a page hashtag, but didn't actually page, is that expected?
[07:46:38] is also firing just in this chan AFAICT
[07:47:06] <_joe_> volans: yes, I think mutante and jelto set it up with different paging rules
[07:55:48] give me some time to catch up on what's configured here and what's happening. gitlab1001 is not in use anymore, so that's not critical. I'll find out if the hashtag is needed
[09:14:29] serviceops, Patch-For-Review, Performance-Team (Radar): Install php 7.4 in production - https://phabricator.wikimedia.org/T311386 (Joe)
[10:12:47] Hi, there is one last action item for this ticket for maps: https://phabricator.wikimedia.org/T305845#8048200 Can somebody help us with depooling/pooling the prod service?
[11:20:20] (ProbeDown) firing: (8) Service gitlab1001:443 has failed probes (http_gitlab_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:47:11] nemo-yiannis: yup. I can do it right now, is that ok?
[12:01:28] yes
[13:18:20] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (JMeybohm)
[14:15:23] serviceops, Patch-For-Review: Put conf100[789] in production - https://phabricator.wikimedia.org/T311407 (akosiaris) Adding @ottomata too since conf100* hosts also run zookeeper
[14:17:05] !log pool codfw for kartotherian T305845
[14:17:32] wrong channel, done again in -operations
[14:22:51] nemo-yiannis: done!
[14:36:48] thanks akosiaris
[14:54:44] serviceops, Maps, Patch-For-Review: Re-import full planet data into codfw - https://phabricator.wikimedia.org/T305845 (akosiaris) Final step done! Should I re-resolve or is there anything left pending?
[14:56:01] serviceops, Maps, Patch-For-Review: Re-import full planet data into codfw - https://phabricator.wikimedia.org/T305845 (Jgiannelos)
[14:56:41] serviceops, Maps, Patch-For-Review: Re-import full planet data into codfw - https://phabricator.wikimedia.org/T305845 (Jgiannelos) Open→Resolved
[15:20:20] (ProbeDown) firing: (8) Service gitlab1001:443 has failed probes (http_gitlab_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:33:23] serviceops, Patch-For-Review: Put conf100[789] in production - https://phabricator.wikimedia.org/T311407 (Ottomata) cc @BTullis and @JAllemandou too
[16:03:04] serviceops, Maps: maps{2007,2010} DBs running out of connections - https://phabricator.wikimedia.org/T312239 (Jgiannelos)
[16:03:38] serviceops, Maps, Patch-For-Review: Re-import full planet data into codfw - https://phabricator.wikimedia.org/T305845 (Jgiannelos) Resolved→Open
[16:25:04] serviceops, serviceops-collab: ProbeDown - https://phabricator.wikimedia.org/T312194 (Dzahn) a:Dzahn It was intentional to just see if it works. (and not sure if there was a way to test those before hand). The expectation was that we would get automatically created tickets, email and IRC notificatio...
[16:26:09] serviceops, serviceops-collab: ProbeDown - https://phabricator.wikimedia.org/T312194 (Dzahn) > description: gitlab1004:443 failed We configured the checks to test gitlab.wikimedia.org, not gitlab1004:443.
[17:10:44] serviceops, Machine-Learning-Team, ORES, SRE: Migrate ORES Redis servers to Stretch/Buster - https://phabricator.wikimedia.org/T224569 (akosiaris) Open→Resolved a:akosiaris Done a long time ago. Now [misc_redis](https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc) is being us...
[17:21:01] serviceops, Maps: maps{2007,2010} DBs running out of connections - https://phabricator.wikimedia.org/T312239 (Jgiannelos) I manually created the missing indexes and it looks like the nodes are recovering.
[17:31:17] serviceops, Maps: maps{2007,2010} DBs running out of connections - https://phabricator.wikimedia.org/T312239 (Jgiannelos) Open→Resolved
[17:31:17] serviceops, Maps, Patch-For-Review: Re-import full planet data into codfw - https://phabricator.wikimedia.org/T305845 (Jgiannelos)
[17:31:17] serviceops, Maps, Patch-For-Review: Re-import full planet data into codfw - https://phabricator.wikimedia.org/T305845 (Jgiannelos) Open→Resolved
[19:20:20] (ProbeDown) firing: (8) Service gitlab1001:443 has failed probes (http_gitlab_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:26:46] I can't confirm this problem.
[19:27:08] and it's unrelated to new monitoring that was added using blackbox checks
[19:27:37] or if it is, then it should not have the "p.age" string in it
[19:27:48] if it is.. then https://phabricator.wikimedia.org/T312194
[19:33:40] how do you ACK a jinxer-wm alert?
[19:33:47] is there such a concept?
[19:34:08] "silence" is like disabling notifications and not really the same I guess
[19:42:40] alerts.wikimedia.org
[20:17:44] serviceops, serviceops-collab: ProbeDown - https://phabricator.wikimedia.org/T312194 (Dzahn) I added silences in alerts.wikimedia.org for all of these. silence feels like disabling notifications though. What I really want is "ACK" or scheduled downtime. But it seems like those concepts don't exist anymore.
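An Alertmanager silence always carries a start and an end time, which makes a time-bounded silence the closest analogue to the scheduled downtime asked about above. The following is a minimal sketch of the JSON body accepted by Alertmanager's v2 API at `/api/v2/silences`. It only builds the payload and sends nothing. The `instance` label name, the author string, and the helper name `build_silence` are assumptions for illustration; the labels actually attached to the ProbeDown alert are not shown in this log.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(alertname, instance, hours, author, comment, now=None):
    """Build the JSON body for a time-bounded Alertmanager silence.

    Hypothetical helper: the label names used in the matchers are
    assumptions, not taken from the alert definitions in this log.
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": "alertname", "value": alertname,
             "isRegex": False, "isEqual": True},
            # Assumed label name; check the real alert's labels first.
            {"name": "instance", "value": instance,
             "isRegex": False, "isEqual": True},
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


if __name__ == "__main__":
    body = build_silence(
        "ProbeDown", "gitlab1001:443", hours=4,
        author="example-user", comment="host being decommissioned",
    )
    print(json.dumps(body, indent=2))
```

Because the silence expires on its own at `endsAt`, it behaves like a downtime window rather than a permanent mute; the `comment` field is where an ACK-style note would go.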
[21:46:40] serviceops, Analytics, Data-Engineering, Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (JArguello-WMF) @Ottomata Should we remove this task from Analytics to Data Engineering?
[21:49:23] serviceops, Data-Engineering, Event-Platform, SRE, Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (JArguello-WMF)
[22:00:54] serviceops, DNS, SRE, Traffic, and 4 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Varnent) >>! In T310738#8035453, @Dzahn wrote: >>>! In T310738#8033789, @LSobanski wrote: >> @Varnent After chatting about this...
[22:22:01] serviceops, Infrastructure-Foundations, SRE, SRE-Access-Requests, and 4 others: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (Dzahn) The change has been approved and then deployed. On gitlab-runner1002 I saw puppet ad...
[22:34:20] serviceops, Infrastructure-Foundations, SRE, SRE-Access-Requests, and 4 others: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (Dzahn) Open→Resolved ` [gitlab-runner1002:~] $ for relenguser in brennen dancy dduv...
[23:20:20] (ProbeDown) firing: (8) Service gitlab1001:443 has failed probes (http_gitlab_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:22:47] ^ in the process of decom'ing it
[23:23:00] firewall change
[23:23:07] before the cookbook ran
[23:24:19] silenced
[23:27:10] now decom cookbook included 'Downtimed host on Icinga/Alertmanager
[23:30:09] serviceops, serviceops-collab: ProbeDown - https://phabricator.wikimedia.org/T312194 (phaultfinder)