[07:05:39] * dhinus paged HarborComponentDown tools
[07:08:33] alert text: "Got no data for any component, all might be down or unresponsive, toolforge might be unable to pull images"
[07:14:27] hmm, the harbor ui ended up loading for me after a while, but extremely slowly
[07:15:08] trying to log in, it says "Core service is not available."
[07:15:58] there seems to be a problem with trove
[07:16:17] https://www.irccloud.com/pastebin/a7tzSKG5/
[07:16:48] I can try restarting the trove db
[07:16:50] (from the harbor-core logs)
[07:17:03] didn't andrew do something on trove recently?
[07:18:03] I think he was planning to upgrade the harbor instance next Monday :)
[07:18:48] ok there are some errors in the logs that can be seen from horizon
[07:18:58] project: tools, database->instances
[07:19:39] "in create_backup raise exception.TroveError"
[07:19:58] I tried "restart instance"
[07:20:04] "Unable to restart"
[07:20:33] not possible to ssh to the trove host either
[07:20:45] I'm logging in via ssh
[07:20:55] did it fail for you?
[07:21:17] you need a special key
[07:21:29] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Trove#Accessing_Trove_guest_VMs
[07:21:40] ah, that might be it
[07:23:08] hmm it's not connecting, just hanging
[07:27:25] I cannot connect to any trove instances apparently
[07:29:02] I will try opening a virsh console from the cloudvirt host
[07:29:05] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Root_console_access
[07:31:16] I can connect but it asks for a password
[07:31:27] oh I know why I cannot connect from ssh, I need to enable the security group
[07:31:58] added "ssh-from-anywhere" to the instance tools-harbordb in project trove
[07:34:13] ok I'm in the instance
[07:34:35] the docker container is in a restart loop
[07:35:04] "could not write to file "pg_wal/xlogtemp.12": No space left on device"
[07:35:31] the cinder volume /dev/sdb is full
[07:36:46] it's the postgres data dir so there's nothing we can delete easily
[07:36:54] we need to enlarge the volume I think
[07:38:24] can it be resized via horizon?
[07:39:11] maybe? I never did it but I know it's possible :)
[07:39:54] I stopped the container and unmounted the volume
[07:40:07] https://wikitech.wikimedia.org/wiki/Help:Adding_disk_space_to_Cloud_VPS_instances#Extend_a_volume
[07:40:38] this is also a Trove instance which might complicate things... because Trove is managing the volume
[07:42:30] there's some additional info here https://wikitech.wikimedia.org/w/index.php?title=Portal:Cloud_VPS/Admin/Trove&section=10#Instance_is_down
[07:42:40] hmm the alert resolved by itself at 9:39 but Harbor is still down
[07:44:09] thanks for that link! there's a "resize volume" in the database->instance section of horizon
[07:44:47] re-mounted the volume and I'll try that one, I guess Trove will take care of umounting/mounting
[07:44:56] resizing from 8 to 16
[07:45:11] "Quota exceeded for resources: ['volumes']" LOL
[07:46:42] oh no xd
[07:46:48] morning
[07:47:03] there seems to be plenty of quota though
[07:47:15] in the trove project?
[07:47:17] dcaro: o/
[07:47:21] (/me reading backlog)
[07:47:21] I checked both tools and trove
[07:47:25] ack
[07:48:36] dcaro: morning
[07:49:31] so tools harbor db is having disk space issues and thus harbor is down in tools?
[07:50:06] yes
[07:50:15] is anyone acting as IC? I'll do it if not
[07:51:03] i'm mostly just following along xd
[07:51:06] dcaro: thanks!
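For reference, the access path described above (special admin key, security group change, then poking at the guest) boils down to roughly the following. This is a sketch, not a transcript: the user, key path and container name are placeholders, only /dev/sdb and the instance name come from the log.

    # ssh to the Trove guest using the admin key documented at
    # Portal:Cloud_VPS/Admin/Trove#Accessing_Trove_guest_VMs (user and key path are assumptions)
    ssh -i ~/.ssh/trove-admin-key <user>@<tools-harbordb-address>
    docker ps -a          # the database runs in a container on the guest; confirm it is crash-looping
    df -h                 # /dev/sdb (the Cinder data volume) shows 100% used
    docker stop <harbor-db-container>   # stop the restart loop before touching the volume
    umount /dev/sdb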
[07:51:13] I think it would be nice to create the incident doc
[07:51:19] * dcaro ic
[07:51:26] https://docs.google.com/document/d/1s1f4Lss1Znw3oUav3We6c09QXAY7ja00IsOQyVuQQtw/edit#heading=h.95p2g5d67t9q
[07:52:41] * arturo still not fully online, will be in a few minutes
[07:52:45] should we send an email to cloud-announce about harbor/build service being down?
[07:53:43] blancadesal: +1
[07:53:54] I think I know what the quota error is: it's trove quotas
[07:53:57] not visible from horizon
[07:54:26] https://wikitech.wikimedia.org/w/index.php?title=Portal:Cloud_VPS/Admin/Trove&section=10#Adjusting_per-project_Trove_quotas
[07:54:29] * dhinus paged again, not sure why it resolved for a few mins
[07:55:47] blancadesal: I'll handle the email, feel free to keep debugging
[07:56:04] dcaro: just started the email xd
[07:56:16] blancadesal: okok, please continue then
[07:56:49] quota upgraded
[07:58:25] \o/
[07:59:03] Success: Resizing volume
[08:00:10] yay!
[08:00:21] it didn't really work though :D
[08:00:27] there's no volume mounted
[08:00:44] I'll try mounting manually
[08:01:58] it mounts but it's still at the old size :/
[08:02:44] you might need to expand the fs
[08:03:45] yes, I was hoping trove would do it for me
[08:03:59] lsblk shows the new size
[08:04:53] "resize2fs /dev/sdb" fixed it
[08:05:02] restarting the trove db from horizon
[08:05:26] 🤞
[08:05:37] we're back!
[08:05:42] I can log in to https://tools-harbor.wmcloud.org/
[08:06:40] \o/
[08:06:44] harbor-core logs are back to normal
[08:06:55] let's try running a build
[08:08:16] running the functional tests now
[08:08:25] all good
[08:08:33] \o/
[08:09:26] sending an update email
[08:10:35] okok, I'm resolving the incident
[08:10:39] :)
[08:10:46] do we have a Phab ticket for this if people need to report something?
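The recovery sequence above, in short: raising the Trove quota and resizing the volume from Horizon grows the block device but not the filesystem, so the last step still had to be done by hand on the guest. A minimal sketch (the mountpoint is an assumption, the rest matches the log):

    lsblk /dev/sdb                        # block device already reports the new 16G size
    mount /dev/sdb /var/lib/postgresql    # re-mount the data volume (mountpoint assumed)
    resize2fs /dev/sdb                    # grow the ext4 filesystem to fill the resized volume
    # then restart the database instance from Horizon (project tools, database -> instances)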
[08:11:23] take a look at https://docs.google.com/document/d/1s1f4Lss1Znw3oUav3We6c09QXAY7ja00IsOQyVuQQtw/edit and add there any stuff I might have missed/got wrong (still filling stuff up though), I'll sort all the info out after
[08:24:44] good work you all, great incident kung-fu
[08:28:01] dhinus: there was a blip in the prometheus stats, probably from a restart that looked like it worked until it failed to do some db stuff
[08:28:04] https://usercontent.irccloud-cdn.com/file/dxE92KOm/image.png
[08:28:09] that might be the cause for the double alert
[08:37:15] interesting, because I stopped the container with "docker stop" and I didn't see it coming up, let me check the docker logs
[08:37:50] hmm I'm still seeing errors in the logs
[08:38:29] WARNING: archiving write-ahead log file "0000000100000244000000B5" failed too many times, will try again later
[08:39:05] there's more details, it's apparently failing to "cp pg_wal/0000000100000244000000B5 /var/lib/postgresql/data/wal_archive/0000000100000244000000B5"
[08:41:17] that's trove itself
[08:41:40] that was supposed to be fixed in the latest version, it's used for backups and replication, but it's not working
[08:41:45] (the wal_archive thingie)
[08:42:36] https://phabricator.wikimedia.org/T343683
[08:42:43] T343683
[08:42:44] T343683: [toolsbeta.harbor] trove postrgres DB out of space, v2 - https://phabricator.wikimedia.org/T343683
[08:42:48] I see
[08:43:06] the annoying bit is that it's spamming the logs with thousands of those lines :)
[08:43:48] I'm diving into prometheus metrics to see if we could alert/warn on this situation approaching, but it seems there is a metric reporting 0 always
[08:43:49] https://thanos.wikimedia.org/graph?g0.expr=openstack_trove_instance_volume_size_gb%7Btenant_id%3D%22tools%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g1.expr=openstack_trove_instance_volume_used_gb%7Btenant_id%3D%22tools%22%7D&g1.tab=0&g1.stacked=0&g1.range_input=2h&g1.max_source_resolution=0s&g1.deduplicate=1&g1.partial_response=0&g1.store_matches=%5B%5D
[08:44:38] short url: https://w.wiki/Ajv$
[08:44:54] arturo: feel free to add your notes here T354728
[08:44:54] T354728: Trove does not expose amount of disk space used - https://phabricator.wikimedia.org/T354728
[08:45:13] oh, it is known
[08:46:32] dhinus: the wal was supposed to be disabled though :/ not trying to do it
[08:46:52] maybe it got re-enabled somehow (as it's not the default in trove, it was manually changed)
[08:48:08] I confirm those log errors about the wal started hours before the outage
[08:48:25] the first "disk full" was at 6:40 UTC (25 mins before the page)
[08:49:02] page was actually at 6:59 so 19 mins after the first error
[08:49:20] we have 10min buffer in the alert for the prometheus stat
[08:50:07] the logs don't show any sign of restarts or attempted restarts before 8:05 so it's a mystery why the prometheus metric had that glitch
[08:52:00] not true, there are some restarts, sorry
[08:52:53] it was in a restart loop until I did "docker stop" which was roughly at the time the alert glitched
[08:53:07] so one of the restart loops probably caused the alert to become green for a moment
[08:58:25] the db should actually be pretty small, in the order of 100MB
[09:00:20] the number of executions is 8x smaller than last time though (2M vs 17M), so maybe the wal is also taking space
[09:01:26] https://www.irccloud.com/pastebin/lT7xuYpA/
[09:01:28] yep
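A quick way to confirm the suspicion above that the WAL archive, rather than the data itself, is eating the volume. This is a sketch assuming direct access to the guest: the data-dir paths are the ones that appear in the log, the container name is a placeholder.

    du -sh /var/lib/postgresql/data/wal_archive   # Trove's WAL archive (used for backups/replication)
    du -sh /var/lib/postgresql/data/pg_wal        # live WAL
    docker exec -it <harbor-db-container> \
      psql -U postgres -c "SELECT datname, pg_size_pretty(pg_database_size(datname)) FROM pg_database;"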
[09:05:04] dhinus: just manually disabled the wal and restarted the database, I don't see the wal logs anymore, will clean up the files on disk
[09:06:08] let me investigate if we can disable it at the trove level somehow, to avoid having to apply it manually
[09:09:36] :/, wal settings are not exposed
[09:09:37] root@cloudcontrol1006:~# wmcs-openstack database configuration parameter list b0fb1fc1-47cb-4703-9e77-2c8476ac05ee | grep wal
[09:10:11] oh, wait, wrong id
[09:10:12] https://www.irccloud.com/pastebin/oDrnQml0/
[09:10:16] \o/
[09:14:55] oh my... I edited a task instead of creating a subtask :/
[09:26:30] the volume stats are pulled from the cinder api by the openstack exporter, using https://docs.openstack.org/api-ref/block-storage/v3/#show-quota-usage-for-a-project
[09:26:50] I'm guessing that it might require some sort of openstack agent running inside the VM, but not sure :/
[09:32:42] maybe because cinder sees blocks, and only from within the VM you can see actual filesystem usage.
[09:39:59] hmpf... the cinder quotas don't seem to be exposed through the openstack cli, and the cinder cli does not seem to use the cloud file
[09:51:28] that's not what's used, it uses the trove api
[09:51:32] and it's shown on the cli
[09:51:33] https://www.irccloud.com/pastebin/ZjKTVaeN/
[09:51:58] might be a data format error somehow, it's using a different endpoint though, so maybe it's not shown in that other one
[10:02:02] hmm, I think this might be related
[10:02:05] https://www.irccloud.com/pastebin/ibDFCODk/
[10:02:19] from the trove guest agent on the VM
[10:02:50] who/what is telling the agent to read /dev/vdb ?
[10:03:35] not sure, it should be reading it from the config as far as I can see in the code
[10:03:37] https://www.irccloud.com/pastebin/XD2pc6Ca/
[10:03:40] and it's there
[10:03:44] (guest-agent-venv) root@tools-harbordb:/opt/guest-agent-venv# grep vdb /etc/trove/conf.d/*
[10:03:48] that returns empty
[10:04:20] this is suspicious
[10:04:21] lib/python3.8/site-packages/trove/common/cfg.py: cfg.StrOpt('device_path', default='/dev/vdb',
[10:04:30] ah!!
[10:05:09] I have seen that before, being bitten by a default because a config was not being read/applied as expected
[10:05:25] I think I even reported this to upstream openstack once, with neutron
[10:06:13] maybe it was this one https://bugs.launchpad.net/neutron/+bug/2003534
[10:28:51] * dcaro lunch
[10:29:06] will have to go to the vet right after, might take a bit more time than usual
[10:29:17] 👍
[11:06:58] hmm... this looks ok:
[11:07:01] https://www.irccloud.com/pastebin/qB53PKie/
[11:08:07] 2024-07-24 07:59:11.848 554 ERROR trove.guestagent.volume [None req-2da7ab1f-c952-4a74-9cf1-40005efe0ace fnegri trove - - - -] Device '/dev/vdb' is not ready.: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
[11:08:21] I think that the device might come from the openstack db, not the trove config directly
[11:08:26] (as in an api call)
[11:20:52] * dcaro off to vet
[12:07:22] * dcaro back
[13:28:55] dcaro: quick review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056494
[13:29:57] blancadesal: that's fixed already xd just rebased and it shows empty
[13:30:10] sorry if I got you confused
[13:30:11] ?
[13:30:21] ah, you did it already?
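The "manually disabled the wal" step at the top of this block is not spelled out in the log. One way it could be done on the guest, as a hedged sketch rather than a record of what was actually run (container name is a placeholder; archive_mode changes only take effect after a postgres restart):

    docker exec -it <harbor-db-container> \
      psql -U postgres -c "ALTER SYSTEM SET archive_mode = 'off';"
    docker restart <harbor-db-container>
    # reclaim the space taken by the accumulated archive files (the "clean up the files on disk" step)
    rm -rf /var/lib/postgresql/data/wal_archive/*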
[13:30:31] yep https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056489
[13:30:45] it was blocking the creation of credentials for new accounts
[13:31:07] thanks :) and sorry for missing that one
[13:34:59] np I missed it too, the tests were passing when we merged it, but once that endpoint was not available on the envvars side it started breaking
[13:35:57] another quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/449
[13:37:24] blancadesal: how did it pass before?
[13:37:49] it didn't, I yolo'd it
[13:38:04] 🤦‍♂️ xd
[13:43:07] "doing it freed up some mental space to be paranoid about higher-stake changes"
[13:43:32] * blancadesal is all about the narrative
[13:43:39] xd
[13:44:28] don't hesitate to ask for a second pair of eyes on anything
[13:44:53] like with the envvars change? :P
[13:46:12] yep (UwU)
[13:49:36] for the calico upgrade, does anything need to be done here https://gitlab.wikimedia.org/repos/cloud/toolforge/calico or is it all through toolforge-deploy now?
[13:49:44] I already got the new images
[13:50:58] we use that yes
[13:51:11] then in toolforge-deploy you can specify which version of the generated chart you want
[13:52:41] you will probably need to update it with the newer templates from upstream (see the notes in the current templates)
[13:53:28] is there any grafana dashboard for cloudlbs?
[13:54:15] something that shows the number of requests/bandwidth/etc. for wikireplicas as reported by haproxy
[13:55:19] blancadesal: not sure though if/why we use the templates and not the helm chart directly, I think it might install other things we don't need, arturo do you remember?
[13:55:53] dhinus: there might be something, you have one for latency https://grafana.wikimedia.org/d/UUmLqqX4k/wmcs-openstack-api-latency?forceLogin&orgId=1&refresh=30s&var-backend=keystone-admin-api_backend&var-backend=keystone-public-api_backend&var-backend=nova-api_backend&var-cloudcontrol=cloudlb1001&var-cloudcontrol=cloudlb1002
[13:56:29] this one is out of date though :/ https://grafana.wikimedia.org/d/tanisM2Zz/wmcs-openstack-api-stats-eqiad1?orgId=1&refresh=5m
[13:56:29] dcaro: afaik, we are not installing all the components. I only pushed a subset of the images. but arturo will know more
[13:57:03] arturo: when you have a moment to recheck https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/434
[14:00:07] I will check later
[14:57:28] only 666 tasks left from that query :)
[14:57:33] ominious
[14:57:50] without the i
[14:58:06] hahaha
[14:58:23] is that number going up or down? or are we stuck in a triaging hell?
[14:58:24] 'ominious' is a more ominous version of 'ominous'
[14:58:57] sometimes more is less
[15:01:04] only 142 "needs triage" left in the cloud-services-team backlog
[15:01:20] probably doable in 2/3 meetings?
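While no dedicated cloudlb/wikireplicas dashboard exists, the haproxy numbers asked about above could in principle be pulled straight from Thanos. This is only a sketch: the metric and label names are assumptions that depend on which haproxy exporter the cloudlbs run, and the Thanos query endpoint is the standard Prometheus HTTP API path.

    # hypothetical: request rate and outbound bandwidth per wikireplicas backend
    curl -sG 'https://thanos.wikimedia.org/api/v1/query' \
      --data-urlencode 'query=sum by (backend) (rate(haproxy_backend_sessions_total{backend=~".*wikireplica.*"}[5m]))'
    curl -sG 'https://thanos.wikimedia.org/api/v1/query' \
      --data-urlencode 'query=sum by (backend) (rate(haproxy_backend_bytes_out_total{backend=~".*wikireplica.*"}[5m]))'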
[15:01:35] I didn't count how many we triaged today
[15:03:01] I think triage will be needed regardless anyhow, probably less often, or shorter, but it can't be avoided
[15:04:58] yep I was thinking how long it will be before we can only triage "new" tasks :)
[15:12:09] technically we cleared two years of tasks today xd
[15:12:15] :D
[15:12:33] not bad
[15:12:40] 👍
[15:15:17] related: in the clinic duties wiki there is a link to a query to be used for triaging tasks
[15:15:39] but I've always found that query to be sub-optimal, I even tried to improve it a few times
[15:16:14] this is the current version: https://phabricator.wikimedia.org/maniphest/query/_oUo8wzJaVVf/
[15:18:41] * dhinus is tempted to start a decision request on "how to triage"
[16:04:34] * arturo offline
[16:07:11] I found a nice trick to inspect the progress of long transactions in ToolsDB replication: https://phabricator.wikimedia.org/T370760#10011275
[16:07:26] tomorrow I'll add that to wikitech as well
[16:55:24] whoohoo Southparkfan, one more down!
[16:58:32] andrewbogott: yes!
[16:59:33] but unless someone is interested in providing me with assistance to move the rest over, I'm going to stop here
[17:03:17] that's fine, thank you for all you've done!
[17:04:18] * andrewbogott -> long lunch break
[17:09:20] dhinus: does the toolsdb replica have a `heartbeat_p` setup? With the recent questions about Quarry searches I'm wondering if I can extend https://replag.toolforge.org/ to also report on ToolsDB replication. (Or make a separate tool to do that)
[17:11:05] bd808: yes it has heartbeat_p, it should work!
[17:11:24] it's already working if you use quarry, quarry will show you the replag based on that
[17:16:39] dhinus: Is there a `tools.db.svc.wikimedia.cloud` equivalent service name for the replica?
[17:22:54] * dcaro off
[17:22:56] cya tomorrow!
[17:36:31] I figured out that `tools-readonly.db.svc.wikimedia.cloud` is the service name needed
[21:05:25] cteam: I finally put the CSP consent tool proposal on wiki at https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Third-party_interaction_consent_tool. This has been a TODO for me for like 4 months... my brain was stuck on "is it ready to share?" and I finally got over that.
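On the heartbeat_p question above: a minimal sketch of what a ToolsDB replag check could look like from a Toolforge tool account, assuming the replica's heartbeat_p view has the same layout as the one on the wiki replicas and that the tool's replica.my.cnf credentials are accepted.

    mysql --defaults-file="$HOME/replica.my.cnf" \
      -h tools-readonly.db.svc.wikimedia.cloud \
      -e 'SELECT * FROM heartbeat_p.heartbeat\G'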