I've been running my own Misskey instance (mk.vulpes.one) for some time now without any major issues. However, since yesterday, when Misskey has been running for a while, most (not all!) requests it makes will suddenly fail with a timeout error:
This screenshot shows different kinds of requests failing: mostly ones fetching posts from some instances, but you can also see the download of an emoji failing.
When the timeouts happen, the job queue builds up a lot of delayed jobs, since most instances appear unresponsive to Misskey. Many emojis from other instances also stop showing up in the frontend, since Misskey proxies them. The same goes for link preview thumbnails and favicons.
I did a lot of investigation to make sure it isn't a silly issue on my end:
Restarting Misskey fixes the problem temporarily. It takes 1 hour at most before it starts happening again.
I confirmed that the issue only affects Misskey and not my whole server by requesting the same URLs with curl. You can see in this screenshot that curl gets the data just fine (and in well under 10 seconds):
Disabling fail2ban and my firewall did not fix the issue.
I did not run out of free disk space or RAM.
The error "network timeout at:" seems to be emitted by Node's HTTP/HTTPS Agent. This made me think that perhaps Node ran out of sockets for some reason, but I'm not sure how that would happen. In any case, I checked the file descriptor limit on my server and it's at 524288 by default. This should be plenty, as far as I know.
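A minimal sketch of how the socket theory could be checked from the Node side. I have not run this against my instance. `agentStats` counts the active, idle and queued entries of a Node `http.Agent`, and `probe` fetches a URL with a fresh one-off agent and a timeout, so the result does not depend on any shared connection pool. Both function names are made up for illustration, and I am assuming it is possible to get a reference to the agent Misskey actually uses, which I have not verified.

```typescript
import * as http from "node:http";
import * as https from "node:https";

// Shape shared by agent.sockets, agent.freeSockets and agent.requests:
// a host key mapped to a list of sockets or pending requests.
type SocketTable = { readonly [host: string]: readonly unknown[] | undefined };

function countEntries(table: SocketTable): number {
  let total = 0;
  for (const list of Object.values(table)) total += list?.length ?? 0;
  return total;
}

// Hypothetical helper: summarize how busy an agent's connection pool is.
function agentStats(agent: http.Agent): { active: number; idle: number; queued: number } {
  return {
    active: countEntries(agent.sockets),     // sockets currently in use
    idle: countEntries(agent.freeSockets),   // keep-alive sockets waiting for reuse
    queued: countEntries(agent.requests),    // requests waiting for a free socket
  };
}

// Hypothetical helper: request a URL with plain Node, outside of Misskey.
function probe(url: string, timeoutMs = 10000): Promise<{ status: number; ms: number }> {
  const started = Date.now();
  const client = url.startsWith("https:") ? https : http;
  return new Promise((resolve, reject) => {
    // agent: false creates a one-off agent, so no pooled socket is reused.
    const req = client.get(url, { agent: false, timeout: timeoutMs }, (res) => {
      res.resume(); // drain the body so the "end" event fires
      res.on("end", () => resolve({ status: res.statusCode ?? 0, ms: Date.now() - started }));
      res.on("error", reject);
    });
    // The "timeout" event only reports idleness; the request must be destroyed explicitly.
    req.on("timeout", () => req.destroy(new Error(`timed out after ${timeoutMs} ms`)));
    req.on("error", reject);
  });
}
```

If `queued` kept growing while `active` stayed pinned at the agent's `maxSockets`, that would point towards sockets not being released. If `probe` succeeds on a URL that Misskey times out on at the same moment, the problem is more likely in Misskey's own connection handling than in the network.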
At this point, I have no idea what's going on. I would appreciate your help since this issue has made my instance pretty much unusable for me. Thanks!