I was working on figuring out what was going on with my instance and @vpzom@narwhal.city helped me figure it out.
One key thing as an admin is heading into your PostgreSQL database and looking at the tasks. There's often good information in the task table about what the system is thinking and why.
First, I logged into PostgreSQL. I just started with:
psql
Then I connected to my database using:
\c lotide [username] 127.0.0.1 [port]
The default port is 5432; for reasons known only to the Illuminati, my database is running on 5433.
It will ask you for your password, then you'll be logged in.
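If you'd rather not start psql and then reconnect, the same connection can be made in one command from the shell. This is a minimal sketch, assuming the database is named lotide as above; the function name lotide_psql and the LOTIDE_DB_PORT variable are made up for illustration and are not lotide settings.

```shell
# Hypothetical helper: connect to the lotide database in a single step.
# Usage: lotide_psql <db-username> [extra psql arguments...]
# LOTIDE_DB_PORT is an illustrative variable name; it falls back to 5432.
lotide_psql() {
    db_user=$1
    shift
    psql -h 127.0.0.1 -p "${LOTIDE_DB_PORT:-5432}" -U "$db_user" -d lotide "$@"
}
```

With my setup that would be something like LOTIDE_DB_PORT=5433 followed by lotide_psql [username]; psql still prompts for the password.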
You can see information about the tables in the database by typing \d. You can find information about a specific table by typing \d [tablename].
Initially I tried select * from task;
but it looked like it didn't work at first because some of the columns hold a massive number of characters, far wider than my screen, so I was just scrolling and scrolling. When the output doesn't fit on the screen, psql is nice to you and hands it to a pager that you can move around in with the arrow keys. In that case, the prompt changes to a : and you need to press q to leave the pager and return to the normal SQL prompt.
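One way around the very wide rows is to ask psql for expanded output (one field per line) and to switch the pager off. The sketch below is only an illustration: show_latest_task and LOTIDE_DB_PORT are hypothetical names, and it assumes the created_at column used in the query further down.

```shell
# Hypothetical helper: print the newest row of the task table, one field per line.
# Usage: show_latest_task <db-username>
show_latest_task() {
    # -x turns on expanded display; -P pager=off prints straight to the terminal.
    psql -h 127.0.0.1 -p "${LOTIDE_DB_PORT:-5432}" -U "$1" -d lotide \
         -x -P pager=off \
         -c "SELECT * FROM task ORDER BY created_at DESC LIMIT 1;"
}
```

Inside an interactive psql session, \x and \pset pager off should have the same effect.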
vpzom suggested I try SELECT id, state, latest_error FROM task ORDER BY created_at DESC;
which showed all the pending and completed requests.
Ultimately, I also tried SELECT id, state, kind, latest_error, attempts, max_attempts FROM task WHERE state = 'pending';
which helped me understand not just that there were pending transactions, but in some cases why I had pending transactions. After I resolved the issue, I could see 3 spots where the transactions were still pending, but the error message indicated it was due to the remote node misbehaving so there wasn't anything I could do about that.
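When there are a lot of pending rows, grouping them by error can make the "why" easier to spot. This is a sketch rather than anything vpzom suggested: it only uses the kind, state and latest_error columns from the queries above, and pending_task_summary and LOTIDE_DB_PORT are hypothetical names.

```shell
# Hypothetical helper: count pending tasks per kind and error message.
# Usage: pending_task_summary <db-username>
pending_task_summary() {
    psql -h 127.0.0.1 -p "${LOTIDE_DB_PORT:-5432}" -U "$1" -d lotide \
         -c "SELECT kind, latest_error, count(*) AS tasks
             FROM task
             WHERE state = 'pending'
             GROUP BY kind, latest_error
             ORDER BY tasks DESC;"
}
```

Errors that point at a remote node, like the three I had left over, would then show up as their own rows with a small count.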
One final thing: when I was looking at error messages, I started in the system log using journalctl -xe. The only error messages I saw there were ones thrown by systemd. To see the error messages thrown by lotide itself, I had to use systemctl status lotide, which seems to track the messages that would be displayed on the command line.
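systemctl status only prints the last few lines, so for a longer history it may be easier to ask journalctl for that one unit. A small sketch, assuming the service unit is named lotide as in the systemctl command above; lotide_logs is a made-up name.

```shell
# Hypothetical helper: show the most recent journal lines for the lotide unit.
# Usage: lotide_logs [number-of-lines]   (defaults to 50)
lotide_logs() {
    # -u limits output to one unit, -n sets the line count,
    # --no-pager prints directly instead of opening a pager.
    journalctl -u lotide -n "${1:-50}" --no-pager
}
```

Swapping -n for -f would follow the log live instead.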
In my case, the backend didn't seem to be sending any messages or even trying. The attempts field said '0', whereas the completed transactions said at least 1. When I looked at systemctl, I noticed that, especially after a reboot, systemd seemed to be constantly trying to restart lotide, but it was complaining that it was already running or that there was some other problem. This suggested to me that a lotide process was still running in the background and hadn't exited correctly. I added a killall lotide command to the script I use to start the backend, and the tasks started running immediately.
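For illustration, the start script change looks roughly like the sketch below. It is not my exact script: start_lotide and LOTIDE_BIN are hypothetical names, and how you launch the backend will depend on your install.

```shell
# Hypothetical start wrapper: clear out any stale lotide process, then launch.
# LOTIDE_BIN is an assumed variable for the path to the backend binary.
start_lotide() {
    # killall exits non-zero when nothing matched; that is fine here.
    killall lotide 2>/dev/null || true
    # Start the backend, passing through any arguments.
    "${LOTIDE_BIN:-lotide}" "$@"
}
```

Note that killall returns as soon as it has sent the signal, so a short pause before launching may be needed if the old process is slow to exit.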
As an additional prophylactic measure, because I'm not a full-time system administrator and can't always be at my computer to tinker with my servers, I added a line to my crontab to restart the lotide service every day so it won't hang. It might be overkill, but it's a simple solution that doesn't really have any downsides. The restart takes seconds and the site doesn't go down if federation restarts, so it's low risk for the potential reward of not having any problems while I'm away.
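A sketch of what such a daily restart could look like. The schedule, the script path in the comment and the restart_lotide name are all assumptions for the example; a bare systemctl restart lotide in the crontab does the job too.

```shell
# Hypothetical cron job body. An example root crontab line (04:00 every day):
#   0 4 * * * /usr/local/bin/restart-lotide.sh
restart_lotide() {
    # Restart the service; stop here if systemd reports a failure.
    systemctl restart lotide || return 1
    # is-active exits 0 only while the unit is running, so a failed
    # restart surfaces as a non-zero exit status for cron to report.
    systemctl is-active --quiet lotide
}
```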
jlj 2 years ago
Found this post quite helpful. Still can't get to the bottom of what's happened to my instance, though; I had a crash, and suspect my database is corrupt, even though PostgreSQL itself doesn't seem to be complaining.
lotide doesn't seem to be logging anything right now; it opens up a connection to the remote database, but seems to just sit there. The last lines in the log, from hours ago, are:
I've made sure all instances of lotide are killed before these restarts, as I read that can also be a problem.
Any troubleshooting tips would be greatly appreciated!
jlj 2 years ago
Never mind. Networking issue. :facepalm:
Love the project!