Hi @wenweihu86 @loveheaven @guohao @wangwg1, I discovered that incorrect initialization logic in the SegmentedLog class prevents a node from restarting. Below, I explain my findings in detail.
How to trigger this bug
As shown in Figure 1, this is a 3-node cluster (n1, n2, and n3 are not shown in the diagram). First, n1 experiences an election timeout (Action 1) and sends a vote request to n2 (Action 2). n2 processes the request and votes for n1 (Action 3), and n1 receives n2's vote and becomes the leader (Action 4). Then n1 receives a ClientReq request from the client (Action 5) and writes the value 5 to its log. Finally, n1 restarts (Action 6). The restart fails every time, which led me to identify this bug.
Figure 1. An Example That Triggers the Bug.
Root cause
Here is why n1 is unable to restart. On restart, n1 runs the com.github.wenweihu86.raft.example.server.ServerMain#main method, which calls the com.github.wenweihu86.raft.RaftNode#RaftNode constructor. That constructor in turn calls com.github.wenweihu86.raft.storage.SegmentedLog#SegmentedLog to initialize the SegmentedLog class, and this is where the problem occurs. As shown in Figure 2, n1's data is stored in example1/data/log, which contains only a single open-1 file holding the log entries (the entry in which the client wrote the value 5) and no metadata file. In contrast, n2's example2/data/log directory, also shown in Figure 2, does contain a metadata file.
Figure 3 shows the SegmentedLog initialization code. For n1, the metadata read from disk is null and startLogIndexSegmentMap.size() > 0, so the constructor executes throw new RuntimeException("No readable metadata file but found segments");. Nothing catches this RuntimeException, so n1 crashes. This is the root cause of n1's failure to restart.
Figure 2. Data on Disk During n1 Restart.
Figure 3. SegmentedLog Initialization Code.
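Because Figure 3 is an image, the branch it describes can also be approximated in text. The following is a minimal sketch, not the project's actual code: the class SegmentedLogInitSketch, the simplified MetaData type, and the default first log index of 1 are assumptions made for illustration. Only startLogIndexSegmentMap and the exception message come from the description above.

```java
import java.util.TreeMap;

// Hypothetical stand-in for the check described above; not the real SegmentedLog class.
class SegmentedLogInitSketch {

    /** Simplified metadata holding only a first log index (an assumption). */
    static final class MetaData {
        final long firstLogIndex;

        MetaData(long firstLogIndex) {
            this.firstLogIndex = firstLogIndex;
        }
    }

    /**
     * Mirrors the described branch: missing metadata combined with at least
     * one segment on disk is treated as a fatal error.
     *
     * @param metaDataFromDisk        null when no metadata file could be read
     * @param startLogIndexSegmentMap segments found on disk, keyed by start index
     */
    static MetaData init(MetaData metaDataFromDisk,
                         TreeMap<Long, String> startLogIndexSegmentMap) {
        if (metaDataFromDisk == null) {
            if (startLogIndexSegmentMap.size() > 0) {
                // This is the path n1 takes: open-1 exists, metadata does not.
                throw new RuntimeException("No readable metadata file but found segments");
            }
            // Assumed default for a completely empty log directory.
            return new MetaData(1L);
        }
        return metaDataFromDisk;
    }
}
```

With n1's on-disk state (a map containing the single open-1 segment and null metadata), init throws, which matches the crash reported in the log below.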
The following excerpt from n1's logs during the restart confirms that the unhandled RuntimeException thrown by SegmentedLog caused the restart to fail.
2025-02-11 19:19:21.465 [main] WARN c.g.w.raft.storage.SegmentedLog --- meta file not exist, name=./data/log/metadata
2025-02-11 19:19:21.466 [main] ERROR c.g.w.raft.storage.SegmentedLog --- No readable metadata file but found segments in ./data/log
Exception in thread "main" java.lang.RuntimeException: No readable metadata file but found segments
    at com.github.wenweihu86.raft.storage.SegmentedLog.<init>(SegmentedLog.java:49)
    at com.github.wenweihu86.raft.RaftNode.<init>(RaftNode.java:95)
    at com.github.wenweihu86.raft.example.server.ServerMain.main(ServerMain.java:60)
Suggested fix
With the root cause identified, the fix is simple: remove (or comment out) the statement that throws the RuntimeException during SegmentedLog initialization, namely throw new RuntimeException("No readable metadata file but found segments");. The check is flawed because it prevents a healthy node like n1 from restarting.
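As a rough illustration of what the constructor could do once the throw is removed, here is a minimal sketch under stated assumptions. The class SegmentedLogRecoverySketch and the method resolveFirstLogIndex are hypothetical, and deriving the first log index from the earliest segment is one possible fallback, not something the project confirms. Whether rebuilt metadata is safe also depends on what else the metadata file stores, which is not covered here.

```java
import java.util.TreeMap;

// Hypothetical sketch of a non-throwing variant of the check; not the project's code.
class SegmentedLogRecoverySketch {

    /**
     * Picks a first log index without aborting startup.
     *
     * @param firstLogIndexFromMetadata null when no metadata file could be read
     * @param startLogIndexSegmentMap   segments found on disk, keyed by start index
     */
    static long resolveFirstLogIndex(Long firstLogIndexFromMetadata,
                                     TreeMap<Long, String> startLogIndexSegmentMap) {
        if (firstLogIndexFromMetadata != null) {
            // Metadata was readable: use it unchanged.
            return firstLogIndexFromMetadata;
        }
        if (startLogIndexSegmentMap.isEmpty()) {
            // Fresh node with no segments: assumed default of 1.
            return 1L;
        }
        // Previously a RuntimeException was thrown here. The sketch logs a
        // warning and falls back to the earliest segment's start index (assumption).
        System.err.println("No readable metadata file but found segments; rebuilding metadata");
        return startLogIndexSegmentMap.firstKey();
    }
}
```

For n1, whose log directory holds only open-1, this returns 1 and startup can continue instead of crashing.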
Thank you for taking the time to read this. I'm looking forward to your confirmation, and would be happy to help fix the issue if needed.