Zookeeper源码分析之连接建立深入剖析分布式系统中协调服务的网络通信协议连接建立流程会话管理机制故障处理策略以及性能优化方法

1. 引言

Apache ZooKeeper是一个为分布式应用提供高性能、高可用、且具有严格顺序访问控制能力的分布式协调服务。它被广泛应用于分布式系统中，用于命名服务、配置管理、集群管理、分布式锁等场景。在ZooKeeper的架构中，客户端与服务器之间的连接建立是整个系统运行的基础。本文将深入剖析ZooKeeper源码中连接建立的相关机制，包括网络通信协议、连接建立流程、会话管理、故障处理以及性能优化等方面。

2. Zookeeper网络通信协议概述

ZooKeeper使用自定义的协议进行客户端与服务器之间的通信。这个协议基于TCP/IP协议栈，设计上考虑了低延迟和高吞吐量的需求。

2.1 协议基础

ZooKeeper通信协议主要由两部分组成：请求(Request)和响应(Response)。每个请求和响应都包含一个固定长度的协议头和一个可变长度的主体部分。

在源码中，协议头由org.apache.zookeeper.server.quorum.QuorumPacket类定义：

public class QuorumPacket implements Record { private int type; // 消息类型 private long zxid; // 事务ID private byte[] data; // 数据内容 private long epoch; // 领导者选举时代 // 序列化和反序列化方法 public void serialize(OutputArchive archive, String tag) throws IOException { archive.writeInt(type, "type"); archive.writeLong(zxid, "zxid"); archive.writeBuffer(data, "data"); archive.writeLong(epoch, "epoch"); } public void deserialize(InputArchive archive, String tag) throws IOException { type = archive.readInt("type"); zxid = archive.readLong("zxid"); data = archive.readBuffer("data"); epoch = archive.readLong("epoch"); } }

2.2 通信协议类型

ZooKeeper定义了多种通信协议类型，在org.apache.zookeeper.ZooDefs.OpCode中定义：

public interface OpCode { boolean notification = 0; boolean create = 1; boolean delete = 2; boolean exists = 3; boolean getData = 4; boolean setData = 5; boolean getACL = 6; boolean setACL = 7; boolean getChildren = 8; boolean sync = 9; boolean ping = 11; boolean getChildren2 = 12; boolean check = 13; boolean multi = 14; boolean auth = 100; boolean setWatches = 101; boolean sasl = 102; boolean createSession = -10; boolean closeSession = -11; boolean error = -1; }

2.3 协议序列化

ZooKeeper使用Jute进行协议的序列化和反序列化。Jute是Hadoop项目中的序列化框架，也被ZooKeeper采用。下面是一个请求对象的序列化示例：

public class RequestHeader implements Record { private int xid; // 客户端请求ID private int type; // 请求类型 public void serialize(OutputArchive archive, String tag) throws IOException { archive.writeInt(xid, "xid"); archive.writeInt(type, "type"); } public void deserialize(InputArchive archive, String tag) throws IOException { xid = archive.readInt("xid"); type = archive.readInt("type"); } }

3. 连接建立流程详解

ZooKeeper客户端与服务器之间的连接建立是一个复杂的过程，涉及多个步骤和状态转换。

3.1 客户端连接初始化

客户端连接的初始化主要在org.apache.zookeeper.ZooKeeper类中完成。当创建一个ZooKeeper实例时，会启动一个客户端连接线程：

public ZooKeeper(String connectString, int sessionTimeout, Watcher watcher, boolean canBeReadOnly) throws IOException { // ... 参数校验和其他初始化代码 ... // 创建并启动ZooKeeper客户端对象 cnxn = new ClientCnxn(connectStringParser.getChrootPath(), hostProvider, sessionTimeout, this, watchManager, getClientCnxnSocket(), canBeReadOnly); // 启动客户端连接线程 cnxn.start(); }

3.2 ClientCnxn类分析

ClientCnxn是管理客户端连接的核心类，它包含两个主要线程：

SendThread：负责发送请求并处理服务器响应
EventThread：负责处理事件并通知注册的Watcher

public class ClientCnxn { private final SendThread sendThread; private final EventThread eventThread; public ClientCnxn(String chrootPath, HostProvider hostProvider, int sessionTimeout, ZooKeeper zooKeeper, ClientWatchManager watcher, ClientCnxnSocket clientCnxnSocket, boolean canBeReadOnly) { // ... 初始化代码 ... this.sendThread = new SendThread(clientCnxnSocket); this.eventThread = new EventThread(); } public void start() { sendThread.start(); eventThread.start(); } }

3.3 连接建立过程

连接建立过程主要在SendThread中实现，下面是连接建立的核心代码：

private void startConnect() throws IOException { // 如果当前有连接，先关闭 if (clientCnxnSocket.isConnected()) { clientCnxnSocket.disconnect(); } // 状态设置为连接中 state = States.CONNECTING; // 获取下一个服务器地址 InetSocketAddress addr; if (rwServerAddress != null) { addr = rwServerAddress; rwServerAddress = null; } else { addr = hostProvider.next(1000); } // 记录日志 log.info("Opening socket connection to server " + addr); // 设置名称和开始时间 setName(getName().replaceAll("\(.*\)", "(" + addr.getHostName() + ":" + addr.getPort() + ")")); startConnectMS = System.currentTimeMillis(); // 初始化连接 clientCnxnSocket.connect(addr); }

3.4 会话创建

连接建立后，客户端需要创建会话。会话创建通过发送ConnectRequest实现：

public void connect(InetSocketAddress addr) throws IOException { // 创建socket连接 sockKey = sockChannel.register(selector, SelectionKey.OP_CONNECT); // 如果是第一次连接，发送连接请求 if (primingConnectionString) { primingConnectionString = false; long sessionId = 0; byte[] passwd = new byte[16]; Random r = new Random(); r.nextBytes(passwd); // 创建连接请求 ConnectRequest conReq = new ConnectRequest(0, lastZxid, sessionTimeout, sessionId, passwd); // 序列化并发送 ByteArrayOutputStream baos = new ByteArrayOutputStream(); BinaryOutputArchive boa = BinaryOutputArchive.getArchive(baos); boa.writeInt(-1, "len"); // 先写入一个占位长度 conReq.serialize(boa, "connect"); baos.close(); byte[] lenBuffer = baos.toByteArray(); ByteBuffer bb = ByteBuffer.wrap(lenBuffer); bb.putInt(bb.capacity() - 4); // 写入实际长度 sockChannel.write(bb); } }

3.5 连接状态转换

ZooKeeper客户端连接有多种状态，在org.apache.zookeeper.ZooKeeper.States中定义：

public enum States { CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY, CLOSED, AUTH_FAILED, NOT_CONNECTED; // 判断是否处于活跃状态 public boolean isAlive() { return this == CONNECTED || this == CONNECTEDREADONLY; } // 判断是否处于连接状态 public boolean isConnected() { return this == CONNECTED || this == CONNECTEDREADONLY; } }

连接状态转换图如下：

NOT_CONNECTED -> CONNECTING -> ASSOCIATING -> CONNECTED/CONNECTEDREADONLY ^ | | | | v v v +------------ CLOSED <--- AUTH_FAILED <--- ANY_STATE (EXCEPTION)

4. 会话管理机制分析

ZooKeeper的会话管理是保证分布式系统一致性的关键机制之一。会话(Session)是客户端与服务器之间的一种关联关系，通过会话，客户端可以保持与服务器之间的状态同步。

4.1 会话结构

在ZooKeeper服务器端，会话由org.apache.zookeeper.server.SessionTracker.SessionImpl类表示：

public interface SessionTracker { public static class SessionImpl implements Session { final long sessionId; int timeout; volatile long expireTime; public SessionImpl(long sessionId, int timeout) { this.sessionId = sessionId; this.timeout = timeout; this.expireTime = System.currentTimeMillis() + timeout; } // 判断会话是否过期 public boolean isClosing() { return expireTime <= System.currentTimeMillis(); } } }

4.2 会话ID生成

ZooKeeper使用特定的算法生成会话ID，确保全局唯一性。会话ID生成在org.apache.zookeeper.server.SessionTrackerImpl中实现：

public class SessionTrackerImpl extends Thread implements SessionTracker { private final ConcurrentHashMap<Long, SessionImpl> sessionsById = new ConcurrentHashMap<Long, SessionImpl>(); private final ExponentialBackoffEpoch sessionExpiryQueue; private final AtomicLong nextSessionId = new AtomicLong(1); // 生成下一个会话ID public long initializeNextSession(long id) { long nextSid = 0; // 低位部分使用时间戳 nextSid = (System.currentTimeMillis() << 24) >>> 8; // 高位部分使用服务器ID nextSid = nextSid | (id << 56); return nextSid; } // 创建新会话 public long createSession(int sessionTimeout) { long sessionId = nextSessionId.getAndIncrement(); // 初始化会话 SessionImpl session = new SessionImpl(sessionId, sessionTimeout); // 添加到会话映射 sessionsById.put(sessionId, session); // 添加到过期队列 sessionExpiryQueue.add(session); return sessionId; } }

4.3 会话超时与心跳

ZooKeeper通过心跳机制来维持会话活性。客户端定期发送心跳包(PING)给服务器，服务器收到后会更新会话的过期时间。

客户端心跳在SendThread中实现：

private void clientCnxnSocket(ClientCnxnSocket clientCnxnSocket) { // 计算发送心跳的时间间隔 int timeToNextPing = readTimeout / 2 - (int)((System.currentTimeMillis() - lastPingSent) / 1000); // 如果到了发送心跳的时间 if (timeToNextPing <= 0) { sendPing(); lastPingSent = System.currentTimeMillis(); } } private void sendPing() { // 创建PING请求 RequestHeader h = new RequestHeader(); h.setType(ZooDefs.OpCode.ping); // 序列化请求 ByteArrayOutputStream baos = new ByteArrayOutputStream(); BinaryOutputArchive boa = BinaryOutputArchive.getArchive(baos); boa.writeInt(-1, "len"); h.serialize(boa, "header"); baos.close(); byte[] lenBuffer = baos.toByteArray(); ByteBuffer bb = ByteBuffer.wrap(lenBuffer); bb.putInt(bb.capacity() - 4); // 发送请求 clientCnxnSocket.sendPacket(bb); }

服务器端处理心跳在org.apache.zookeeper.server.NIOServerCnxn中实现：

public void processPacket(RequestHeader h, ByteBuffer incomingBuffer) throws IOException { // 处理PING请求 if (h.getType() == ZooDefs.OpCode.ping) { // 更新会话过期时间 serverStats.updateLatency(requestCreateTime); lastOp = "PING"; // 更新会话超时时间 updateStatsForResponse(h.getXid(), 0, lastOp, lastCxid, System.currentTimeMillis() - requestCreateTime); // 发送响应 sendResponse(h, new ReplyHeader(h.getXid(), 0, 0), null, null); return; } // ... 其他请求处理 ... }

4.4 会话清理

当会话超时或客户端显式关闭会话时，服务器会进行会话清理：

public void run() { try { while (running) { long waitTime = sessionExpiryQueue.getWaitTime(); if (waitTime > 0) { Thread.sleep(waitTime); continue; } // 获取过期会话 SessionImpl session = sessionExpiryQueue.remove(); if (session == null) { continue; } // 从会话映射中移除 sessionsById.remove(session.sessionId); // 通知监听器会话已关闭 for (SessionListener listener : sessionListeners) { listener.sessionClosed(session.sessionId); } } } catch (InterruptedException e) { // 处理中断 } }

5. 故障处理策略

ZooKeeper作为一个高可用的分布式协调服务，必须具备完善的故障处理机制。下面我们分析ZooKeeper在连接建立过程中的故障处理策略。

5.1 连接超时处理

客户端在连接建立过程中可能会遇到超时问题，ZooKeeper通过重试机制来处理：

private void connectToOneServer(String host, int port) { // 记录开始连接时间 long startMillis = System.currentTimeMillis(); try { // 尝试连接 sock = new Socket(); sock.setSoLinger(false, -1); sock.setSoTimeout(sessionTimeout); sock.connect(new InetSocketAddress(host, port), sessionTimeout); } catch (IOException e) { // 连接失败，关闭socket if (sock != null) { try { sock.close(); } catch (IOException e1) { LOG.warn("Ignoring exception during socket close", e1); } } // 检查是否超时 if (System.currentTimeMillis() - startMillis < sessionTimeout) { // 未超时，继续尝试连接下一个服务器 return; } // 超时，抛出异常 throw new SessionTimeoutException( "Client session timed out, have not heard from server in " + sessionTimeout + "ms for connection id: " + sessionId); } }

5.2 会话过期处理

当会话过期时，客户端会尝试重新建立连接并恢复会话：

private void conLossPacket(Packet p) { // 如果是会话建立请求 if (p.requestHeader.getXid() == CONNECT_XID) { // 通知事件线程连接丢失 eventThread.queueEvent(new WatchedEvent( Event.EventType.None, Event.KeeperState.Expired, null)); return; } // 其他请求，添加到待发送队列 waitingEvents.add(p); }

5.3 服务器切换机制

当客户端无法连接到当前服务器时，会尝试连接到集群中的其他服务器：

public InetSocketAddress next(long spinDelay) { // 获取当前服务器索引 currentIndex = (currentIndex + 1) % serverAddresses.size(); // 如果已经遍历完所有服务器，等待一段时间 if (currentIndex == 0) { try { Thread.sleep(spinDelay); } catch (InterruptedException e) { Thread.currentThread().interrupt(); } } // 返回下一个服务器地址 return serverAddresses.get(currentIndex); }

5.4 请求重试机制

对于某些类型的请求，ZooKeeper客户端会自动重试：

public void submitRequest(RequestHeader h, Record request, Record response, WatchRegistration watchRegistration) throws InterruptedException { // 创建请求包 Packet packet = new Packet(h, request, response, watchRegistration); // 如果是可重试的请求 if (h.getType() == ZooDefs.OpCode.createSession) { // 直接添加到队列 outgoingQueue.add(packet); } else { // 添加到待发送队列 waitingEvents.add(packet); } // 唤醒发送线程 sendThread.wakeup(); }

6. 性能优化方法

ZooKeeper作为一个高性能的分布式协调服务，在连接建立过程中采用了多种性能优化方法。

6.1 NIO通信模型

ZooKeeper使用Java NIO实现非阻塞IO，提高网络通信效率：

public class ClientCnxnSocketNIO extends ClientCnxnSocket { private Selector selector = Selector.open(); private SocketChannel sockChannel; // 初始化连接 void connect(InetSocketAddress addr) throws IOException { // 创建非阻塞socket通道 sockChannel = SocketChannel.open(); sockChannel.configureBlocking(false); sockChannel.socket().setSoLinger(false, -1); sockChannel.socket().setTcpNoDelay(true); // 注册连接事件 sockKey = sockChannel.register(selector, SelectionKey.OP_CONNECT); // 发起连接 sockChannel.connect(addr); // 初始化发送和接收缓冲区 sendBuffer = ByteBuffer.allocateDirect(1024 * 1024); recvBuffer = ByteBuffer.allocateDirect(1024 * 1024); } // 处理IO事件 void doIO(List<Packet> pendingQueue, LinkedList<Packet> outgoingQueue) throws IOException, InterruptedException { // 处理连接事件 if (sockKey.isConnectable()) { if (!sockChannel.finishConnect()) { throw new IOException("SOCK.connect failed"); } sockKey.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE); } // 处理读事件 if (sockKey.isReadable()) { int rc = sockChannel.read(recvBuffer); if (rc < 0) { throw new IOException("Socket closed"); } // 处理接收到的数据 readResponse(); } // 处理写事件 if (sockKey.isWritable()) { if (outgoingQueue.size() > 0) { // 获取下一个待发送包 Packet p = outgoingQueue.getFirst(); // 发送数据 p.createBB(); sockChannel.write(p.bb); // 如果发送完成，从队列中移除 if (!p.bb.hasRemaining()) { outgoingQueue.removeFirst(); // 添加到待响应队列 pendingQueue.add(p); } } } } }

6.2 批量发送机制

ZooKeeper客户端使用批量发送机制来减少网络IO次数：

private void doWrite(List<Packet> pendingQueue, LinkedList<Packet> outgoingQueue) { // 如果没有待发送的数据，直接返回 if (outgoingQueue.isEmpty()) { return; } // 获取当前待发送包 Packet p = outgoingQueue.getFirst(); // 如果包还没有创建缓冲区，创建它 if (p.bb == null) { p.createBB(); } // 发送数据 int bytesWritten = sockChannel.write(p.bb); // 如果发送完成 if (!p.bb.hasRemaining()) { // 从发送队列中移除 outgoingQueue.removeFirst(); // 添加到待响应队列 pendingQueue.add(p); // 如果还有待发送的数据，继续发送 if (!outgoingQueue.isEmpty()) { // 唤醒选择器 selector.wakeup(); } } }

6.3 缓冲区复用

为了避免频繁创建和销毁缓冲区带来的性能开销，ZooKeeper使用缓冲区复用技术：

public class ClientCnxnSocketNIO extends ClientCnxnSocket { // 直接缓冲区，用于发送和接收数据 private ByteBuffer sendBuffer; private ByteBuffer recvBuffer; // 初始化缓冲区 void initBuffers() { // 根据配置创建适当大小的缓冲区 int bufferSize = getClientConfig().getInt( ZKConfig.JUTE_MAXBUFFER, ZKConfig.JUTE_MAXBUFFER_DEFAULT); // 创建直接缓冲区 sendBuffer = ByteBuffer.allocateDirect(bufferSize); recvBuffer = ByteBuffer.allocateDirect(bufferSize); } // 重置缓冲区 void resetBuffers() { sendBuffer.clear(); recvBuffer.clear(); } }

6.4 异步IO模型

ZooKeeper客户端使用异步IO模型，使得发送请求和处理响应可以并行进行：

public class ClientCnxn { // 待发送队列 private final LinkedList<Packet> outgoingQueue = new LinkedList<Packet>(); // 待响应队列 private final LinkedBlockingDeque<Packet> pendingQueue = new LinkedBlockingDeque<Packet>(); // 发送线程 private class SendThread extends Thread { public void run() { while (true) { // 等待事件 int eventCount = 0; int timeout = clientCnxnSocket.getIdleTimeout(); try { // 等待IO事件或超时 eventCount = clientCnxnSocket.select(timeout); // 处理IO事件 if (eventCount > 0) { clientCnxnSocket.doIO(pendingQueue, outgoingQueue); } // 检查是否需要发送心跳 clientCnxnSocket.doTransport(pendingQueue, outgoingQueue); } catch (IOException e) { // 处理IO异常 } } } } // 事件线程 private class EventThread extends Thread { private final LinkedBlockingQueue<Object> waitingEvents = new LinkedBlockingQueue<Object>(); public void run() { try { while (true) { // 从队列中获取事件 Object event = waitingEvents.take(); // 处理不同类型的事件 if (event instanceof WatcherSetEventPair) { // 处理Watcher事件 WatcherSetEventPair pair = (WatcherSetEventPair) event; for (Watcher watcher : pair.watchers) { try { watcher.process(pair.event); } catch (Throwable t) { LOG.error("Error while calling watcher ", t); } } } else if (event instanceof LocalCallback) { // 处理回调 LocalCallback lcb = (LocalCallback) event; lcb.run(); } } } catch (InterruptedException e) { // 处理中断 } } } }

6.5 连接池技术

为了提高连接建立效率，ZooKeeper实现了连接池技术：

public class ClientCnxnSocketNIO extends ClientCnxnSocket { // 连接池 private final ConcurrentHashMap<String, SocketChannel> connectionPool = new ConcurrentHashMap<String, SocketChannel>(); // 获取连接 SocketChannel getConnection(String host, int port) throws IOException { String key = host + ":" + port; // 如果连接池中已有连接，直接返回 SocketChannel channel = connectionPool.get(key); if (channel != null && channel.isConnected()) { return channel; } // 创建新连接 channel = SocketChannel.open(); channel.configureBlocking(false); channel.socket().setSoLinger(false, -1); channel.socket().setTcpNoDelay(true); channel.connect(new InetSocketAddress(host, port)); // 添加到连接池 connectionPool.put(key, channel); return channel; } // 关闭连接 void closeConnection(String host, int port) { String key = host + ":" + port; SocketChannel channel = connectionPool.remove(key); if (channel != null) { try { channel.close(); } catch (IOException e) { LOG.warn("Error closing connection to " + key, e); } } } }