[Technical Interview] HDFS

소소한 프로그래머 2022. 11. 1. 21:16

Summary

  • HDFS is a scalable distributed file system for large, distributed data-intensive applications.
  • HDFS uses commodity hardware to reduce infrastructure costs.
  • HDFS provides APIs for the usual file operations: create, delete, open, close, read, and write (see the client sketch after this list).
  • Random writes are not possible; writes are always made at the end of the file in an append-only fashion.
  • HDFS does not support multiple concurrent writers for the same file; it follows a single-writer, multiple-reader model.
  • An HDFS cluster consists of a single NameNode and multiple DataNodes and is accessed by multiple clients.
  • Block: Files are broken into fixed-size blocks (default 128MB), and blocks are replicated across a number of DataNodes to ensure fault-tolerance. The block size and the replication factor are configurable.
  • DataNodes store blocks on local disk as Linux files.
  • The NameNode is the coordinator of an HDFS cluster and is responsible for keeping track of all filesystem metadata.
  • The NameNode keeps all metadata in memory for fast operations. For fault tolerance, in the event of a NameNode crash, every metadata change is also written to disk in an EditLog. This EditLog can additionally be replicated to a remote filesystem (e.g., NFS) or to a secondary NameNode.
  • The NameNode does not keep a persistent record of which DataNodes hold a replica of a given block. Instead, it asks each DataNode which blocks it holds at NameNode startup and whenever a DataNode joins the cluster.
  • FsImage: The NameNode state is periodically serialized to disk and then replicated, so that on recovery, a NameNode may load the checkpoint into memory, replay any subsequent operations from the edit log, and be available again very quickly.
  • HeartBeat: Each DataNode sends periodic HeartBeat messages to the NameNode; the NameNode uses the replies to pass instructions to DataNodes and to collect their state.
  • Client: User applications interact with HDFS through its client. HDFS Client interacts with NameNode for metadata, but all data transfers happen directly between the client and DataNodes.
  • Data Integrity: Each DataNode uses checksumming to detect the corruption of stored data.
  • Garbage Collection: A deleted file is not reclaimed immediately; it is first renamed to a hidden trash location and garbage collected later.
  • Consistency: HDFS is a strongly consistent file system. Each data block is replicated to multiple nodes, and a write is declared to be successful only after all the replicas have been written successfully.
  • Cache: For frequently accessed files, the blocks may be explicitly cached in the DataNode’s memory, in an off-heap block cache.
  • Erasure coding: HDFS supports erasure coding as an alternative to replication, cutting the storage overhead (e.g., roughly 1.5x instead of the 3x of triple replication) while preserving fault tolerance.
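
The file operations above map directly onto Hadoop's Java FileSystem API. Below is a minimal client sketch, assuming a reachable NameNode at hdfs://namenode:8020 (a hypothetical address) and the hadoop-client library on the classpath; the per-file replication factor and block size passed to create() are illustrative, not required.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address. The client contacts the NameNode
        // only for metadata; data bytes flow directly to/from DataNodes.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path path = new Path("/tmp/example.log");

        // create(): block size and replication factor are configurable per file.
        try (FSDataOutputStream out = fs.create(
                path, true /* overwrite */, 4096 /* buffer */,
                (short) 3 /* replication */, 128L * 1024 * 1024 /* block size */)) {
            out.write("first record\n".getBytes(StandardCharsets.UTF_8));
        }

        // Random writes are not possible; append() always writes at the end.
        try (FSDataOutputStream out = fs.append(path)) {
            out.write("second record\n".getBytes(StandardCharsets.UTF_8));
        }

        // open()/read(): the client fetches block locations from the NameNode,
        // then streams the bytes from a DataNode replica.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[4096];
            int n = in.read(buf);
            System.out.print(new String(buf, 0, n, StandardCharsets.UTF_8));
        }

        fs.delete(path, false /* recursive */);
        fs.close();
    }
}
```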

System design patterns

Here is a summary of system design patterns used in HDFS.

  • Write-Ahead Log: For fault tolerance, in the event of a NameNode crash, all metadata changes are written to disk in an EditLog, which is a write-ahead log (see the sketch after this list).
  • HeartBeat: The HDFS NameNode uses the periodic HeartBeat exchange with each DataNode to give it instructions and collect its state.
  • Split-Brain: ZooKeeper is used to ensure that only one NameNode is active at any time. Fencing puts a fence around the previously active NameNode so that it cannot access cluster resources and therefore stops serving read/write requests.
  • Checksum: Each DataNode uses checksumming to detect corruption of stored data (a minimal checksum sketch also follows this list).
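
To make the write-ahead-log pattern concrete, here is a minimal sketch; it is not the NameNode's actual EditLog code, and the class and file names are hypothetical. The key invariant is that every mutation is appended and fsynced to the log before the in-memory state is touched, so a crash can always be repaired by replaying the log.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class EditLogSketch {
    private final Map<String, String> metadata = new HashMap<>(); // in-memory namespace
    private final FileOutputStream log;

    EditLogSketch(String logPath) throws IOException {
        this.log = new FileOutputStream(logPath, true /* append */);
    }

    void put(String key, String value) throws IOException {
        // 1. Append the operation to the log and force it to disk.
        log.write(("PUT " + key + " " + value + "\n").getBytes(StandardCharsets.UTF_8));
        log.getFD().sync(); // durable before the change is acknowledged
        // 2. Only then apply the change in memory.
        metadata.put(key, value);
    }

    public static void main(String[] args) throws IOException {
        EditLogSketch ns = new EditLogSketch("editlog.txt");
        ns.put("/tmp/example.log", "blocks=[blk_1, blk_2]");
        // On restart, the NameNode would rebuild its in-memory state by
        // loading the last FsImage checkpoint and replaying this log.
    }
}
```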
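
And a sketch of the checksum pattern: HDFS checksums each 512-byte chunk of a block (the dfs.bytes-per-checksum default), stores the CRCs alongside the block, and re-verifies them on every read. The snippet below imitates this with Java's built-in CRC32C (Java 9+), which matches HDFS's default checksum algorithm; the class and method names are illustrative.

```java
import java.util.Random;
import java.util.zip.CRC32C;

public class BlockChecksumSketch {
    static final int BYTES_PER_CHECKSUM = 512; // HDFS default chunk size

    // Compute one CRC per 512-byte chunk, as a DataNode does when storing a block.
    static long[] computeChecksums(byte[] block) {
        int chunks = (block.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] crcs = new long[chunks];
        CRC32C crc = new CRC32C();
        for (int i = 0; i < chunks; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, block.length - off);
            crc.reset();
            crc.update(block, off, len);
            crcs[i] = crc.getValue();
        }
        return crcs;
    }

    // Returns the index of the first corrupted chunk, or -1 if the data is intact.
    static int verify(byte[] block, long[] storedCrcs) {
        long[] fresh = computeChecksums(block);
        for (int i = 0; i < fresh.length; i++) {
            if (fresh[i] != storedCrcs[i]) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] block = new byte[1300];        // toy "block" spanning 3 chunks
        new Random(42).nextBytes(block);
        long[] stored = computeChecksums(block);

        block[700] ^= 0x1;                    // flip one bit (silent corruption)
        System.out.println("corrupted chunk: " + verify(block, stored)); // -> 1
    }
}
```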
