Repair-Efficient Codes for Fault-Tolerant Distributed Storage

Rateless coding protocol to reduce complexity, repair inefficiencies and enhance reliability

Computing storage
Source: https://stock.adobe.com/uk/302854234

Background

As the information world experiences a data explosion, a strong demand for massive‑scale fault‑tolerant storage systems is being created. Data loss prevention is a critical requirement across sectors. Such data loss could occur through different means including but not limited to unintentional actions, malicious activity such as virus and malware attacks, mechanical damage and even natural disasters. 

In order to guarantee data reliability, most enterprises create 3 distinct copies of all data to be stored and distribute these over different storage servers. This amounts to a 200% overhead in storage drives alone, not to mention additional management, resulting in a very high cost of ownership. 

Erasure codes have more efficient disk-space utilization, however, many enterprises refrain from deploying these techniques because they require CPU-intensive computations that introduce latency during read/writes. They also need to contact several drives to repair a single failed drive, causing input/output bottlenecks. 

Maximum Distance Separable (MDS) codes are better at balancing reliability with storage cost but introduce complexity problems and repair inefficiency. MDS codes also incur high computational costs that reduce their feasibility for deployment in production systems.

Technology Overview

This technology solves multiple problems in distributed storage systems by providing systematic encoding, reduced repair locality, reduced encoding/decoding complexity and enhanced reliability compared to conventional approaches. The method is also particularly suitable for the distributed storage of large files to avoid drive capacity underutilization.

The underlying technology is an erasure code construction for distributed storage based on the theory of rateless codes. The design utilizes features such as sparse encoding and low-complexity decoding to solve the repair problem and implementation complexity issues. It achieves encoding and decoding times that scale linearly or logarithmically with the data length, and offers systematic encoding which delivers fast retrieval times in the absence of failures. 

The method formulates a multi-objective optimization problem considering the average code symbol repair locality, average edge complexity, and probability of decoding failure simultaneously.

Benefits

When implemented, this technology provides superior distributed storage solutions with significant advantages, including:

  • Attractive low repair I/O cost;
  • Reduced encoding/decoding complexity;
  • Computationally efficient implementation resulting from encoding/decoding times that scale linearly or logarithmically with the data length;
  • Systematic encoding which implies fast retrieval times in the absence of failures; and
  • Achieves a data loss probability of 10‑5 with only 10‑20% higher storage overhead than existing non‑replicating techniques. 

Applications

With low locality and systematic form, the code construction is a superior candidate for production systems. To maximize performance and financial benefits, the code could be used for large file storage to avoid drive capacity underutilization, and/or deploying the code in a tiered service structure to take advantage of the trade-off between storage overhead and different levels of reliability. 

Opportunity

Queen’s University is seeking companies interested in licensing, implementing and/or commercializing this technology. 

Patents

IP Status

  • Patented
  • Patent application submitted

Seeking

  • Licensing
  • Commercial partner

Posted

November 4, 2022