Memo — Layered Decentralized Data Storage Infrastructure
1. Selfish data
2. The world is layered
3. Memo technical solutions
3.1 Proof of Storage
3.2 Random Probability Check
3.3 RAFI and data recovery
3.4 Order Pool Payment
It has been ten years since the invention of the blockchain, yet its development still faces many constraints. Although BTC and ETH have enabled value storage and the issuance and settlement of assets, the decentralized world still lacks the necessary infrastructure. Without it, today's blockchain applications remain a beautiful garden in the sky, confined to a small circle.
What MemoLabs wants to solve is the storage problem in decentralized infrastructure. Although storage systems such as Filecoin, Arweave, and Storj already exist, Memo aims to bring a new approach to decentralized storage with its own architecture and design. Memo adopts a layered design: it uses an efficient, verifiable Proof of Storage to ensure that data is stored reliably, and it eases the burden of on-chain settlement by pooling all orders together.
1. Selfish data
Evolution is the most primitive means of producing information. Genetic information competes across populations, each strand striving to outlast the others. Before it dies out, the carrier of old information reproduces through self-replication, and new information is generated in the process.
Data storage shares some similarities with gene replication. Only valuable data will persist into the distant future, just as only populations that adapt to environmental changes survive.
To persist, the selfish gene has to replicate. To avoid a single point of failure, a population needs to maintain a sufficient number of individuals, and the population that carries the genes must keep reproducing to add new individuals and improve the redundancy of gene storage. Each species has a limited lifespan, so selfish genes can survive only through continuous reproduction. This does not mean that all genes survive for hundreds of millions of years. Populations that cannot adapt to environmental changes die out, and their genes go extinct. Because the capacity of the entire ecosystem is limited, this cruel extinction can also be viewed, in a sense, as a random assessment of the value of genes.
Memo treats data the way evolution treats genes. Data storage also needs sufficient redundancy: the higher the redundancy, the more secure the storage. Storage hardware is the information carrier; it has a limited lifespan and will eventually fail. Memo first encodes data and disperses it across nodes for storage, then recovers lost data pieces so that the redundancy stays stable.
Storage nodes, the ‘individuals’ of a ‘population’, are incentivized by storage fees to keep data. Memo aims to disperse its nodes as widely as possible around the world, because a small population is highly likely to go extinct.
Data exists at a cost. Before an order expires, data that someone deems valuable can be renewed for a longer period by re-establishing its redundancy. This is similar to the reproduction of a population.
2. The world is layered
Before launching Filecoin, Protocol Labs was already known for developing widely used libraries such as Libp2p and IPFS. Filecoin was then launched to solve the incentive problem of IPFS storage. However, Filecoin is rarely used for actual storage and has been reduced to a pure mining bubble.
In its protocol design, Filecoin introduced Proof-of-Replication and Proof-of-Spacetime and bound data storage to the consensus security of the entire Filecoin blockchain network. This enabled the Filecoin network to grow rapidly in its early stage, but it later became the bottleneck of Filecoin’s development.
Filecoin uses a simple order to describe the matching of supply and demand for file storage. Regardless of data size, unless orders are actively aggregated, each one is generated on chain with a set price and storage duration. Filecoin does not replicate data on its own; to store multiple copies, the user has to find several storage nodes and sign an order with each. If a copy is lost, no node recovers the data, and only the node that lost it is penalized.
Although this design solves some problems, it causes more. Linking Proof of Replication and Proof of Spacetime to block generation forces Filecoin to combine the storage module with the consensus module. On a public chain, asset security is far more important than the availability and security of data storage, so Filecoin has to strengthen consensus security at the expense of storage availability. Although Filecoin is aimed at storage, its storage design is constrained by consensus requirements everywhere.
The high security parameters required for Proof of Replication translate into extremely demanding hardware configurations. The cost of a machine that can join the Filecoin storage network far exceeds that of ordinary storage hardware. Most of the machine cost goes to hardware such as high-performance CPUs and graphics processors rather than to storage hardware such as hard drives. As a result, the actual storage cost of Filecoin is much higher than that of conventional storage options, and only the mining bubble offsets it.
In addition, storing data in the Filecoin network requires an encoding process, and accessing it requires a decoding process. Because of the security parameter settings of Proof of Replication, encoding or decoding takes an hour or two even on a high-performance CPU. Access availability on Filecoin is therefore extremely low.
Although Memo also relies, in some form, on the verification of orders and storage nodes, it does not couple everything together as Filecoin does. Almost all computer systems are layered, and forced coupling complicates a system and creates performance bottlenecks. Memo follows the layered approach throughout its model design.
Memo is divided into a settlement layer, a verification layer, and a storage layer. All orders are settled and aggregated into a pool that drives payment. The verification layer verifies the reliability of the storage layer; it consists of many groups, each responsible for a batch of storage nodes. The nodes of the storage layer store data and submit Proofs of Storage to the verification layer. Meanwhile, the settlement layer collects and aggregates the corresponding revenue so that storage nodes can withdraw it at any time.
This layered design lets the storage layer focus on storage without being coupled to consensus. It is also closer to how real systems are built: the TCP/IP stack is layered, and so is the storage hierarchy of a general-purpose computer, with registers, first-, second-, and third-level caches, memory, and hard drives.
The layered approach ensures the security of high-value order information on the settlement layer. It also allows the verification layer to verify the storage layer without consuming settlement-layer resources. This not only improves the utilization efficiency of the settlement layer but also provides highly available services on the storage layer.
Moreover, once storage is decoupled from consensus, Memo can adopt a simpler and more efficient Proof of Storage. Combined with random probabilistic checks, it can achieve extremely high availability at very low verification cost.
3. Memo technical solutions
On the one hand, a sound economic model is required to incentivize participants to provide services in crypto infrastructure. On the other hand, technical means are needed to prevent participants from misbehaving. The combination of cryptography and economics gives rise to cryptoeconomics, and Memo is designed along cryptoeconomic lines.
3.1 Proof of Storage
A decentralized storage network relies on storage nodes on the Internet to provide services. Storage nodes may delete data, go offline, or suffer hard drive damage. It is necessary to prevent deliberate malicious behavior by storage nodes, detect accidental data loss, and recover data in time. Economic mechanisms and verification schemes are therefore needed to ensure that nodes actually store the data.
Filecoin adopts Proof of Replication and Proof of Spacetime to prevent misbehavior by storage nodes, while Arweave adopts Proof of Access (PoA) to ensure that miners really store the data.
Memo employs a more lightweight and efficient Proof of Storage algorithm, which combines BLS signature aggregation, vector commitments, and other mechanisms. Regardless of data size, a proof needs only a constant communication overhead of a few hundred bytes and a verification overhead of a few milliseconds. This eases the burden of validating data on the verification layer, so the network can support larger-scale data storage.
Regardless of data size, the Proof of Storage algorithm always aggregates proofs into a fixed size, so Memo can support a larger data scale while keeping verification costs low. The algorithm is publicly verifiable, so anyone can verify a Proof of Storage.
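The fixed-size property can be illustrated with a much simpler construction than the one Memo describes. The sketch below is not Memo's algorithm: it swaps in a privately verifiable homomorphic linear authenticator built from HMAC and arithmetic modulo a prime, instead of BLS signatures and vector commitments, so only the holder of the secret (`alpha`, `key`) can verify. All names, the modulus `P`, and the 15-byte block size are assumptions made for illustration.

```python
import hashlib
import hmac

# Illustrative field modulus (a Mersenne prime); not a Memo parameter.
P = 2**127 - 1
BLOCK_SIZE = 15  # bytes, so every block encodes an integer below P


def prf(key, index):
    """Pseudorandom field element bound to a block index (HMAC-SHA256)."""
    digest = hmac.new(key, index.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % P


def split_blocks(data):
    """Split data into fixed-size blocks and encode each one as an integer."""
    return [
        int.from_bytes(data[i:i + BLOCK_SIZE], "big")
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def tag_blocks(alpha, key, blocks):
    """Owner side: one tag per block, tag_i = alpha * m_i + prf(i) mod P."""
    return [(alpha * m + prf(key, i)) % P for i, m in enumerate(blocks)]


def prove(blocks, tags, challenge):
    """Storage node side: fold every challenged block into two field elements.

    challenge is a list of (block index, coefficient) pairs.
    """
    mu = sum(nu * blocks[i] for i, nu in challenge) % P
    sigma = sum(nu * tags[i] for i, nu in challenge) % P
    return mu, sigma


def verify(alpha, key, challenge, proof):
    """Verifier side: accept iff sigma == alpha * mu + sum(nu_i * prf(i))."""
    mu, sigma = proof
    expected = (alpha * mu + sum(nu * prf(key, i) for i, nu in challenge)) % P
    return sigma == expected


if __name__ == "__main__":
    # Fixed demo values; a real deployment would draw them at random.
    alpha, key = 123456789, b"demo-key"
    blocks = split_blocks(b"x" * 1500)
    tags = tag_blocks(alpha, key, blocks)
    challenge = [(3, 7), (42, 11), (99, 5)]
    print(verify(alpha, key, challenge, prove(blocks, tags, challenge)))
```

Whether the challenge covers three blocks or all of them, the proof is always the pair (`mu`, `sigma`), which is the behaviour the paragraph above attributes to Memo's aggregated proofs.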
3.2 Random Probability Check
To reduce the cost of proof generation, verification does not need to challenge every data piece. A random probabilistic check on a subset of the data pieces is enough to ensure, with high probability, the security of all of them.
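Let n be the total number of data pieces, c the number of pieces challenged at one time, and t the number of damaged or deleted pieces. Let X be the number of challenged pieces that fall among the t damaged ones. Then P_X is the probability that at least one challenged piece is a damaged one.
Because the c challenged pieces are drawn without replacement, the following formula holds:
$$P_X = P\{X \ge 1\} = 1 - P\{X = 0\} = 1 - \frac{n-t}{n}\cdot\frac{n-1-t}{n-1}\cdots\frac{n-c+1-t}{n-c+1}$$
and therefore
$$1 - \left(\frac{n-t}{n}\right)^{c} \le P_X \le 1 - \left(\frac{n-c+1-t}{n-c+1}\right)^{c}$$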
According to the formula, if 1% of the data is lost, challenging 460 pieces or 300 pieces detects the loss with a probability P_X of at least 99% or 95%, respectively. To detect a loss of 0.1% of the data with 99.9% probability, only 6905 data pieces need to be challenged.
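These figures can be checked numerically. The sketch below assumes the c challenged pieces are drawn uniformly at random without replacement, so the probability of missing every damaged piece is the product of (n - t - i)/(n - i) for i from 0 to c - 1. It sizes the challenge with the simpler bound 1 - (1 - t/n)^c. The function names are hypothetical.

```python
import math


def detection_probability(n, t, c):
    """Exact probability that c pieces drawn without replacement from n
    include at least one of the t damaged pieces."""
    if c > n - t:
        return 1.0  # more challenges than intact pieces: a hit is certain
    p_miss = 1.0
    for i in range(c):
        p_miss *= (n - t - i) / (n - i)
    return 1.0 - p_miss


def min_challenges(loss_fraction, target):
    """Smallest c for which the bound 1 - (1 - loss_fraction)**c reaches target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - loss_fraction))


if __name__ == "__main__":
    print(min_challenges(0.01, 0.99))    # 1% lost, 99% detection
    print(min_challenges(0.01, 0.95))    # 1% lost, 95% detection
    print(min_challenges(0.001, 0.999))  # 0.1% lost, 99.9% detection
    print(detection_probability(100_000, 1_000, 460))
```

Under this bound the minimum challenge sizes for the 1% case come out at 459 and 299, just under the rounder 460 and 300 quoted above, and the 0.1% case gives 6905. The required number of challenges depends on the lost fraction, not on the total data size n.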
The random probabilistic check is not a one-off. It is repeated, with the challenged pieces changing over time. Even if a data loss escapes detection in one round with a small probability, after the checks performed over a whole life cycle the probability of detecting it approaches 1, which ensures safe data storage at a lower cost.
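The effect of repeated checks can be sketched as follows, assuming each round draws an independent random challenge and detects a given loss with the same probability p. Both the independence assumption and the function names are illustrative, not taken from Memo.

```python
import math


def cumulative_detection(per_round, rounds):
    """Probability that at least one of `rounds` independent checks detects
    the loss, when each check detects it with probability `per_round`."""
    return 1.0 - (1.0 - per_round) ** rounds


def rounds_needed(per_round, target):
    """Smallest number of rounds whose cumulative detection reaches target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - per_round))


if __name__ == "__main__":
    # A weak check that catches a loss only 5% of the time per round
    # still becomes very reliable when it is repeated.
    for k in (1, 10, 100, 200):
        print(k, cumulative_detection(0.05, k))
```

The escape probability shrinks geometrically with the number of rounds, which is why small, cheap challenges per round can still give strong guarantees over the lifetime of an order.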
3.3 RAFI and data recovery
Filecoin's penalty mechanism only constrains the online status of storage nodes: they pledge FIL tokens before storing sealed data, and the pledge is deducted if the node generates incorrect data. Filecoin has no data recovery mechanism to maintain the redundancy of data copies after a loss. Yet even without deliberate misbehavior by storage nodes, storage hardware can fail. Unless redundancy is restored autonomously, this increases costs.
Memo natively supports two redundancy mechanisms: multiple copies and erasure coding. If a storage node fails to submit a Proof of Storage for longer than the failure confirmation time, other nodes can recover the data from the remaining redundancy and earn the subsequent revenue of the storage order.
Data reliability calls for fast recovery, and hence a short failure confirmation time. On the other hand, a longer confirmation time avoids misjudging temporary outages and reduces network traffic. Earlier recovery mechanisms settle for a compromise between the two.
Erasure coding carries a large recovery penalty: multiple data blocks have to be transmitted to recover a single block, which multiplies the recovery cost compared with the multi-copy approach. With larger erasure code parameters, such as 28–52 erasure coding, recovering one data block requires reading 28 data blocks. The penalty is even heavier for decentralized storage over a wide-area network (WAN).
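The size of the recovery penalty can be made concrete with a small calculation. The sketch assumes that "28–52" means 28 data blocks plus 52 parity blocks of an MDS code such as Reed-Solomon, where any 28 surviving blocks suffice to rebuild a missing one. The 64 MiB block size and the function names are hypothetical.

```python
def replication_recovery_traffic(block_size):
    """Multi-copy: a lost block is rebuilt by copying one surviving replica."""
    return block_size


def erasure_recovery_traffic(block_size, data_blocks):
    """Erasure coding: rebuilding one block requires reading `data_blocks`
    surviving blocks of the same size."""
    return data_blocks * block_size


if __name__ == "__main__":
    block_mib = 64  # hypothetical block size in MiB
    copy_cost = replication_recovery_traffic(block_mib)
    erasure_cost = erasure_recovery_traffic(block_mib, 28)
    print(copy_cost, erasure_cost, erasure_cost // copy_cost)
```

Under these assumptions, repairing a single block moves 28 times as much data as the multi-copy case, which is the cost a long failure confirmation time is meant to avoid paying unnecessarily.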
By comparison, Memo adopts RAFI, a risk-aware failure confirmation mechanism, to reduce the recovery penalty. RAFI dynamically adjusts the failure confirmation time according to the risk of data loss. For low-risk data, the failure confirmation time is greatly extended to reduce network traffic, while for high-risk data it is shortened to improve reliability.
For example, with multi-copy data, losing one copy out of five is safer than losing two. With erasure coding such as 28–52, the security level is still very high when only one data block is missing, so the failure confirmation time can be very long. If 30 data blocks are missing, the failure confirmation time needs to be shortened for fast recovery and better reliability.
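One way to picture a risk-aware confirmation time is a function that shrinks as more redundancy is already missing. The text does not give RAFI's actual rule, so the geometric interpolation, the one-day and one-minute bounds, and the function name below are all assumptions chosen only to reproduce the qualitative behaviour of the examples.

```python
def confirmation_timeout(missing, tolerance, longest=24 * 3600.0, shortest=60.0):
    """Hypothetical failure confirmation time in seconds.

    missing   -- blocks or copies already missing or suspected missing
    tolerance -- how many can be lost before the data is unrecoverable
                 (4 for five copies, 52 for 28 data + 52 parity blocks)
    The timeout moves geometrically from `longest` (no risk) to
    `shortest` (one more loss would destroy the data).
    """
    if tolerance <= 0 or not 0 <= missing <= tolerance:
        raise ValueError("missing must lie between 0 and tolerance")
    risk = missing / tolerance
    return longest * (shortest / longest) ** risk


if __name__ == "__main__":
    print(confirmation_timeout(1, 4), confirmation_timeout(2, 4))
    print(confirmation_timeout(1, 52), confirmation_timeout(30, 52))
```

With these assumed bounds, one missing block out of a tolerance of 52 leaves the timeout close to a full day, while 30 missing blocks cut it to a small fraction of that.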
RAFI takes risk into account during data recovery. In this way, Memo can greatly reduce the network traffic consumed by data recovery while improving data reliability.
3.4 Order Pool Payment
In the Filecoin network, an order is signed between a single user and a single miner, so the order processing capacity of the whole network is limited by the throughput of the chain. Memo uses an order pool design that aggregates many orders on the settlement layer, together with the respective orders of each user and each storage node, which greatly increases the order processing capacity of the settlement layer.
In Memo, detailed orders are stored on the verification layer and aggregated on the settlement layer. For each storage node, the orders it signs with all users are aggregated into a single order. A weighted storage unit price is calculated from the storage size and unit price of every order, and the storage revenue for each time interval is calculated from this weighted unit price.
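The aggregation can be sketched as a size-weighted average. The sketch assumes each order carries an integer size and an integer unit price per unit of size and time, and that the weighted unit price is the sum of size times price divided by the total size. This weighting and the function names are an interpretation of the description, not a confirmed formula.

```python
from fractions import Fraction


def aggregate_orders(orders):
    """Collapse (size, unit_price) orders into (total_size, weighted_unit_price).

    Sizes and prices are expected to be integers so the result stays exact.
    """
    orders = list(orders)
    total_size = sum(size for size, _ in orders)
    if total_size == 0:
        raise ValueError("cannot aggregate orders with zero total size")
    weighted = Fraction(sum(size * price for size, price in orders), total_size)
    return total_size, weighted


def revenue(total_size, weighted_price, duration):
    """Revenue of the aggregated order over `duration` time units."""
    return total_size * weighted_price * duration


if __name__ == "__main__":
    # Two hypothetical orders: 100 units at price 2 and 300 units at price 4.
    total, price = aggregate_orders([(100, 2), (300, 4)])
    print(total, price, revenue(total, price, 10))
```

With this weighting, the revenue of the single aggregated order equals the sum of the revenues of the individual orders, so the settlement layer can track one record per storage node without changing what the node earns.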
Every time a user signs an order with a storage node, the order amount is first transferred to that storage node's dedicated funding pool. The storage node then submits proofs to the verification layer. Once they are verified, the storage node receives the invoice for that period, and the revenue can be settled from the funding pool on the settlement layer. Because the funding pools of different storage nodes are separated from each other, risk is contained.
All revenue earned by a storage node is aggregated into a single withdrawal order that collects it from the funding pool. The resources consumed on the settlement layer to pay storage nodes therefore depend only on the number of storage nodes involved. Cumulative withdrawal is also available to save costs: revenue accrued over many cycles can be collected in one withdrawal, further easing the burden on the settlement layer.
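The payment flow can be sketched as a per-node funding pool with cumulative withdrawal. The class, its method names, and the integer amounts are hypothetical. The sketch only models the bookkeeping described above, not an on-chain contract.

```python
class FundingPool:
    """Hypothetical per-storage-node pool on the settlement layer."""

    def __init__(self):
        self.balance = 0      # funds deposited by users when orders are signed
        self.accrued = 0      # verified revenue not yet withdrawn
        self.withdrawals = 0  # settlement-layer operations spent on payouts

    def deposit(self, amount):
        """A user signs an order: the order amount enters this node's pool."""
        self.balance += amount

    def credit(self, invoice_amount):
        """The verification layer accepted a proof: record the period's invoice."""
        if self.accrued + invoice_amount > self.balance:
            raise ValueError("invoice exceeds the funds held in the pool")
        self.accrued += invoice_amount

    def withdraw(self):
        """Pay out everything accrued so far in a single operation."""
        if self.accrued == 0:
            return 0
        paid = self.accrued
        self.balance -= paid
        self.accrued = 0
        self.withdrawals += 1
        return paid


if __name__ == "__main__":
    pool = FundingPool()
    pool.deposit(100)  # order from a first user
    pool.deposit(50)   # order from a second user
    for _ in range(3):
        pool.credit(10)  # three verified periods
    print(pool.withdraw(), pool.balance, pool.withdrawals)
```

However many orders and verified periods have accumulated, the node is paid with one withdrawal, which is why the settlement cost scales with the number of storage nodes rather than the number of orders.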