Title: From Distributed Data Networks to AI Grand Models

Mar 1, 2024

1. Introduction

Data bottlenecks faced by traditional AI models

With the continuous development of artificial intelligence technology, AI techniques such as deep learning are being applied in a growing number of fields. However, optimizing and upgrading AI models depends on massive amounts of data. Acquiring and storing that data has traditionally been a bottleneck in AI development: most open data sets are limited in size, and internal enterprise data is difficult to share and use because of security and privacy concerns. In addition, enterprises face personnel turnover and changes in decision-making as they develop, such as the high-level “infighting” at OpenAI some time ago. All of these can create serious difficulties for the development of AI models.

MEMO decentralized data network

MEMO is a DePIN (decentralized physical infrastructure network) for decentralized file storage, built on the self-developed MEFS Distributed Data protocol. It uses a DHT (distributed hash table) for decentralized peer-to-peer location and lookup. Files are split into fragments and organized with a Merkle tree, then stored across the network using a redundancy scheme that combines multiple replicas with erasure coding. Unlike other storage systems such as IPFS, MEMO directly supports high-performance file system interfaces and provides security features such as encryption. Users can securely store and access files via the MEMO network, and permanent storage is supported.

For AI models, data stored in MEMO is fragmented, encrypted, and dispersed to multiple nodes in the network, which addresses the single point of failure, storage cost, censorship resistance, and data privacy problems of centralized storage.
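To make the Merkle tree step more concrete, here is a minimal Python sketch of how a file might be split into fragments and summarized by a single root hash. The chunk size, the choice of SHA-256, and the function names `split_chunks` and `merkle_root` are illustrative assumptions, not details of the MEFS protocol.

```python
import hashlib
from typing import List


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def split_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut a file into fixed-size fragments (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]


def merkle_root(chunks: List[bytes]) -> bytes:
    """Hash every fragment, then hash pairs upward until one root remains."""
    level = [sha256(c) for c in chunks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


data = b"training sample " * 64          # 1024 bytes of stand-in file content
chunks = split_chunks(data, 256)          # hypothetical 256-byte fragments
root = merkle_root(chunks)

# Changing a single byte in one fragment changes the root.
tampered = chunks.copy()
tampered[1] = b"x" + tampered[1][1:]
print(root.hex())
print(merkle_root(tampered) != root)
```

Because any change to a fragment changes the root, a client that keeps only the root hash can detect tampering by recomputing it over the fragments it gets back.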

2. Data sources and challenges of AI models

The foundation of an AI model is its data network, followed by algorithms and, finally, computing power. A large data network provides the reserve of knowledge that models learn from. Training data for AI models mainly comes from three sources:

First, open data sets, which mainly come from government agencies, professional data institutions, and similar bodies. This kind of data is limited in type and narrow in content (image data sets, for example, tend to cover a single kind of content), so it offers little support for training vertical industry models. Some industry expertise is difficult to obtain through open data sets, and they provide poor support for knowledge in the multi-source encryption field. In addition, format standards for open data sets differ across fields, so additional development is needed to convert them.

Second, the massive structured data accumulated within enterprises, such as transaction data, user behavior data, and other highly regular data. This data is difficult to disclose and share, mainly because of privacy concerns. More importantly, different enterprises rarely use the same data formats and standards, so additional conversion costs are required.

Third, user-generated content (UGC) such as pictures, voice, and text. This data is abundant but difficult to collect and integrate: collection channels are limited and struggle to cover users globally, and data format standards differ between UGC platforms, so compatibility is poor. In addition, image UGC raises copyright issues that make it difficult to fully integrate and use, and natural language UGC cannot be semantically labeled automatically.

3. The problems of traditional storage methods

At present, AI model training data is mostly stored on enterprises' internal servers. Given the huge volume of data, this creates a series of problems:

The first is the risk of failure. Centralized storage keeps a large amount of data on the enterprise's internal servers. Once a server fails or goes down, data cannot be accessed or recovered in time, and large amounts of it may even be lost, which seriously affects normal business operations.

Second, private data is stored in a single place. If it is not encrypted or protected by effective security measures, it is exposed to attack by hackers or leaks by insiders. For example, an enterprise's business secrets and its customers' private information may be lost or abused after a data leak, causing serious economic loss and reputational damage to enterprises and individuals.

Third, because data is stored on different servers, unified data formats and standards are lacking, and complex aggregation and transformation are often required before data can be managed and analyzed in a unified way. As a result, it is difficult for institutions to share and cross-utilize data sets or to exchange data and cooperate across industries and institutions, which limits the overall value that can be drawn from the data.

4. The solution of the MEMO distributed data network

From the start of the project, MEMO has treated data security as its design principle and has built a series of blockchain-based data security technologies. Today, MEMO has built up some strength in secure storage. Its solutions to the data problems of AI models fall mainly into the following areas:

1. Develop a unified ecological standard

The MEMO distributed storage system provides a unified data protocol and ecological standards. During storage, it automatically slices training data and distributes the slices according to hash rules, and it supports multiple formats. This greatly improves data compatibility, removes the need for additional conversion when data is called, and effectively lowers the barriers to cross-utilization. In addition, the consensus and traceability mechanisms of blockchain help guarantee the authenticity of training data, filter out false information, and keep data up to date.
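One common way to distribute slices "according to hash rules" is rendezvous (highest-random-weight) hashing, where every node gets a score for each slice and the top-scoring nodes store it. The sketch below is only an assumption about what such a rule could look like, not MEMO's actual placement algorithm; the node names, the replica count of 3, and the helper names `score` and `place_chunk` are hypothetical.

```python
import hashlib
from typing import List


def score(node_id: str, chunk_hash: str) -> int:
    """Deterministic pseudo-random weight for a (node, slice) pair."""
    digest = hashlib.sha256(f"{node_id}:{chunk_hash}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place_chunk(chunk: bytes, nodes: List[str], replicas: int = 3) -> List[str]:
    """Return the nodes that should hold this slice, best score first."""
    chunk_hash = hashlib.sha256(chunk).hexdigest()
    ranked = sorted(nodes, key=lambda n: score(n, chunk_hash), reverse=True)
    return ranked[:replicas]


nodes = [f"node-{i}" for i in range(8)]   # hypothetical storage nodes
chunk = b"slice 0 of a training file"
holders = place_chunk(chunk, nodes)
print(holders)
```

Any client that knows the node list can recompute the same placement without a central index, and removing a node that does not hold a slice leaves that slice's placement unchanged.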

2. Data encryption and copyright authentication

MEMO supports file-level encryption and protects data both in transit and at rest to solve the problem of privacy leakage. With encryption in place, enterprise data stays protected while it is shared, and users can rely on the encryption system for copyright protection to safeguard their interests.

3. Ensure data integrity

MEMO scatters data across hundreds of nodes, each of which stores part of the data. The storage architecture of MEMO is highly fault-tolerant. Even if some nodes fail or go offline, the system can still maintain the integrity and availability of training data, effectively removing the risk of a single point of failure.
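The idea of surviving node failures can be illustrated with the simplest possible erasure code: k data shards plus one XOR parity shard, which tolerates the loss of any single shard. Production systems typically use stronger codes such as Reed-Solomon that survive several losses; the sketch below is a stand-in for illustration, and the function names `encode` and `recover` are hypothetical rather than part of MEMO.

```python
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal shards (zero-padded) and append one XOR parity shard."""
    shard_len = -(-len(data) // k)                 # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = bytes(shard_len)
    for s in shards:
        parity = xor_bytes(parity, s)
    return shards + [parity]


def recover(shards: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing shard (marked None) from the survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost shard")
    if missing:
        present = [s for s in shards if s is not None]
        rebuilt = bytes(len(present[0]))
        for s in present:
            rebuilt = xor_bytes(rebuilt, s)
        shards = list(shards)
        shards[missing[0]] = rebuilt
    return shards


data = b"MEMO stores training data"
shards = encode(data, k=4)          # 4 data shards + 1 parity shard

damaged = list(shards)
damaged[2] = None                   # simulate one node going offline
repaired = recover(damaged)
restored = b"".join(repaired[:4])[:len(data)]
print(restored == data)
```

The XOR of all surviving shards equals the lost one, so the original file is rebuilt without the offline node.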

4. Improve data rights and interests of all parties

MEMO transmits data through an end-to-end P2P mechanism, which enables data sharing between different institutions, and it establishes a corresponding incentive mechanism to encourage individuals and enterprises to participate. Individuals and enterprises can earn financial returns by sharing their data or providing storage space. This end-to-end data-sharing mechanism breaks down traditional data barriers and enables the free flow of data and the interconnection of value.

The sections above introduced the data network, the basis of AI model training. MEMO also has certain advantages in the algorithms and computing power that support AI model operation:

Algorithm: The algorithm mode of MEMO can provide a more secure, trusted, and self-controlled computing environment for the AI model, and provide an encryption guarantee for the AI model. In terms of model parameters, MEMO has a built-in security guardrail to effectively prevent system abuse or malicious operation. The AI model can interact with the algorithms in MEMO, such as using smart contracts to perform tasks, verify data, and execute decisions.

Computing power: The decentralized computing resources of MEMO can provide the AI model with high-performance computing power. The AI model can use the distributed computing resources in MEMO for AI training, data analysis, and prediction. By distributing computing tasks to multiple nodes on the network, AI training can speed up calculations and process larger amounts of data.
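As a rough illustration of distributing a training task across nodes, the following Python sketch performs data-parallel gradient descent on a toy one-parameter model. Each "node" computes a gradient on its own data shard and the results are averaged. Local threads stand in for network nodes here, and the model, learning rate, and function names are all assumptions for illustration; nothing in this sketch reflects MEMO's actual compute interface.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_gradient(shard, w):
    """Sum of d/dw (w*x - y)^2 over one shard, plus the shard size."""
    grad = sum(2 * (w * x - y) * x for x, y in shard)
    return grad, len(shard)


def distributed_step(shards, w, lr=0.01):
    """One gradient-descent step where each shard is processed by its own worker."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        results = list(pool.map(lambda s: partial_gradient(s, w), shards))
    total_grad = sum(g for g, _ in results)
    total_n = sum(n for _, n in results)
    return w - lr * total_grad / total_n


# Toy data generated from y = 3x, split across four hypothetical nodes.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = distributed_step(shards, w)
print(round(w, 6))
```

Each worker only ever sees its own shard, and only the small gradient values are combined, which is the basic pattern behind scaling training to larger data sets across many machines.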

5. MEMO helps develop typical applications of AI

MEMO not only provides an efficient and secure data storage and management solution but also opens up new possibilities for the development of AI. With the help of MEMO, typical applications of AI can become a reality. From large pre-trained model libraries, to deep learning model development in industry verticals, to user participation in the AI community, MEMO has great potential to promote the development of AI.

1. Construction of a pre-trained large model library

With its massive storage space, MEMO can be used to build a publicly searchable multi-source hybrid training library. This means that various types of pre-trained models can be stored on the MEMO network and made available through external query services. Such a model library would become a valuable resource in the AI field, offering developers a wide range of models to choose from and support for their projects.

2. Deep learning model development for industry verticals

Industries can share high-quality data produced in specialized fields through MEMO to develop deep learning models for industry verticals. MEMO's data storage and sharing services make inter-industry data exchange more convenient, so that different industries can share data resources and advance the development and application of such models.

3. Users participate in the AI community

MEMO empowers users to jointly generate and utilize AI content to realize Internet productization. Through the MEMO network, users can participate in the creation, training, and optimization of AI models, and jointly build an AI community. By sharing their data or participating in the training and testing of the model, users can get financial returns or other forms of incentives, thus promoting the popularization and development of AI technology.

6. Future direction of decentralized data networks

As AI is becoming increasingly popular, MEMO, as a major player in distributed data networks, will focus on the research and development of distributed AI large models, and promote the construction of device learning algorithms and frameworks.

The distributed storage network will accelerate the creation of industry data standards, integrate multiple mixed models, establish a more complete and open infrastructure, build a distributed AI service platform, and promote the wide application of AI technology in various industries and fields.

At the same time, with the gradual increase in users’ participation in co-building applications, the AI industry will usher in a wave of change. Users will become an important part of the AI industry ecosystem, and their participation will promote the innovation and development of AI technology, form a complete ecology from distributed data networks to AI large models, promote the rapid growth of the digital economy, and build a more prosperous economic ecology for the AI industry.

Written by Memo Labs