#44 - Proving Data Integrity: A Comparative Analysis
Exploring Architecture and Data Verification on Tagion, Filecoin, and Celestia
Stanford Blockchain Review
Volume 5, Article No. 4
📚Author: W-SOURCE Research
🌟Technical Prerequisite: Advanced
Introduction
Since their inception, blockchains and cryptocurrencies have aimed to transform the financial landscape by offering broader access and removing intermediaries. Over time, the evolution of web3 has expanded their use cases, underscoring the potential of blockchain technology to create an internet where creators thrive and users maintain control over their data.
A key piece in enabling an infrastructure empowering end-users while maintaining decentralization is ensuring that data is stored in a resilient, censorship-resistant database. While convenient and familiar, centralized databases fail to provide the necessary security guarantees and require permission from the database owner, restricting global reach.
Distributed data storage systems satisfy the need for fault-tolerant, highly resilient storage by building on a network of nodes that stores, manages, and shares data. Eliminating the need for a central authority in addition to distributing data in a p2p fashion enhances security and transparency. Typically built on blockchain or similar tech, distributed storage systems tend to replicate data for redundancy and availability.
While distributed storage systems offer benefits such as security, data resilience, and potentially increased cost-effectiveness by relying on unused storage capacity, they face challenges regarding regulations and interoperability.
For a truly open and accessible web, distributed storage is indispensable. A crucial aspect of any storage system, then, is the question of how data is provably stored and maintained. The more important question for users is, how is data provably stored and maintained? This is addressed through data proofs.
In general, we differentiate between two types of proofs:
Deterministic Proofs: proofs that allow data holders to create a proof for a specific piece of data which demonstrates its creation, and matches the hash of the data. With this type of proof, only the hash generated using the data as input is publicly available.
Probabilistic Proofs: rely on probabilities to demonstrate a high likelihood that the underlying data is available. It's proof that rationally demonstrates a degree of certainty for a specific assumption in regard to data that has been published and is available for retrieval if necessary.
The rest of this article will discuss three different design choices for proving data storage and integrity in different systems. The first is Tagion, a DLT with a focus on high-volume and scalable data. Then, we will discuss how the decentralized storage network Filecoin ensures data is stored at scale. Finally, this article will look at Celestia, a blockchain solely focused on storing and making blockchain data available.
Tagion
Architecture
Tagion is a decentralized network for high-volume transactions on a mission to build a unique monetary system based on technology and democratic governance. The project relies on innovative database architecture and cryptography to reach a massive scale. It's not a blockchain but a distributed ledger that optimizes storage using a DART database. Tagion's proving mechanism is an example of a deterministic proof .
The DART database, at its core, functions as a distributed hash table where data is stored according to hash-key pairs. This structure organically grows more branches as more information is stored, supporting up to 256 combined archives and subbranches per branch [1].
In addition to resembling a distributed hash table, Tagion’s infrastructure can also be understood as a sparse Merkle tree (SMT). SMTs are authenticated data structures that operate on key-value pairs, supporting standard database operations like lookup, insertion, updates, and deletions. Each key-value pair represents a leaf, with the hash of the parent branch derived by recursively hashing child nodes up to the Merkle root.
SMTs significantly enhance efficiency by enabling a prover to verify the presence of an element without having to access unrelated data elements or download the specific data piece. Moreover, the independence of values in the tree allows updates in any order without altering the final structure of the tree.
Tagion’s system utilizes the root hash containing all sub-branch hashes to quickly verify data states with minimal computation. To further enhance processing capabilities, the system can create sub-DARTs for specific ecosystems, similar to sharded blockchains. These designated nodes manage subsets of data, increasing throughput and allowing the network to be tailored for various applications akin to app chains.
Using DART creates a stateless system, removing the need to maintain a complete history of the system's transitions. This means that data can be deleted, resulting in lower storage requirements overall and potentially increasing decentralization of the system by being lightweight.
Tagion employs HiBON (Hash Invariant Binary Notation Object) to facilitate the storage process further, which guarantees that data remains hash invariant upon entry, streamlining its retrieval based on the associated hash. Hash Invariant simply refers to the property that data will always create the same hash regardless of the order in which it is processed. It's a proven technology used to speed up retrieval and writing of data to a database [2].
With these mechanisms, Tagion not only securely stores data but also efficiently verifies its inclusion and integrity within the network.
Data Integrity
All subsystems in Tagion run so-called random walks, checking if data is stored and made available as necessary. Nodes that fail the challenge to verify retention will be excluded from the network.
All archives contain timestamps requiring payments to extend storage duration. On its walks, the system checks for payments and, if none has been received, will move on to delete the data, freeing up space.
Filecoin
Filecoin is a decentralized storage network that incentivizes miners with its native token, Filecoin, for providing storage capacity. To earn these rewards, miners must generate proofs that verify their storage capacity.
The fundamental storage unit in Filecoin is known as a sector, which has a standardized size and a lifespan that providers can extend [3]. This design carefully balances security and usability. All user data stored on Filecoin is encrypted, and multiple copies are distributed across the network, ensuring that miners cannot access the contents of the files.
Miners' influence in the Filecoin network is proportional to the amount of storage they provide, which also allows them to participate in the network's consensus mechanisms. The Filecoin Virtual Machine oversees executing smart contracts and facilitates market operations, such as pairing storage providers with users.
Filecoin’s architecture is modular, allowing nodes to operate specific parts of the system as needed. For instance, a node could exclusively function as a storage node without engaging in market operations.
For data integrity and availability, filecoin relies on two algorithms: Proof of Storage and Proof of Replication.
Proof of Storage
Miners in Filecoin produce proofs to verify that they hold a copy of data at any given time. This proof is achieved through challenges: the system poses questions to the miners that they can only answer correctly if they have the data.
To ensure that miners don't just copy the data whenever a challenge is posed, the challenges are designed to target random parts of the data at unpredictable time intervals. The combination of randomness and lack of clarity on intervals makes it impossible, unprofitable, and irrational for miners to fetch data only as challenged [4].
Proof of Spacetime (PoSt)
Going beyond one-time proving, Filecoin introduced Proof of Spacetime (PoSt) to ensure consistent storage and data availability. Proof of Spacetime verifies storage over time intervals by giving miners cryptographic challenges that they can only pass if the file is stored over the indicated timeframe [5].
PoSt poses two types of challenges:
Winning PoSt: where the miner verifies that they have stored a copy of the data at a specific time, typically when the algorithm selects the miner to mine the next block. The short deadline ensures they have the data.
WindowPoSt: a recurring challenge where miners submit proof that they have maintained the data as required. It would be more expensive for a miner to seal the data only when asked to submit it.
Sealing is part of the Proof-of-Replication algorithm and is computationally intensive, making it desirable to reduce the need to seal as much as possible for a rational miner [5].
Proof of Replication
Sealing is part of the Proof of Replication algorithm and is computationally intensive, encouraging miners to reduce the frequency of sealing. Proof of Replication assures users that a miner has created and stored a unique copy on their physical hardware. This proof includes:
The data itself.
The miner who sealed it.
The time and date of sealing.
The block height at which the data was sealed.
Requiring miners to generate both proofs acts as a guarantee for users that files are securely stored and only miners providing actual storage receive rewards. Since the proof is too large to be stored on-chain, miners generate a zkSNARK, which is then submitted onchain, making Filecoin the largest zkSNARK user with 6 - 7 million proofs generated daily [6].
Overall, Filecoin combines both deterministic proofs (PoRep) and probabilistic proofs (PoSt) for a hybrid approach.
Celestia
The third example in this post is Celestia, a so-called data availability blockchain that provides execution and data storage for modular blockchains, allowing them to outsource core functionality.
With the rise of Ethereum rollups, DA solutions like Celestia have gained popularity for offering a cheaper alternative to host rollup's transaction data than Ethereum archive nodes.
Proving Data Availability
Unlike Filecoin, Celestia does not provide a storage solution for end-users but focuses solely on solving data availability issues. Data availability ensures that blockchain data has been properly published. Typically, blockchain nodes must download entire blocks to verify availability, a resource-intensive process that can hinder verifiability.
To streamline this process, Celestia employs Data Availability Sampling (DAS). This method involves light nodes downloading only a small portion of data until a predetermined confidence level is reached. If the data in the sample is all available, it is considered published, serving as a probabilistic proof of data availability [7].
Here's how it works:
A proposer creates a data block.
The block data is split into k×k chunks, forming a matrix.
This matrix is expanded by adding parity data, creating a 2k×2k matrix using Reed-Solomon encoding. This type of encoding allows the recovery of the entire data set from a subset of the data.
Separate Merkle roots for each row and column of the extended matrix are computed and combined.
Finally, the Merkle root of all these combined roots is added to the block data commitment in the block header, confirming the data's availability.
To verify availability, light nodes randomly sample unique coordinates in the extended matrix and then query full nodes for the data chunks corresponding to the Merkle proofs at those coordinates. If the responses are correct, they attest to a high probability that the entire block's data is available.
Nodes then broadcast the received data chunks with correct Merkle proofs to the rest of the network. As long as sufficient chunks are sampled, nodes can reconstruct entire blocks, enabling Celestia to rely more heavily on resource-limited nodes for verification, thereby aiding decentralization efforts.
At the time of writing, Celestia is still new. Nevertheless, Data Availability Sampling is a technology that might find adoption beyond Celestia, with core Ethereum developers discussing adding it to the protocol to aid scaling efforts.
Conclusion
In conclusion, various methods for storing and verifying the availability of data in distributed networks are operational and being actively utilized.
Tagion employs a DART database that enables sharding to enhance throughput and support the development of specialized sub-ecosystems secured by sparse Merkle Trees.
Filecoin's architecture leverages two distinct algorithms, Proof of Spacetime and Proof of Replication, enabling miners to verify and demonstrate that they have reliably stored data. These proofs are subsequently recorded on-chain in the form of zero-knowledge proofs.
Celestia, serving as a data availability layer, utilizes Reed-Solomon encoding to expand data blocks into matrices. This structure allows light clients to perform random sampling to confirm data availability, circumventing the need to download entire data sets.
As the landscape of distributed storage systems continues to evolve, Tagion, Filecoin, and Celestia each present unique strategies to ensure data integrity, availability, and accessibility. Together, these platforms contribute significantly to the development of resilient data publishing and storage systems that power a decentralized web.
About the Author
W-SOURCE is an early stage Venture Builder and a micro fund, focusing on deeptech & web3. W-SOURCE is pioneering a new model in investmenting, by spearheading tech to market for our portfolio companies.
[1] https://docs.tagion.org/docs/architecture/DART
[3] https://spec.filecoin.io/systems/filecoin_mining/sector/
[4] https://spec.filecoin.io/algorithms/pos/
[5] https://spec.filecoin.io/systems/filecoin_mining/sector/sealing/
[6] https://twitter.com/FilecoinTLDR/status/1722984986196774962
[7] https://celestia.org/glossary/data-availability-sampling/