Alternative to MongoDB for Blob Data

· 9 min read
Anthony Cavin
Data Scientist - ML/AI, Python, TypeScript

ReductStore vs MongoDB

In edge computing, managing time series blob data efficiently is critical for performance-sensitive applications. This blog post will compare ReductStore, a specialized time series database for unstructured data, and MongoDB, a widely-used NoSQL database.

Using Docker containers for straightforward setup, we'll examine the speed of each system. We'll go through setting up ReductStore buckets and preparing MongoDB collections, focusing on how to effectively store and access blob data for time series scenarios.

By conducting performance tests on binary data insertion and retrieval, we aim to provide insights into which system might best serve your application's needs.

For those interested in replicating our benchmarks or conducting their own evaluations, we've made our methods easily accessible through this repository.

Launching ReductStore and MongoDB Using Docker

To launch ReductStore and MongoDB using Docker, create a docker-compose.yml file with the necessary configurations.

This setup initializes a MongoDB container with exposed ports for local access, sets up user credentials, and mounts a local volume for data persistence. Similarly, it configures ReductStore with an API token and bucket details while also mounting a volume to persist its data.

The process is straightforward:

  • Define service configurations for both mongodb and reductstore.
  • Specify Docker images: mongo:latest for MongoDB and reduct/store:latest for ReductStore.
  • Configure environment variables such as credentials and bucket properties.
  • Assign volumes to ensure data persists across container restarts.

Here's the Docker Compose snippet:

version: "3"
services:
  mongodb:
    image: mongo:latest
    ports:
      - "27017:27017"
    volumes:
      - ${PWD}/data/mongodb/db:/data/db
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: password

  reductstore:
    image: reduct/store:latest
    ports:
      - "8383:8383"
    volumes:
      - ${PWD}/data/reductstore:/data
    environment:
      RS_API_TOKEN: reductstore
      RS_BUCKET_1_NAME: benchmark
      RS_BUCKET_1_QUOTA_TYPE: FIFO
      RS_BUCKET_1_QUOTA_SIZE: 50TB

Running this configuration will start both databases in their respective containers, enabling you to work with time series blob data in your applications.
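
Before running any benchmark, it can help to confirm that both containers accept connections. The sketch below is one way to do that with the Python standard library only; `wait_for_port` is a hypothetical helper name, and the ports are the ones published in the Compose file above.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect only proves the port is open, not that
            # the database has finished initializing.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # Service not ready yet; retry shortly
    return False
```

After starting the containers, `wait_for_port("localhost", 27017)` and `wait_for_port("localhost", 8383)` should both return True once MongoDB and ReductStore are listening.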

Preparing ReductStore Buckets and MongoDB Collections for Time Series Data

When preparing storage solutions for time series data, it’s necessary to configure each system to handle the specific nature of the information being processed.

For MongoDB, creating a time series collection is straightforward—simply use a script that connects to the database and sets up a collection with the appropriate configuration.

Here's an example in Python:

from pymongo import MongoClient

def setup_database():
    # Create the time series collection once; later runs reuse it.
    with MongoClient(CONNECTION_MONGODB) as client:
        db = client["benchmark"]
        if "data" not in db.list_collection_names():
            db.create_collection("data", timeseries={"timeField": "time"})

In this snippet, we connect to MongoDB, check whether a collection named data exists within our benchmark database, and create one configured as a time series with time as our designated time field.

For ReductStore, you can create the bucket either from Python code or, following infrastructure-as-code practice, through environment variables. As in the previous section, the variables below set the API token and the bucket's name, quota type (e.g., FIFO), and quota size:

RS_API_TOKEN: reductstore
RS_BUCKET_1_NAME: benchmark
RS_BUCKET_1_QUOTA_TYPE: FIFO
RS_BUCKET_1_QUOTA_SIZE: 50TB

By defining these settings in your environment or through your deployment scripts, you instruct ReductStore on how to initialize your buckets when it starts up.
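
The Python snippets in this post rely on a few module-level names (`CONNECTION_MONGODB`, `CONNECTION_REDUCT`, `BLOB_SIZE`, `BLOB_COUNT`, `CHUNK`) whose definitions are not shown. A minimal sketch of what they might look like follows. The connection strings are assumptions derived from the credentials and ports in the Compose file, and the size and count values are placeholders rather than the benchmark's actual settings.

```python
import os

# Assumed connection settings, derived from the Compose file (adjust to your setup).
CONNECTION_MONGODB = "mongodb://admin:password@localhost:27017"
CONNECTION_REDUCT = "http://localhost:8383"

# Placeholder workload parameters; the benchmark varies the blob size.
BLOB_SIZE = 100_000  # bytes per blob
BLOB_COUNT = 1_000   # blobs written per run

# One random payload reused for every write, so data generation is not timed.
CHUNK = os.urandom(BLOB_SIZE)
```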

Writing Time Series Blob Data: Comparing MongoDB Insertions to ReductStore

When writing time series blob data, MongoDB and ReductStore offer different approaches. MongoDB, a flexible NoSQL database, handles unstructured data via GridFS when dealing with large blobs. ReductStore is designed specifically for time series data storage, offering streamlined methods for handling binary data.

Here's how each system operates:

Write to MongoDB

It is possible to write binary data directly to MongoDB collections. However, for large binary objects, GridFS is recommended. GridFS is a specification for storing and retrieving large files in MongoDB, breaking them into smaller chunks for efficient storage and retrieval.

The following Python script demonstrates how to write time series blob data to MongoDB using GridFS:

  • Establish a connection using MongoClient.
  • Access the database and GridFS.
  • Insert blob data into GridFS and reference it in the collection with a timestamp.

from datetime import datetime

from gridfs import GridFS
from pymongo import MongoClient

def write_to_mongodb():
    setup_database()

    with MongoClient(CONNECTION_MONGODB) as client:
        db = client["benchmark"]
        fs = GridFS(db)
        data = db["data"]
        count = 0
        for _ in range(BLOB_COUNT):
            # Store the payload in GridFS, then index it by time in the collection.
            blob_id = fs.put(CHUNK, filename=f"blob_{datetime.now().timestamp()}")
            data.insert_one({"time": datetime.now(), "blob_id": blob_id})
            count += BLOB_SIZE

    return count

Write to ReductStore

  • Create an asynchronous session using ReductClient.
  • Access the desired bucket within ReductStore.
  • Create a batch object and add blobs to it, each with its own timestamp.
  • Write the batch to the bucket and clear it for the next set of blobs. In our example, a batch is flushed once it holds more than 80 records or exceeds 8 MB.

import asyncio
from datetime import datetime

from reduct import Batch, Client as ReductClient

async def write_to_reduct():
    async with ReductClient(
        CONNECTION_REDUCT, api_token="reductstore"
    ) as reduct_client:
        count = 0
        bucket = await reduct_client.get_bucket("benchmark")
        batch = Batch()
        for _ in range(BLOB_COUNT):
            batch.add(timestamp=datetime.now().timestamp(), data=CHUNK)
            await asyncio.sleep(0.000001)  # To avoid time collisions
            if len(batch) > 80 or batch.size > 8_000_000:
                await bucket.write_batch("data", batch)
                count += batch.size
                batch.clear()

        # Write the last batch
        if len(batch) > 0:
            await bucket.write_batch("data", batch)
            count += batch.size

    return count

Comparing these two methods shows that MongoDB requires setting up GridFS explicitly and maintaining references between the files stored in GridFS and the documents in the time series collection.

Conversely, ReductStore simplifies this process by providing direct binary writes to buckets without additional setup. These differences are critical when choosing a platform based on ease of implementation and alignment with your application's unstructured time series data management needs. In addition, ReductStore's batch approach allows for efficient writing of multiple blobs, ensuring optimal performance when handling large amounts of binary data.
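
To see how the two flush conditions interact, the following sketch simulates the batching rule in plain Python, without any database. `plan_batches` is a hypothetical helper; it assumes the batch size is simply the sum of the payload sizes, which is a simplification.

```python
from typing import Iterable, List


def plan_batches(
    blob_sizes: Iterable[int],
    max_records: int = 80,
    max_bytes: int = 8_000_000,
) -> List[int]:
    """Return the number of records in each flushed batch."""
    batches: List[int] = []
    records, size = 0, 0
    for blob_size in blob_sizes:
        records += 1
        size += blob_size
        # Same condition as in write_to_reduct: flush after the limit is exceeded.
        if records > max_records or size > max_bytes:
            batches.append(records)
            records, size = 0, 0
    if records > 0:
        batches.append(records)  # Remainder, written as the last batch
    return batches


small = plan_batches([1_000] * 200)     # 1 KB blobs: the record count triggers flushes
large = plan_batches([1_000_000] * 20)  # 1 MB blobs: the byte size triggers flushes
```

With small blobs the record limit dominates, so each request carries many records; with large blobs the byte limit caps each request at a handful of records.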

Reading Back the Time: Retrieval of Blob Data from MongoDB vs. ReductStore

Retrieving time series blob data from MongoDB and ReductStore involves querying their respective storage structures with time constraints.

Read from MongoDB

In MongoDB, this requires utilizing GridFS alongside collections to locate and read the binary content based on a time range. The Python script for MongoDB connects to the database, queries the collection within a specified time frame, and reads each corresponding blob from GridFS.

from datetime import datetime

from gridfs import GridFS
from pymongo import MongoClient

def read_from_mongodb(t1, t2):
    count = 0
    with MongoClient(CONNECTION_MONGODB) as client:
        db = client["benchmark"]
        fs = GridFS(db)
        data = db["data"]
        # Find the documents in the time range, then fetch each blob from GridFS.
        for rec in data.find(
            {
                "time": {
                    "$gt": datetime.fromtimestamp(t1),
                    "$lt": datetime.fromtimestamp(t2),
                }
            }
        ):
            blob = fs.get(rec["blob_id"]).read()
            count += len(blob)

    return count

Read from ReductStore

Conversely, ReductStore offers an asynchronous interface designed for efficient retrieval of time series data. The given example demonstrates initiating an async connection to ReductStore, accessing a bucket, and using its query method to fetch records in the desired time interval directly.

async def read_from_reduct(t1, t2):
    async with ReductClient(
        CONNECTION_REDUCT, api_token="reductstore"
    ) as reduct_client:
        count = 0
        bucket = await reduct_client.get_bucket("benchmark")
        async for rec in bucket.query("data", t1, t2, ttl=90):
            count += len(await rec.read_all())

    return count

The read_all method in ReductStore retrieves the entire binary content of a record. The read method, on the other hand, allows for reading chunks of data in a streaming fashion, providing flexibility in handling large binary payloads.
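
The difference between the two reading styles can be illustrated without a running server. In the sketch below, `read_chunks` is a hypothetical stand-in for a record's streaming read, not the ReductStore client API; it only shows how a consumer can total up a payload piece by piece instead of holding it all in memory at once.

```python
import asyncio
from typing import AsyncIterator


async def read_chunks(payload: bytes, n: int) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for a streaming read: yields n-byte pieces."""
    for start in range(0, len(payload), n):
        yield payload[start:start + n]
        await asyncio.sleep(0)  # Give other tasks a chance to run between chunks


async def count_streamed(payload: bytes, n: int = 64 * 1024) -> int:
    """Consume the payload chunk by chunk and return the number of bytes seen."""
    total = 0
    async for chunk in read_chunks(payload, n):
        total += len(chunk)  # Only one chunk is processed at a time
    return total


streamed_total = asyncio.run(count_streamed(b"\x00" * 1_000_000))
```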

Performance Analysis: Insertion and Retrieval Speeds for Binary Data in MongoDB and ReductStore

To evaluate the performance of MongoDB and ReductStore, we conducted a test using blob sizes ranging from 1 KB to 1 MB. The test measured the speed of writing and reading binary data on an NVMe drive.
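
The exact timing harness is not shown in this post. As a rough sketch of how a blob/s figure could be obtained, the hypothetical helper below times a callable that returns the number of bytes it processed (as the write and read functions above do) and converts the result to blobs per second.

```python
import time
from typing import Callable


def blobs_per_second(operation: Callable[[], int], blob_size: int) -> float:
    """Time one run of `operation` and return its throughput in blobs per second."""
    start = time.perf_counter()
    total_bytes = operation()  # The operation reports how many bytes it handled
    elapsed = time.perf_counter() - start
    return (total_bytes / blob_size) / elapsed
```

A synchronous function can be passed directly, for example `blobs_per_second(write_to_mongodb, BLOB_SIZE)`; an async one can be wrapped as `lambda: asyncio.run(write_to_reduct())`.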

The results are summarized in the table below:

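| Chunk Size | Operation | MongoDB, blob/s | ReductStore, blob/s | ReductStore, % |
|------------|-----------|-----------------|---------------------|----------------|
| 1 KB       | Write     | 799             | 7994                | +900%          |
| 1 KB       | Read      | 2008            | 47979               | +2300%         |
| 10 KB      | Write     | 783             | 7431                | +850%          |
| 10 KB      | Read      | 1918            | 32888               | +1600%         |
| 100 KB     | Write     | 694             | 3612                | +420%          |
| 100 KB     | Read      | 1730            | 6250                | +260%          |
| 1 MB       | Write     | 246             | 663                 | +170%          |
| 1 MB       | Read      | 776             | 540                 | -30%           |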

The results show that ReductStore outperforms MongoDB with GridFS in every case except 1 MB reads, where MongoDB is about 30% faster. The gains are largest for small blobs: roughly 10x for 1 KB writes and over 20x for 1 KB reads. ReductStore's batch writes and direct binary storage contribute to this advantage, making it an attractive option for applications that handle time series blob data, particularly many small to medium-sized records.
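
The percentage column can be recomputed from the two throughput columns as (ReductStore / MongoDB - 1) x 100. The short sketch below does this for the figures in the table; `relative_gain` is a hypothetical helper name, and the published percentages appear to be rounded.

```python
def relative_gain(mongo_rate: float, reduct_rate: float) -> float:
    """ReductStore throughput relative to MongoDB, in percent."""
    return (reduct_rate / mongo_rate - 1.0) * 100.0


# (chunk size, operation) -> (MongoDB blob/s, ReductStore blob/s), from the table
results = {
    ("1 KB", "Write"): (799, 7994),
    ("1 KB", "Read"): (2008, 47979),
    ("10 KB", "Write"): (783, 7431),
    ("10 KB", "Read"): (1918, 32888),
    ("100 KB", "Write"): (694, 3612),
    ("100 KB", "Read"): (1730, 6250),
    ("1 MB", "Write"): (246, 663),
    ("1 MB", "Read"): (776, 540),
}

for (size, op), (mongo, reduct) in results.items():
    print(f"{size:>6} {op:<5} {relative_gain(mongo, reduct):+.0f}%")
```

A negative value, as for 1 MB reads, means MongoDB was the faster of the two in that test.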

Additional Criteria

When selecting a database for blob storage, criteria other than performance must also be taken into account:

  • Retention policy: ReductStore and MongoDB offer distinct retention strategies. ReductStore uses policies based on disk utilization, which is essential for edge computing and IoT. MongoDB's time series collections, on the other hand, can automatically purge documents after a set period of time, which is useful for managing the data lifecycle of non-blob data such as numerical values or logs.
  • Replication features: Replication in ReductStore is append-only, with label-based filtering options. This feature can contribute to an effective data reduction strategy by selectively replicating relevant blobs. MongoDB, on the other hand, offers replica sets for high availability and redundancy, ensuring data integrity and fault tolerance in distributed environments.
  • License considerations: MongoDB is governed by the Server Side Public License (SSPL), which requires providers who offer MongoDB as a service to release the source code of that service. ReductStore, under the Business Source License 1.1 (BSL), allows production use, subject to payment for larger companies, and transitions to the Mozilla Public License Version 2.0 (MPL-2.0) after three years.
  • Cost considerations: ReductStore's pricing model is based on storage, which aligns well with volume-based retention policies and gives a predictable cost structure. MongoDB's pricing is based on instance size and usage, which can be more complex to estimate.

Conclusion

In conclusion, ReductStore and MongoDB each bring their own strengths to handling time series data. MongoDB may be better suited to structured numerical data, but it requires an explicit GridFS setup for blob storage, whereas ReductStore stores binary data directly in buckets. In our performance analysis, ReductStore was faster in every test except 1 MB reads, with the largest gains for small blobs. This makes ReductStore an attractive option for applications dealing with extensive time series blob data.

In choosing a database for time series blob data, it's important to weigh your application's specific needs against factors like data retention policies, replication capabilities, and the licensing models—MongoDB's Server Side Public License (SSPL) and ReductStore's Business Source License 1.1 (BSL). These considerations, along with cost implications and the performance benchmarks provided, will help you make a well-informed decision on the optimal system for your requirements.


For further guidance or queries on the matter, feel free to join our community on the ReductStore Community forum.
