Join Transform 2021 for the most important themes in enterprise AI & Data. Learn more.
A decentralized database splits the workload up among multiple machines and uses sophisticated algorithms to balance the incoming and outgoing requests for the best response time. This type of database is useful for those times when there is more data that needs to be stored in the database than can physically saved on one physical machine. The bits — like log files, data collected by tracking click-throughs in the application, and the data generated by internet of things devices — pile up and need to be stored somewhere. They are also frequently referred to as distributed databases.
There are several good reasons for splitting up a database:
- Size: The largest commodity disk drives available at the time of this writing are 18 terabytes. Some data sets are larger than can be stored on a single drive. These data sets must be split up across multiple drives.
- Demand: If many users are trying to access the data at the same time, database performance suffers. Splitting the workload means that more machines can answer more requests, and users don’t notice any performance delays.
- Redundancy: Drives can fail. If the data is valuable, creating multiple copies and storing them across multiple machines protects against hardware failure.
- Geographic redundancy: Spreading out multiple copies in different locations reduces the threat of catastrophic fire, natural disaster, or power outage.
- Speed: Network latency is still a problem when the database and the user making queries are geographically far apart. Placing copies of the data in centers close to the user results in faster responses because the data doesn’t have to travel as far. Speed is especially important for projects that work with people in different continents.
- Computational load: Some data sets have to be distributed because the computational load required during analysis is too large for one machine to handle. A machine learning application, for instance, may distribute large data sets across multiple systems in order to spread out the analytical work, which can be quite substantial.
- Privacy: Some data sets are split up to maximize privacy and minimize the risks in case of a data breach. If different parts of the data are stored on different machines, even if one part is exposed in a breach, the rest of the data is still safe.
- Politics: When multiple groups use the same data set, there may be some challenges over governance. Having the data stored across multiple machines can be useful if some data is stored with one group and some other data is managed by another group.
One approach to simplify the architecture is to split the dataset into smaller parts and assign the parts to certain machines. One computer might handle all people whose last name begins with A through F, another G through M, etc. This splitting, often called “sharding,” can inspire strategies that range from simple to complex.
Distributing a database can be tricky
The greatest challenge with splitting up the database is ensuring that the information remains consistent. For example, in the case of a hypothetical airline booking system, if one machine responds to a database query that an airplane seat has been sold, then another machine shouldn’t respond to a query by saying that the seat is open and available.
Some distributed databases enforce the rules on consistency carefully so that all queries receive the same answer, regardless of which node in the cluster responded to the query. Other distributed databases relax the consistency requirement in favor of “eventual consistency.” With eventual consistency, the machines can be out-of-sync with each other and return different answers, so long as the machines eventually catch up to each other and return the same results. In some narrow cases, one machine may not hear about the new version of the data stored on another machine for some time. Machines in the same datacenter tend to reach consistency faster than those separated by longer distances or slower networks.
Database developers must choose between fast responses and consistent answers. Tight synchronization between the distributed versions will increase the amount of computation and slow the responses, but the answers will be more accurate. Allowing data to be out of sync will speed up performance, but at the expense of accuracy.
Choosing whether to prioritize speed or accuracy is a business decision that can be an art. Banks, for instance, know their customers want correct accounting more than split-second responses. Social media companies, however, may choose speed because most posts are rarely edited and small differences in propagation aren’t essential.
Legacy approaches to distributed systems
The major database companies offer elaborate options for distributing data storage. Some support large machines with multiple processors, multiple disks, and large blocks of RAM. The machine is technically one computer, but the individual processors coordinate their responses in similar ways as if the processors were separated by continents. Many organizations run their Oracle and SAP deployments on Amazon Web Services in order to take advantage of the computing power. AWS’ u-24tb1.metal, for instance, may look like one machine on the invoice, but it has 448 processors inside, along with 24 terabytes of RAM. It is optimized for very large databases like SAP’s HANA, which stores the bulk of the information in RAM for fast response.
All of the major databases have options for replicating the database to create distributed versions that are split between more distinct machines. Oracle’s database, for instance, has long supported a wide range of replication strategies across collections of machines that can even include non-Oracle databases. Lately, Oracle has been marketing a version with the name “autonomous” to signify that it’s able to scale and replicate itself automatically in response to loads.
MariaDB, a fork of MySQL, also supports a variety of replication strategies that allow the data from one primary node to pass copies of all transactions to replicas that are commonly set up to be read-only. That is, the replica can answer queries for information, but it doesn’t store new data.
In a recent presentation, Max Mether, one of the cofounders of MariaDB, says his company is working hard at adding autonomous abilities to its database.
“The server should know how to tune itself better than you,” he explained. “That doesn’t mean you shouldn’t have the option to tune the server, but for many of these variables, it’s really hard as a user to figure out how to tune them optimally. Ideally you should just let the server choose, based on the current workload, what makes sense.”
Upstarts handle distributed differently
The rise of cloud services hides some of the complexity of distributing the databases, at least for configuring the server and arranging for the connection. DigitalOcean, for instance, offers managed versions of MySQL, PostgreSQL, and Redis. Clusters can be created with a certain size with a single control panel to offer storage and failover.
Some providers have added the ability to spread out clusters in different datacenters around the world. Amazon’s RDS, for instance, can configure clusters that span multiple areas called “availability zones.”
Online file storage is also starting to offer much of the same replication. While the services that offer to store blocks of data in buckets don’t provide the indexing or complex searching of databases, they do offer replication as part of the deal.
Some approaches work to merge more complex calculations with distributed data sets. Tools like Hadoop and Spark, for instance, are just two of the popular open source constellations of tools that match distributed computation with distributed data. There are a number of companies that specialize in supporting versions that are installed in house or in cloud configurations. Databricks’ Delta Lake, for instance, is one product that supports complex data mining operations on distributed data.
Groups that value privacy are also exploring complicated distributed operations like the Interplanetary File System, a project designed to spread web data out among multiple locations for speed and redundancy.
What distributed databases can’t do
Not all work requires the complexity of coordinating multiple machines. Some projects may be labeled “big data” by project managers who feel aspirational, even though the volume and computational load is easily handled by a single machine. If a fast response time is not essential and if the size is not too large and won’t grow in an unpredictable way, a simpler database with regular backups may be sufficient.
This article is part of a series on enterprise database technology trends.
VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.
Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:
- up-to-date information on the subjects of interest to you
- our newsletters
- gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
- networking features, and more