An Overview of OLAP Systems' History

The Final Essay of CMU 15-721 - Advanced Database Systems

Posted by LYriC on June 9, 2023

Background

Since the invention of the first Database Management System (DBMS) in the 1960s, the purpose of a DBMS has not changed: storing and querying data. Systems focusing on the former are called Online Transaction Processing (OLTP) databases, while those focusing on the latter are called Online Analytical Processing (OLAP) databases. There was barely any distinction between the two until the proposal of the DSM storage format, aka columnar storage, by S. Khoshafian et al. in the 1980s [1]. C-Store, an influential early implementation of the idea, came out in 2005.

The columnar storage format marked the divergence of OLAP from OLTP because it brings significant benefits to OLAP workloads but hurts OLTP performance. Since OLAP workloads typically read large amounts of data, storing tables in columns and applying projection pushdown reduce disk I/O and therefore boost query performance [2]. Fully appreciating why this approach works requires a solid understanding of the underlying hardware; adapting software to the most recent hardware has been one of the major trends in DBMS development.
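To make the I/O argument concrete, below is a minimal C++ sketch (the four-column table and all names are hypothetical, not taken from this essay or any cited system) contrasting a row-oriented layout with a columnar one: summing a single attribute over the columnar layout touches only that attribute's array, while the row layout drags the other three attributes through the memory hierarchy as well.

```cpp
// Hypothetical 4-column table stored two ways: as an array of rows (NSM)
// and as one array per column (DSM).
#include <cstdint>
#include <vector>

struct Row { int64_t id; int64_t price; int64_t qty; int64_t discount; };  // row layout: 32 bytes per row

struct ColumnTable {                       // columnar layout
    std::vector<int64_t> id, price, qty, discount;
};

int64_t sum_price_rows(const std::vector<Row>& table) {
    int64_t sum = 0;
    for (const Row& r : table) sum += r.price;   // every 32-byte row is fetched for 8 useful bytes
    return sum;
}

int64_t sum_price_columns(const ColumnTable& table) {
    int64_t sum = 0;
    for (int64_t p : table.price) sum += p;      // only the price column is fetched
    return sum;
}
```

With on-disk columns the effect is the same, only larger: a projection that needs 2 of 10 equally sized columns reads roughly one fifth of the bytes a row store would.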

Impacts of Hardware Development

One main focus of database developers in both industry and academia has always been making the best use of the available hardware. Therefore, advances in database-related hardware such as RAM, disk storage and CPUs usually shape database design.

Database systems before the 1980s were largely constrained by memory: RAM was relatively expensive and disk access slow. The cost of computer memory, both RAM and disk, decreased continuously for years, which made it possible for a DBMS to cache data in memory more extensively or even keep everything in memory. Sybase IQ (1990s) appeared in this period as one of the first commercial column-oriented DBMSs. Increasingly powerful disks, together with the DSM storage format, shifted the DBMS bottleneck to computation. Therefore, the efficiency of computation algorithms became more important after the 2000s.

There has been broad research on accelerating query execution. Some of it focused on specific algorithms; for example, [3, 4] compared different parallel join algorithms on up-to-date hardware, generally assuming the data resides in memory. Other research went further and proposed fundamental changes to the DBMS architecture. Two of the most impactful efforts are query compilation and query vectorization. T. Neumann [5, 6] identified the bottlenecks of computation as poor locality and frequent branch misprediction in the CPU. One way to address this is to compile the query plan to machine code instead of interpreting it at runtime; HyPer (2010) was the first DBMS to adopt this approach. Vectorizing query execution is another way to go: the basic idea is to amortize the interpretation overhead incurred for each row across a whole vector of values. MonetDB/X100 (2005) pioneered vectorized execution and showed that it can drastically reduce the number of instructions spent per tuple. Vectorized execution can also benefit from the Single Instruction Multiple Data (SIMD) support in modern CPUs [7]. These two approaches have largely shaped the architecture of modern OLAP systems and will likely remain dominant until the next breakthrough in hardware.
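As a rough illustration of the two ideas, the following C++ sketch (a hypothetical `price < 100` filter; none of this code comes from the cited systems) contrasts tuple-at-a-time interpretation, which pays a virtual call and its branch-prediction cost on every row, with a vectorized loop that pays the interpretation overhead once per batch of 1024 values and leaves a tight inner loop the compiler can map to SIMD instructions [7].

```cpp
// Tuple-at-a-time interpretation vs. vectorized execution of "price < 100".
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kVectorSize = 1024;  // values handled per interpreter invocation

// Interpreted expression tree: one virtual call per row.
struct Expr {
    virtual bool eval(double price) const = 0;
    virtual ~Expr() = default;
};
struct LessThan100 : Expr {
    bool eval(double price) const override { return price < 100.0; }
};

std::size_t filter_tuple_at_a_time(const std::vector<double>& prices, const Expr& e) {
    std::size_t hits = 0;
    for (double p : prices) hits += e.eval(p);   // interpretation overhead paid on every row
    return hits;
}

// Vectorized: the overhead is paid once per batch; the inner loop is branch-
// and SIMD-friendly.
std::size_t filter_vectorized(const std::vector<double>& prices) {
    std::size_t hits = 0;
    for (std::size_t begin = 0; begin < prices.size(); begin += kVectorSize) {
        const std::size_t end = std::min(begin + kVectorSize, prices.size());
        for (std::size_t i = begin; i < end; ++i) hits += (prices[i] < 100.0);
    }
    return hits;
}
```

A compiling engine in the style of [5] goes one step further and generates the tight loop directly as machine code for the specific query, removing the interpreted operator boundaries altogether.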

So far, we have mainly discussed how the characteristics of memory and the CPU individually determine a DBMS's design and implementation. However, balancing CPU and memory performance to avoid under-utilizing either is equally important. Algorithms that find the sweet spot for both are more generally applicable.

  1. Traditional DBMSs often use the B-tree as the primary index. While a B-tree can minimize disk I/O, maintaining and traversing it is expensive for the CPU. Therefore, OLAP systems tend to favor indexes that are less precise but cheaper to store and compute, such as zone maps [8] (a zone-map sketch follows this list).
  2. Compression in OLAP is easier because of columnar storage. While there are many column compression algorithms to choose from, it is recommended to use lightweight compression that does not put much load on the CPU [9] (a run-length-encoding sketch also follows this list).
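Below is the zone-map sketch referenced in item 1: a minimal C++ version, assuming a single integer column split into fixed-size blocks (the names are mine). Each block stores only its minimum and maximum value, and a block is scanned only if the predicate could possibly match, which makes the index tiny and nearly free to maintain compared with a B-tree.

```cpp
// Hypothetical zone map: one (min, max) pair per fixed-size block of a column.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct ZoneMapEntry { int64_t min; int64_t max; };

std::vector<ZoneMapEntry> build_zone_map(const std::vector<int64_t>& column,
                                         std::size_t block_size) {
    std::vector<ZoneMapEntry> zone_map;
    for (std::size_t i = 0; i < column.size(); i += block_size) {
        ZoneMapEntry entry{std::numeric_limits<int64_t>::max(),
                           std::numeric_limits<int64_t>::min()};
        const std::size_t end = std::min(i + block_size, column.size());
        for (std::size_t j = i; j < end; ++j) {
            entry.min = std::min(entry.min, column[j]);
            entry.max = std::max(entry.max, column[j]);
        }
        zone_map.push_back(entry);
    }
    return zone_map;
}

// For a predicate "value == key", skip every block whose [min, max] range
// cannot contain the key; only the remaining blocks are actually scanned.
bool block_may_contain(const ZoneMapEntry& entry, int64_t key) {
    return key >= entry.min && key <= entry.max;
}
```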
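And the lightweight-compression sketch referenced in item 2, assuming run-length encoding (RLE) of an integer column; the function names are mine. Encoding and decoding are single passes with trivial CPU cost, and a sorted or low-cardinality column collapses into very few runs, which is why such schemes are preferred over heavyweight general-purpose compressors for OLAP data [9].

```cpp
// Hypothetical run-length encoding: the column becomes (value, run length) pairs.
#include <cstdint>
#include <utility>
#include <vector>

std::vector<std::pair<int64_t, uint32_t>> rle_encode(const std::vector<int64_t>& column) {
    std::vector<std::pair<int64_t, uint32_t>> runs;
    for (int64_t value : column) {
        if (!runs.empty() && runs.back().first == value) {
            ++runs.back().second;            // extend the current run
        } else {
            runs.emplace_back(value, 1u);    // start a new run
        }
    }
    return runs;
}

std::vector<int64_t> rle_decode(const std::vector<std::pair<int64_t, uint32_t>>& runs) {
    std::vector<int64_t> column;
    for (const auto& [value, length] : runs) {
        column.insert(column.end(), length, value);
    }
    return column;
}
```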

Optimizing a DBMS within the constraints of different hardware is exciting. However, as single-core performance approached its ceiling, engineers started putting multiple cores on one CPU and even connecting multiple machines to work on a single task. This led to new challenges and opportunities in database development.

Parallel Processing, Distributed Computing and Cloud Platforms

Multi-core CPUs were introduced in the 2000s to keep Moore's law going. As they became prevalent, parallel processing capability became even more essential for all databases. Parallelism in a DBMS is categorized as inter-query and intra-query. Inter-query parallelism is the easier one to conceive and realize: multiple workers in the DBMS are assigned to handle concurrent queries. Some databases, like PostgreSQL, fork a new process for each worker, while others, like Oracle Database, create a thread for each worker. Intra-query parallelism is much harder to achieve, as the classic "plan-driven" parallelism suffers from unbalanced load and frequent context switching [10]. The state-of-the-art solution for intra-query parallelism is morsel-driven parallelism, which uses autonomous, Non-Uniform Memory Access (NUMA)-aware worker threads [10].
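The following is a minimal sketch of the morsel-driven scheduling idea, reduced to a single scan-and-filter pipeline over one in-memory column and omitting the NUMA-aware placement of the real design [10]; the constants and names are mine. Workers repeatedly claim small fixed-size ranges ("morsels") from a shared atomic cursor, so faster threads naturally take more work and the load balances itself without a central plan.

```cpp
// Morsel-driven scheduling sketch: threads pull morsels from an atomic cursor.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

constexpr std::size_t kMorselSize = 10000;   // rows claimed per request

void scan_pipeline(const std::vector<int64_t>& column,
                   std::atomic<std::size_t>& cursor,
                   std::atomic<std::size_t>& hits) {
    while (true) {
        const std::size_t begin = cursor.fetch_add(kMorselSize);  // claim the next morsel
        if (begin >= column.size()) break;                        // nothing left to do
        const std::size_t end = std::min(begin + kMorselSize, column.size());
        std::size_t local_hits = 0;
        for (std::size_t i = begin; i < end; ++i) local_hits += (column[i] > 42);
        hits.fetch_add(local_hits);
    }
}

int main() {
    std::vector<int64_t> column(1'000'000, 7);
    std::atomic<std::size_t> cursor{0}, hits{0};
    const unsigned worker_count = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < worker_count; ++i)
        workers.emplace_back(scan_pipeline, std::cref(column), std::ref(cursor), std::ref(hits));
    for (auto& w : workers) w.join();
}
```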

Nevertheless, in the age of the Internet boom, a single multi-core machine became insufficient for data warehouse query demands. Large IT companies started to organize numerous machines to work on the same job in a distributed manner. For example, Google proposed MapReduce [11] in 2004 to analyze massive amounts of data from the Internet.

Another kind of DBMS, which trades the relational model for per-record query performance, was also born in the late 2000s. These systems were called NoSQL because at first they lacked support for the relational data model and ACID transactions. NoSQL systems such as Cassandra (2008), MongoDB (2009) and DynamoDB (2012) often adopted a shared-nothing architecture that scales out across separate nodes that do not share resources. In the 2010s, NewSQL came to life, reviving ACID and the relational SQL model. However, NewSQL systems also faded after the rise of cloud computing.

As we entered the 2010s, improvements in network bandwidth made it possible for Network Attached Storage (NAS) to replace the Direct Attached Storage (DAS) used by shared-nothing distributed DBMSs [12]. In place of the early form of cloud database, which simply containerized existing systems, some new systems were designed from the ground up to be cloud-native. Snowflake, which adopts a shared-disk architecture, became available in 2013. Its design principles can be summarized as 1) push predicates down to the disaggregated storage layer and 2) make idle compute nodes cost nothing [13]. Subsequent cloud databases generally adopted these two rules as well.
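As a hypothetical illustration of principle 1 (the request structure and every name here are mine, not Snowflake's), the compute layer can ship the projection list and a simple range predicate to the disaggregated storage layer, which filters locally so that only qualifying values ever cross the network.

```cpp
// Sketch of predicate/projection pushdown to a disaggregated storage service.
#include <cstdint>
#include <string>
#include <vector>

struct RangePredicate { std::string column; int64_t lo; int64_t hi; };

struct ScanRequest {
    std::vector<std::string> columns;   // projection pushdown: only these columns are returned
    RangePredicate predicate;           // predicate pushdown: filtering happens at the storage side
};

// Runs inside the storage service, next to the data.
std::vector<int64_t> storage_side_scan(const std::vector<int64_t>& column_data,
                                       const ScanRequest& request) {
    std::vector<int64_t> result;
    for (int64_t value : column_data)
        if (value >= request.predicate.lo && value <= request.predicate.hi)
            result.push_back(value);
    return result;                      // only the qualifying fraction is sent over the network
}
```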

State-of-the-Art OLAP System Analysis

Thanks to the huge demand for data analysis across all types of business, the OLAP system market has never been more prosperous. Each system that stood out from the fierce competition demonstrates unique features favored by its users. OLAP system developers should learn from their common aspects and reason about their differences.

  1. Almost all modern databases support a SQL interface. We saw the trend of giving up SQL in the NoSQL movement of the 2000s, but that did not end well [12]. Clearly, database users prefer a standardized interface. Other classic features of relational databases, like ACID and the relational model, are also in high demand; DBMS developers are advised to think carefully before abandoning them.
  2. A high-performance database execution engine should have full control of its memory usage through a low-level language like C++. In the 2010s, it was popular to build new DBMSs in higher-level languages, such as Spark (2014, Scala) and QuestDB (2014, Java). However, problems with loosely controlled memory arose later, and such systems had to reinvent the wheel in C++ [20].
  3. Although many well-known state-of-the-art DBMSs built their monolithic systems from the ground up, it may not be worthwhile to do so in the 2020s, as OLAP system commoditization is clearly around the corner [12]. There are well-established database components that developers can plug in and use, such as HCatalog (System Catalog), Arrow (Columnar Data Format) and Velox (Execution Engine) [14].
  4. Push-based and pull-based describe the direction in which data flows between operators. The pull-based approach follows the traditional Volcano-style iterator model and is easier to implement. The push-based approach was popularized by [5]: scans produce the data and push it up through the operator pipelines. In the long run, push-based models seem to be a better fit for compiled or vectorized architectures and tend to perform better (a sketch of both styles follows this list).
  5. Distributed databases are generally harder to install and configure. Except for large corporations that can manage their own clusters, most small to medium businesses may prefer serverless databases provided by cloud vendors. A new distributed database should consider running natively on the cloud; otherwise, it may face a market segment that has been squeezed into a niche.
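Here is the sketch of the two data-flow styles mentioned in item 4, using a hypothetical scan -> filter -> count pipeline over one integer column; the interfaces are mine and not taken from any particular engine.

```cpp
// Pull-based (Volcano-style iterator) vs. push-based execution of
// "count rows where value < limit".
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Pull: the consumer sits on top and repeatedly asks the child for the next tuple.
struct PullScan {
    const std::vector<int64_t>& data;
    std::size_t pos = 0;
    std::optional<int64_t> next() {
        if (pos == data.size()) return std::nullopt;
        return data[pos++];
    }
};

std::size_t pull_count_less_than(PullScan scan, int64_t limit) {
    std::size_t count = 0;
    while (auto value = scan.next())          // control flows top-down, data bottom-up
        if (*value < limit) ++count;
    return count;
}

// Push: the scan owns the loop and pushes each tuple into a consumer callback
// assembled from the downstream operators.
void push_scan(const std::vector<int64_t>& data,
               const std::function<void(int64_t)>& consume) {
    for (int64_t value : data) consume(value); // data and control both flow bottom-up
}

std::size_t push_count_less_than(const std::vector<int64_t>& data, int64_t limit) {
    std::size_t count = 0;
    push_scan(data, [&](int64_t value) { if (value < limit) ++count; });
    return count;
}
```

In the push model the downstream logic ends up inlined into the scan's loop body, which is exactly the shape that query compilation [5] and vectorized batch processing exploit.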

Conclusion

These may be the best days for OLAP systems but the hardest days for startups in this field. Every popular OLAP system seems extremely powerful and has built its own moat. However, with all the available lessons, a new DBMS may still thrive by following the right design principles.

  1. Find a specific niche. OLAP systems nowadays face diverse use cases. Some systems focus on single-node computation performance, such as Umbra (2018) and Vectorwise (2010); cloud databases target throughput and scalability; embedded databases specialize in usability and robustness; other categories include stream processing and log processing. Each of these niches has distinct requirements, and it is good enough for a new system to specialize in one of them.
  2. Make proper technology tradeoffs given the use case. There is no de facto formula for any type of database, but there are certain open design decisions to make.
  3. Single-node systems and some cloud databases (like Redshift) need to decide whether to compile queries. While compilation reduces the CPU branch misprediction rate, it can significantly increase the complexity of coding and debugging.
  4. Cloud and distributed databases have to craft their system architecture. Available options include, but are not limited to,
    1. whether to place shuffle nodes between execution nodes, as in Apache Dremel [15, 16]
    2. whether to support query-level scalability by borrowing nodes from other warehouses, as in Snowflake [13]
    3. how to maintain more accurate statistics in the system catalog [18].
  5. Regardless of database type, the query optimizer will be the most essential component that makes a difference. Worst-case optimal joins are becoming a requirement for all query optimizers [17]. Other potential improvements to the optimizer include, but are not limited to, adaptive changes to the query plan, more accurate cost models, join flattening, and the use of machine learning models for cost estimation [18, 19].

References

[1] S. Khoshafian, G. Copeland, T. Jagodits, H. Boral, and P. Valduriez, "A Query Processing Strategy for the Decomposed Storage Model," in IEEE Third International Conference on Data Engineering, 1987.

[2] D. Abadi, et al., Column-Stores vs. Row-Stores: How Different Are They Really?, in SIGMOD, 2008

[3] S. Schuh, et al., An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory, in SIGMOD, 2016

[4] C. Balkesen, et al., Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited, in VLDB, 2013

[5] T. Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware, in VLDB, 2011

[6] T. Kersten, et al., Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask, in VLDB, 2018

[7] H. Lang, et al., Make the Most out of Your SIMD Investments: Counter Control Flow Divergence in Compiled Query Pipelines, in VLDB Journal, 2020

[8] A. Pavlo, “CMU 15-721 :: Advanced Database Systems (spring 2023),” Available: https://15721.courses.cs.cmu.edu/spring2023/schedule.html#feb-01-2023.

[9] D. Abadi, et al., Integrating Compression and Execution in Column-Oriented Database Systems, in SIGMOD, 2006

[10] V. Leis, et al., Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age, in SIGMOD, 2014

[11] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Sixth Symposium on Operating System Design and Implementation (OSDI), 2004.

[12] M. Stonebraker, et al., What Goes Around Comes Around… And Around (CMU Only), 2023

[13] B. Dageville, et al., The Snowflake Elastic Data Warehouse, in SIGMOD, 2016

[14] A. Pavlo, “CMU 15-721 :: Advanced Database Systems (spring 2023),” Available: https://15721.courses.cs.cmu.edu/spring2023/schedule.html#jan-23-2023.

[15] S. Melnik, et al., Dremel: Interactive Analysis of Web-Scale Datasets, in VLDB, 2010

[16] S. Melnik, et al., Dremel: A Decade of Interactive SQL Analysis at Web Scale, in VLDB, 2020

[17] M. Freitag, et al., Adopting Worst-Case Optimal Joins in Relational Database Systems, in VLDB, 2020

[18] V. Leis, et al., How Good are Query Optimizers, Really?, in VLDB, 2015

[19] Y. Xu, Efficiency in the Columbia Database Query Optimizer (pages 1-35), Portland State University, 1998

[20] A. Behm, et al., Photon: A Fast Query Engine for Lakehouse Systems, in SIGMOD, 2022

Disclaimer:

To explain every concept discussed in this essay, there are simply too many references to cite. For some topics, this essay cites only one representative among several excellent papers. Also, it can sometimes be hard to tell whether an idea came from a specific paper or from the course instructor, Prof. Andy Pavlo. All possibly overlooked references can be found on the course website: https://15721.courses.cs.cmu.edu/spring2023/schedule.html.