Growth of the relational model: Interdependence and complementarity with big data

Received Mar 20, 2020; Revised Jul 27, 2020; Accepted Nov 6, 2020

A database management system is an established platform for the creation, movement, and use of voluminous data. The area has witnessed a series of developments and technological advancements, from the conventional structured database to the recent buzzword, big data. This paper aims to provide a comprehensive view of the relational database, which is still widely used because of its well-known ACID properties, namely atomicity, consistency, isolation, and durability. Specifically, the objective of this paper is to highlight the adoption of relational model approaches by big data techniques. Towards addressing the reasons for this incorporation, this paper qualitatively studies the advancements made over time to the relational data model. First, the variations in the data storage layout are illustrated based on the needs of the application. Second, quick data retrieval techniques such as indexing, query processing, and concurrency control methods are presented. The paper provides vital insights to appraise the efficiency of the structured database in the unstructured environment, particularly when both consistency and scalability become an issue in the working of hybrid transactional and analytical database management systems.


INTRODUCTION
Internet computing and smartphones have made every layman, academician, industrialist, and developer a part of the database management system ecosystem. In light of this, there is a pressing demand to deliver the best knowledge in the least time to these users. Many different types of databases are in use today; however, for many years the relational database, in its various forms captured as row stores, column stores, and hybrid stores, has been dominant. The traditional single-machine database model can no longer meet the growing demands. As the amount of information increases, it must be stored on disk or on a remote server. Therefore, the queries used to populate a visualization may take minutes, hours, or longer to return, resulting in long waiting times between user interactions and diminishing the user's ability to explore the information rapidly [1].
Today, the trend is moving away from a "one-size-fits-all" structure [2]. During the early 1990s, companies wanted to collect data from multiple operational databases into a business intelligence data warehouse [...] the selection of attributes for indexes (which should be such that indexes can be built and removed from time to time) and access path dependency (the path chosen to fetch the data, either sequentially or by indices) [19].
Among the various data models proposed for database management systems over the years, this paper focuses on the relational model, which is the most prominent. Early referencing strategies in programming languages did not distinguish the logical structure of information from its physical structure. The relational model, introduced in the mid-1970s, gave a clear division [20]: information is organized as tables, and it is up to the database management system to decide how these tables are handled in memory [21].
Later, as the size of data increased with the advent of the internet, fast retrieval of data became the need of the hour. This led to the division of the single relational model into separate transactional and analytical data models, both maintaining the structured database. On an everyday basis, the mix of insert, delete, update, and simple queries on data is termed online transactional processing (OLTP). However, a typical issue with the database is that it tends to be very large, and query processing techniques frequently scan the whole data set repeatedly. So one of the preliminary approaches used was sampling. In contrast to conventional data, however, sampled data is inherently uncertain; it does not represent the entire population of data. It is therefore beneficial to return not only the results of the query but also confidence intervals that indicate the results' accuracy [22]. Besides, there may be no or too few samples in a certain section of a multidimensional space [23], and further work is needed to produce trustworthy findings. The conventional wisdom says that the cost of estimating the size of a query result by sampling is too high to be effective because (1) many samples must be taken to achieve reasonable accuracy, (2) sampling algorithms must perform one disk I/O per tuple inspected, while query evaluation algorithms can amortize the cost of a disk I/O over a page containing many tuples, and (3) in sampling, the overhead of starting an operation is incurred n times, where n is the number of samples taken, whereas if the query is evaluated directly the overhead is incurred only once [24]. This eventually made online analytical processing (OLAP) attractive. Further, as the velocity, variety, and volume of information have increased, a mix of both transactional and analytical databases has become unavoidable for real-time decision making [25]. A platform which enables handling an assortment of information and executes both transactions and complex queries with low latency is termed hybrid transactional and analytical processing (HTAP). Figure 1 presents a block diagram that summarizes the key points covered in the paper.
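To make the sampling idea above concrete, the following is a minimal sketch (not any cited system's algorithm) of how a sampling-based estimator can return both an aggregate estimate and a confidence interval; the table, column, and sampling fraction are illustrative assumptions.

```python
import math
import random

def estimate_sum(population, sample_fraction=0.01, z=1.96):
    """Estimate SUM(x) over a column from a uniform random sample and
    return the estimate together with an approximate 95% confidence interval."""
    n_total = len(population)
    n_sample = max(2, int(n_total * sample_fraction))
    sample = random.sample(population, n_sample)

    mean = sum(sample) / n_sample
    variance = sum((x - mean) ** 2 for x in sample) / (n_sample - 1)
    estimate = mean * n_total                       # scale sample mean up to the population
    std_err = n_total * math.sqrt(variance / n_sample)   # standard error of the scaled sum
    return estimate, (estimate - z * std_err, estimate + z * std_err)

if __name__ == "__main__":
    # Hypothetical "sales amount" column with one million rows.
    amounts = [random.uniform(1, 500) for _ in range(1_000_000)]
    est, (lo, hi) = estimate_sum(amounts)
    print(f"estimated SUM = {est:,.0f}, 95% CI = [{lo:,.0f}, {hi:,.0f}]")
    print(f"true SUM      = {sum(amounts):,.0f}")
```

The estimate is returned together with its interval, which is exactly the extra information reference [22] argues an approximate query should expose to the user.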

DATA STORAGE MODELS
The structure of the database can be seen as the process of representation, index creation, processing, and maintenance. Nelson et al. [26] proposed the evolutionary list file (ELF) structure, which was adopted in the design of the relational model. The ELF structure emphasizes the key elements of file structure as entities, lists, ... [27], incorporating reference numbers (now known as primary keys), user access rights, and the B-tree concept. Thereafter, several relational data models were proposed, of which a few are described below.

N-ary storage model (NSM)
This model is also well known as a row store, where N stands for the number of columns in the table. Most of these DBMSs followed the Volcano-style processing model, in which one tuple (or block) of information is processed at a time. Row stores stake a claim for workloads that query a large number of attributes from wide tables. Such workloads arise in business settings (e.g., network performance and management applications) as well as in scientific fields (e.g., neuroscience, chemical, and biological applications). Several important datasets in the medical field are made of more than 7000 attributes. One of the solutions is constantly increasing support for large tables; for instance, SQL Server today permits 30,000 columns per row, but at most 4096 columns are allowed in a SELECT query. This storage layout shows slower performance for OLAP queries, which need only a limited number of attributes, as row stores generate disk and memory bandwidth overhead [28, 29].

Decomposition storage model (DSM)
DSM [30] partitions an n-attribute relation vertically into n sub-relations, which are accessed based on the demand for the corresponding attribute values. In each sub-relation, a surrogate record-id field is required so that records can be pieced back together. Sybase IQ incorporates vertical partitioning for data warehouse applications in conjunction with bitmap indices [31]. DSM needs considerably less I/O than NSM for table scans involving just a few attributes. Queries involving multiple attributes from a relation need to spend extra time bringing the participating sub-relations together, and this extra time can be significant. There is also space wastage for the extra record-id field in each of the sub-tables.

Partition attributes across (PAX)
PAX [32] addresses the issue of low cache utilization in NSM. Unlike NSM, PAX groups all values of a specific attribute on a mini page within each page. PAX makes full use of cache resources during sequential access to data, as only the values of the required attribute are loaded into the cache. However, PAX does not optimize the I/O between disk and memory relative to DSM: like NSM, PAX loads into memory all the pages that belong to the relation for scans, regardless of whether the request needs all or only a few of the attributes.
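As a purely illustrative sketch (not any system's actual page format), the snippet below lays the same four records out under NSM, DSM, and PAX, which makes the contrast above easier to see: a scan of one attribute touches one mini page under PAX or one sub-relation under DSM, but every tuple under NSM.

```python
records = [
    (1, "Alice", 34),   # (id, name, age)
    (2, "Bob",   29),
    (3, "Cara",  41),
    (4, "Dan",   25),
]

# NSM (row store): whole tuples stored one after another in a page.
nsm_page = [field for rec in records for field in rec]

# DSM: one sub-relation per attribute, each value paired with a surrogate record id.
dsm = {
    "id":   [(i, rec[0]) for i, rec in enumerate(records)],
    "name": [(i, rec[1]) for i, rec in enumerate(records)],
    "age":  [(i, rec[2]) for i, rec in enumerate(records)],
}

# PAX: a single page holding one mini page per attribute, so scanning one
# attribute reads contiguous values and uses the cache well.
pax_page = {
    "minipage_id":   [rec[0] for rec in records],
    "minipage_name": [rec[1] for rec in records],
    "minipage_age":  [rec[2] for rec in records],
}

# Average age touches only the 'age' mini page under PAX.
print(sum(pax_page["minipage_age"]) / len(records))
```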

Multi-resolution block storage model (MBSM)
MBSM [33] addresses both I/O performance and main-memory cache use. It shares the favorable cache behavior of PAX, where attribute values are stored in physically contiguous segments. Disk blocks are grouped into super-blocks, with a single record stored in a partitioned manner among the blocks, and these blocks are in turn organized into fixed-size mega-blocks on the disk. MBSM only loads pages from the disk that contain the values of the relevant attributes. It offers better performance while inserting/updating data and does not need a join to rebuild records. It is faster for sequential scans and slower for random access, and it is optimized for uncompressed fixed-width data.

Column-stores
Column-store systems [34] partition a database vertically into a collection of individually stored columns. Here, instead of primary keys, virtual ids are used as tuple identifiers, since these ids reduce the width of the information stored on disk; the offset position is used as the virtual identifier. This layout is useful for OLAP operations. It offers better cache usage by passing cache-line-sized blocks of values among operators and working on multiple values at a time, instead of using an ordinary tuple-at-a-time iterator. It also provides late materialization, compression and decompression techniques, database cracking, and adaptive indexing features. Druid [35] is an open-source real-time analytical data store that is column oriented in structure. It supports two subsystems, one in the historical nodes for read optimization and the other in the real-time nodes for write optimization. The real-time subsystem is used to ingest data at a high rate but does not support data updates. The drawback is that frequent inserts and updates are slow.
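The role of virtual ids and late materialization mentioned above can be sketched as follows; this is a toy illustration under the assumption that the array offset serves as the tuple id, not the internals of any particular column store.

```python
# Columns of the same table; the array offset acts as the virtual tuple id.
price    = [120, 45, 300, 80, 45]
category = ["A", "B", "A", "C", "B"]
city     = ["NY", "SF", "NY", "LA", "SF"]

# Step 1: a predicate on a single column produces a list of positions only.
positions = [i for i, p in enumerate(price) if p > 60]

# Step 2: a second predicate filters the position list; still no tuples are built.
positions = [i for i in positions if category[i] == "A"]

# Step 3: late materialization, rebuilding just the qualifying tuples at the end.
result = [(city[i], price[i]) for i in positions]
print(result)   # [('NY', 120), ('NY', 300)]
```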

Hybrid stores
If one knows a priori the workload for a given application, then it is possible to use a sophisticated hybrid system that can be precisely calibrated for that workload. Through the analysis of query workloads, a cache-efficient attribute layout can be designed; this is the concept followed in data morphing [36]. The layout can then be used as a template for storing data in the cache. As a consequence, fewer cache misses occur during query processing compared to NSM and DSM under all types of workloads. The drawback of this technique is that if the query workload changes dynamically, reorganizing data according to a new layout can take time and degrade performance. A few such hybrid systems are mentioned below [37]; a small sketch of workload-driven column grouping follows the list.
a. Clotho storage model: separate data layouts are designed for in-memory and non-volatile storage. Clotho creates in-memory pages individually tailored for compound and dynamically changing workloads and enables efficient use of different storage technologies (e.g., disk arrays or MEMS-based storage devices) [38]. The layout on disk follows the PAX model, and the in-memory pages contain attributes according to the queries' demands.
b. Fractured mirrors retain the best of the two data models, namely NSM and DSM. Fractured mirrors create data layouts of both NSM and DSM on two disks, but they are not identically mirrored [39]. The NSM table is divided into two equal segments and, similarly, DSM is divided into two equal segments; one combination is placed on one disk and the other combination on the second disk. Thus, if there is query skew, there is still maximum disk utilization. Also, with fractured mirrors, better query plans can be formulated to minimize L1 and L2 cache misses. The drawback of the fractured mirrors design is the duplication of space.
c. H2O [40] presents two new concepts: adaptive data access patterns and multiple storage layouts for a single engine. During query processing, the system decides adaptively which partitioning model is best for that query class and its corresponding data pieces. As the load changes with the queries, the storage and the access patterns change with it. Vertical partitioning is done by creating groups of columns that are frequently accessed together.
d. Dynamic vertical partitioning [41] uses a set of rules for the development of an active system called DYVEP, which dynamically partitions a distributed database vertically. In static vertical partitioning, some attributes may never be used by queries, wasting space. DYVEP therefore monitors queries, collects and evaluates relevant statistics for the vertical partitioning process, decides whether a new partitioning is necessary, and, if so, fires the vertical partitioning algorithm. If the new vertical partitioning scheme is better than the existing one, the system reorganizes the layout. The architecture of DYVEP has three modules: a statistics collector, a partitioning processor, and a partitioning reorganizer. Acceptable query response time was observed when experimenting on the TPC-H benchmark database. Narrow partitions perform better for columns accessed as part of analytical queries (for example, through sequential scans). Conversely, wider partitions perform better for columns accessed as part of OLTP-style queries, because such transactions frequently insert, delete, update, or access many of a row's fields. A performance improvement of 20 to 400 percent over pure all-column or all-row structures was observed, and the approach is both more adaptable and delivers better plans for main-memory frameworks than previous vertical partitioning approaches.
Database vendors also ship different storage engines to serve workloads with different characteristics under the same software suite. For example, MySQL supports multiple storage engines (e.g., InnoDB, MyISAM), but it is difficult to communicate at the storage layer between different data formats; more specifically, a different execution engine is required for each storage engine.
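The sketch below illustrates the general idea behind data morphing and H2O-style layouts: count how often columns are accessed together in the observed workload and group the most co-accessed columns into the same vertical partition. The greedy pairing heuristic here is an assumption made for illustration, not the published algorithms.

```python
from collections import Counter
from itertools import combinations

# Observed workload: each query is the set of columns it touches (hypothetical).
workload = [
    {"id", "price"}, {"id", "price"}, {"price", "qty"},
    {"name", "addr"}, {"name", "addr"}, {"id", "price", "qty"},
]

# Count how often every pair of columns is accessed together.
co_access = Counter()
for q in workload:
    for a, b in combinations(sorted(q), 2):
        co_access[(a, b)] += 1

# Greedy grouping: merge the most frequently co-accessed pairs first.
groups = {}        # column -> group id
next_gid = 0
for (a, b), _ in co_access.most_common():
    ga, gb = groups.get(a), groups.get(b)
    if ga is None and gb is None:
        groups[a] = groups[b] = next_gid
        next_gid += 1
    elif ga is None:
        groups[a] = gb
    elif gb is None:
        groups[b] = ga

partitions = {}
for col, gid in groups.items():
    partitions.setdefault(gid, []).append(col)
print(list(partitions.values()))   # e.g. [['id', 'price', 'qty'], ['addr', 'name']]
```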

RELATED WORKS ON DATA RETRIEVAL
This section is divided into three subsections. The first subsection presents various strategies adopted for data indexing. The second describes common query processing techniques used to speed up data retrieval. The third describes concurrency control methods used to prevent transaction mismatches.

Indexing
Relational model indexing [42] is a blessing for the database management system, enabling fast retrieval of data by bypassing the traversal of every row. An index may include one or more columns, and sometimes the size of an index may grow larger than the table it is created for, but the index provides rapid lookup and fast access to the data, which compensates for the overhead of maintaining it. Table 1 provides a glimpse of the various indexing strategies followed so far. For example, application-domain indexes [49] apply to specific application domains, mainly in applications related to spatial data and video clips, where application-specific indexing software has to be incorporated.
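As a toy illustration of why an index compensates for its maintenance overhead, the sketch below contrasts a full-table scan with a lookup through a secondary index; a simple dictionary mapping attribute values to row positions stands in for the B-tree variants summarized in Table 1.

```python
from collections import defaultdict

rows = [(i, f"user{i}", i % 50) for i in range(100_000)]   # (id, name, dept)

# Without an index: every lookup scans all rows.
def scan_lookup(dept):
    return [r for r in rows if r[2] == dept]

# Secondary index on 'dept': attribute value -> list of row positions.
dept_index = defaultdict(list)
for pos, r in enumerate(rows):
    dept_index[r[2]].append(pos)

def index_lookup(dept):
    return [rows[pos] for pos in dept_index[dept]]

# Same answer, but the indexed lookup touches only the matching positions.
assert scan_lookup(7) == index_lookup(7)
```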

Query processing
A query is a language expression that identifies information to be retrieved from a database. To answer a query, a database optimization algorithm must choose from several known access paths [50]. The point of query optimization is to discover a query evaluation strategy that minimizes the most significant performance metric, which can be CPU, I/O, and network time, the time-space product of locked database objects and their lock spans, storage costs, total resource use, even energy consumption (for example, for battery-powered laptops), a mix of the above, or some other performance measure [51]. Table 2 presents several related studies dealing with query optimization. The internal specifics of implementing these access paths and deriving the associated cost functions are beyond the scope of this paper; a simplified sketch of cost-based plan comparison follows Table 2.

Table 2. Studies related to query optimization
[52] Objective: Proposes a method of dynamically adjusting database performance in a computer system based on the request made by the database query. Method: Access plans are generated for each incoming query; for each plan the desired memory and the estimated cost are calculated, and a comparison is made with the estimated cost for the current memory.
[53] Objective: Addresses the problem of cleaning stale data from materialized views. Method: A clean sample is used to estimate aggregate query results. The approach is a view maintenance strategy that is employed periodically for a given materialized view. During this periodic interval, the materialized view is marked stale and an estimated up-to-date view is developed.
[54] Objective: Proposes a caching method whereby the compiler is optimized to perform caching. Method: Every cache is assigned a web variable. If a query is run with no WHERE clause, then no free web variable is assigned; if a query has a WHERE clause, then that condition field becomes the web variable. Based on inserts and updates on the web variable, the cache is refilled accordingly.
[55] Objective: Designs a web cache with improved data freshness. Method: A main-memory database is used to cache result sets of previous queries for subsequent reuse. Invalidation of the cache works by replacing or deleting obsolete data directly in the caching system. For single-table queries, the parser extracts the selection predicates from the WHERE clause and maintains a list of predicates and their corresponding rowsets in the cache.
[56] Objective: Extends traditional lazy evaluation. Method: In lazy evaluation, the evaluation of a statement does not cause the statement to execute; instead, the evaluation produces a single batch of all the queries involved in building the model, executed only when the data from any of the queries in the batch is needed, either because the model needs to be displayed or because the data is needed to construct a new query.
[57] Objective: Analyses multiple queries and suggests frequent queries and query components. Method: Mining query plans for candidate queries and sub-queries for materialized views will, in turn, optimize all the queries sharing these components. These frequent components may represent frequent sub-queries. Importance is given to analysis of the query execution plan rather than the query text, since often the query is the same but its textual representation differs.
[58] Objective: Dynamically adjusts the materialized view set using clustering. Method: With the help of a similarity threshold, clusters of SQL queries are found; for any new query whose similarity falls below the threshold, a new cluster is formed.
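As a generic illustration of the plan-comparison idea in [52] (though not the cited method itself), the following sketch compares rough, made-up I/O cost formulas for a full scan versus an index scan and picks the cheaper plan; the constants are illustrative assumptions.

```python
def plan_costs(n_rows, selectivity, rows_per_page=100, index_height=3):
    """Very rough I/O cost estimates (in page reads) for two access paths."""
    full_scan = n_rows / rows_per_page                 # read every data page
    index_scan = index_height + selectivity * n_rows   # traverse the index, then
    return {"full_scan": full_scan,                    # roughly one page per match
            "index_scan": index_scan}

def choose_plan(n_rows, selectivity):
    costs = plan_costs(n_rows, selectivity)
    return min(costs, key=costs.get), costs

print(choose_plan(1_000_000, 0.0001))  # highly selective predicate -> index scan wins
print(choose_plan(1_000_000, 0.3))     # low selectivity -> full scan is cheaper
```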

Concurrency control
OLTP DBMSs support the part of the system that communicates with end users. End users send requests for some purpose (e.g., buy or sell an item) to the application. These requests are processed by the application, and transactions are then executed in the database to read from or write to it. Concurrency control has a long and rich history that goes back to the start of database systems. Two-phase locking and timestamp ordering are the two classes of algorithms used for concurrency control. The first method of ensuring the proper execution of simultaneous DBMS transactions was two-phase locking (2PL). Under this scheme, transactions must obtain locks on a particular database element before a read or write operation on that element can be performed. 2PL is considered a pessimistic approach because it assumes that transactions may conflict and therefore need to lock; however, if locking is uncontrolled, it may lead to deadlock. Another approach is the timestamp ordering (T/O) concurrency control scheme, where an increasing timestamp is generated to serialize transactions. These timestamps are used to order conflicting operations, such as a read and a write on the same element, or two different writes on the same element [16].
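A minimal, single-threaded sketch of the basic T/O rule described above follows: each transaction receives a monotonically increasing timestamp, and a read or write is rejected (forcing an abort and restart) when it arrives "too late" relative to the recorded read/write timestamps of the element. The names and structure are illustrative; real systems track these timestamps per element with far more machinery.

```python
import itertools

class TOElement:
    """One database element under basic timestamp ordering."""
    def __init__(self, value):
        self.value = value
        self.read_ts = 0    # largest timestamp that has read this element
        self.write_ts = 0   # largest timestamp that has written this element

counter = itertools.count(1)
def begin(): return next(counter)   # assign the transaction its timestamp

def read(elem, ts):
    if ts < elem.write_ts:          # a younger transaction already wrote -> too late
        raise RuntimeError("abort: read rejected")
    elem.read_ts = max(elem.read_ts, ts)
    return elem.value

def write(elem, ts, value):
    if ts < elem.read_ts or ts < elem.write_ts:   # would invalidate other transactions
        raise RuntimeError("abort: write rejected")
    elem.value, elem.write_ts = value, ts

x = TOElement(10)
t1, t2 = begin(), begin()
print(read(x, t2))        # ok: the element's read timestamp becomes 2
try:
    write(x, t1, 99)      # t1 is older than the reader t2 -> abort and restart
except RuntimeError as e:
    print(e)
```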
Multi-core systems are now omnipresent [59]; conventional DBMSs, however, still scale poorly beyond a few cores. The main weakness of OLTP databases is their concurrency control algorithm. Previous work showed that concurrency control algorithms suffer from both fundamental and artificial scalability bottlenecks [60, 61]. With a large number of cores, threads spend a significant amount of time waiting to obtain a global lock; this is a fundamental bottleneck inherent in the way these algorithms work. The two major scalability bottlenecks are (1) conflict detection and (2) timestamp allocation. For conflict detection, detecting and resolving deadlocks is a burdensome operation that consumes most of a transaction's processing time as the number of cores increases, even at relatively low contention levels. Although recent work removes some artificial weaknesses, the fundamental bottlenecks remain.
In conventional databases, the function of concurrency control mechanisms is to decide the interleaving order of operations between simultaneous transactions over shared data. But there is no fundamental reason to run concurrency control logic during the actual execution, nor is it necessary to force the same thread to be responsible for executing both the transaction and the concurrency control logic. This significant insight, established in subsequent studies [62], could lead to a complete paradigm shift in how we think about transactions [63, 64]. It is important to note that the two tasks of ordering accesses to shared data and implementing the logic of the transaction are completely independent; such functions can therefore technically be carried out by separate threads in different phases of execution. For instance, Ren et al. [65] propose ORTHRUS, which is based on pessimistic concurrency control, where transaction-executing threads delegate locking functionality to dedicated lock manager threads. ORTHRUS depends on explicit message passing to communicate among threads, which can introduce unnecessary overhead to transaction execution despite the shared-memory model available on a single machine.
Multi-version concurrency control (MVCC) is a commonly used method for concurrency control, as it enables execution modes where readers never block writers. MVCC methods have a long history as well. Rather than updating data objects in place, each update creates a new version of that data object, so that simultaneous readers can still observe the old version while the update transaction proceeds concurrently. Therefore, read-only transactions never need to wait and do not need to use locks. This is a highly desirable property and the reason why numerous DBMSs implement MVCC [66]. The optimistic approach to concurrency control came from Kung and Robinson, but only single-version databases were considered [67]. Optimistic techniques [68] are used in applications where transaction conflicts are very rare. In this approach, transactions are allowed to proceed as desired until a write phase is encountered, thereby increasing the degree of concurrency.
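The version-chain idea behind MVCC can be sketched as follows: each update appends a new version stamped with its commit timestamp, and a reader sees the newest version not later than its snapshot timestamp, so readers never block writers. This is a simplified, single-threaded illustration with no garbage collection or conflict handling.

```python
class MVCCRecord:
    def __init__(self, value, ts=0):
        self.versions = [(ts, value)]   # list of (commit_ts, value), append-only

    def write(self, commit_ts, value):
        # An update never overwrites; it appends a new version.
        self.versions.append((commit_ts, value))

    def read(self, snapshot_ts):
        # Return the newest version visible to this snapshot.
        visible = [v for ts, v in self.versions if ts <= snapshot_ts]
        return visible[-1]

r = MVCCRecord("v0")
r.write(commit_ts=5, value="v1")
r.write(commit_ts=9, value="v2")
print(r.read(snapshot_ts=7))   # 'v1': an older reader still sees its snapshot
print(r.read(snapshot_ts=10))  # 'v2': a newer reader sees the latest commit
```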
Snapshot isolation (SI) [69] is a multi-versioning scheme used by many database systems. To isolate read-only transactions from updates, many commercial database systems support snapshot isolation, which is not serializable (Oracle [70], SQL Server [71], and others). However, many articles have addressed the conditions under which SI can be serialized or how it can be made serializable. In 2008, Cahill et al. released a detailed and practical solution [72]. Their technique requires transactions to check for read-write dependencies. Techniques such as read validation and predicate repeatability checks have been used in the past [73]. Oracle TimesTen [70] and IBM's solidDB [74] employ multiple lock types.

RUM conjecture
The fundamental challenge faced by every researcher, programmer, or architect when designing a new access method is how to reduce read times (R), update costs (U), and memory/storage overhead (M). It is observed that optimizing for any two of these areas negatively impacts the third. Figure 2 presents an overview of the RUM conjecture.
The three criteria that people working on database systems seek to design for are outlined by researchers from the Harvard DB lab: read overhead, update overhead, and memory overhead [75]. Tree-based structures such as log-structured merge (LSM) trees [78] optimize write performance. Updates and deletions in LSM trees do not require searching for data on disk, and sequential writes are guaranteed by deferring and buffering all insert, modify, and delete operations. This comes at the price of higher maintenance costs, compaction requirements, and more expensive reads. At the same time, LSM trees help improve memory efficiency by avoiding reserved space and enabling block compression.
Compression (for instance, algorithms such as Gorilla compression [79], delta encoding, and many others) can be used to maximize memory capacity at the cost of some read and write overhead. Heap files and hash indexes, for example, give good performance guarantees and lower storage overhead due to the simplicity of the file format. Accuracy can be traded for efficiency by using approximate data structures such as Bloom filters [45], HyperLogLog [80], Count-Min Sketch [81], sparse indexes such as zone maps [82], column imprints [83], and many others. Choosing which overhead(s) to optimize for, and to what degree, remains a prominent part of the process of designing a new access method, particularly as hardware and workloads change over time [84].
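To make the accuracy-for-memory trade-off concrete, the following is a compact Bloom filter sketch: membership tests may return false positives but never false negatives, so a tiny bit array can avoid most unnecessary lookups. The bit-array size and hashing scheme here are illustrative choices, not those of any cited system.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits)        # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive k bit positions from salted hashes of the key.
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.n_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
for k in ("alice", "bob", "cara"):
    bf.add(k)
print(bf.might_contain("bob"))    # True
print(bf.might_contain("zoe"))    # usually False; occasionally a false positive
```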

STORAGE-TIERING TECHNIQUES
Storage tiering [85], or transferring data between different storage types to match data I/O characteristics to those of the storage, is a strategy that administrators have been using for many years. Data aging is the simplest and most common reason for this: as sections of the data age, their access patterns change and they are less likely to be accessed; the data cools down and eventually gets cold. Cold data has very different I/O patterns from warm and hot data, and thus different storage needs. Hot data may also have specific storage needs, such as a largely random I/O pattern for an OLTP server workload. For example, SQL Server can use a per-node or per-CPU design, and even latches and memory objects can be dynamically partitioned by SQL Server to minimize hot spots.
There are several customary strategies for assessing whether data is considered significant: (i) access frequency analysis based on the workload, (ii) predefined business rules based on expert analysis or functional expertise, and (iii) a caching approach. The principle behind access frequency analysis is simple: the more the data is accessed, the more important it is, and hence the higher the likelihood of repeated accesses. There are various approaches to quick and resource-efficient monitoring of data accesses; one example is the logging scheme proposed by Levandoski et al. [86], which uses sampling and smoothing techniques to minimize memory usage. By contrast, rule-based approaches require application experts to work manually [87]. An example might be: "All payments made more than a year ago are cold." Business rules have the advantage of encoding domain knowledge, and high precision can be achieved, especially given the time parameter. At the same time, experts can devise rules that are too weak and lose optimization potential. Also, current online transactional or analytical processing (OLxP) frameworks are becoming progressively more complex because of the combination of different platforms (for example, operational reporting, warehousing, transaction processing, data mining applications, and so forth), which makes purely rule-driven approaches impractical in the long run. Caching is another way of finding important data, for example, page buffer monitoring using a least-recently-used or most-recently-used policy [88]. Such methods can be faster than frequency approaches and consume less memory; on the other hand, they may be less precise and have high bookkeeping overhead. Developers need to update their cache configuration constantly as the workload of their application varies [89]. In reality, however, it is difficult to identify changes in workload for large applications [90], and even after recognizing a change, developers still have to invest considerable effort to understand the new workload and re-configure the caching system manually.
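The access-frequency approach can be sketched as keeping a decayed access counter per record: each access bumps the score, and scores decay over time, so recently and frequently touched data ranks as "hot". The exponential-decay formula and half-life below are illustrative stand-ins for the sampling and smoothing scheme of [86], not its actual algorithm.

```python
import math
import time

DECAY_HALF_LIFE = 3600.0   # seconds for a score to halve without new accesses

class HotnessTracker:
    def __init__(self):
        self.scores = {}   # record id -> (score, last_update_time)

    def record_access(self, rid, now=None):
        now = time.time() if now is None else now
        score, last = self.scores.get(rid, (0.0, now))
        decayed = score * math.exp(-(now - last) * math.log(2) / DECAY_HALF_LIFE)
        self.scores[rid] = (decayed + 1.0, now)

    def hotness(self, rid, now=None):
        now = time.time() if now is None else now
        score, last = self.scores.get(rid, (0.0, now))
        return score * math.exp(-(now - last) * math.log(2) / DECAY_HALF_LIFE)

tracker = HotnessTracker()
for _ in range(10):
    tracker.record_access("row42", now=0.0)           # ten accesses at time 0
print(round(tracker.hotness("row42", now=0.0), 1))    # ~10.0 -> hot
print(round(tracker.hotness("row42", now=7200.0), 1)) # ~2.5  -> cooling down
```

Records whose score falls below a threshold would be candidates for migration to a colder storage tier.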
A technique for query filtering and tuple reconstruction was suggested by Boissier et al. [91]. All the attributes used to evaluate queries are stored in DRAM. This information provides enough data to partition the records into two segments: (i) tuples kept in main memory and (ii) tuples moved to disk. Two methods were presented to compute the hotness of a table cell, and the advantages and disadvantages of both approaches were discussed. To do so, they introduced two kinds of bit vectors, columnar and row vectors, containing information about the hotness of a column or a row. They then proposed clustering segments with similar hotness behavior and consolidating the columnar vectors of each cluster into a single vector. This helped to reduce the complexity of sorting large datasets and of tuple reconstruction.

BIG DATA MODELS: INTERDEPENDENCE AND COMPLEMENTARITY

Data storage model
SQL in MapReduce systems: MapReduce, well known for processing very large datasets in a parallel fashion using computer clusters, is a highly distributed and scalable programming model. However, when it comes to joining different datasets or finding the top-k records, it is comparatively slow. To overcome this, the relational approach needs to be adopted. Two strategies are employed: (1) usage of an SQL-like language on top of MapReduce, e.g., Apache Pig and Apache Hive, and (2) integration of MapReduce-style computation with SQL, e.g., Spark; a small illustrative snippet of the second strategy follows.
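The snippet below is a hedged example of the second strategy using Spark's SQL interface over a distributed dataset; the file path, table, and column names are hypothetical, and it assumes a working PySpark installation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-bigdata").getOrCreate()

# Hypothetical order data stored as Parquet on a distributed file system.
orders = spark.read.parquet("hdfs:///data/orders")   # path is illustrative
orders.createOrReplaceTempView("orders")

# A declarative SQL query (join/top-k style work that is clumsy to hand-code
# as raw MapReduce jobs) is compiled by Spark into a distributed plan.
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
top_customers.show()
spark.stop()
```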
Not only SQL (NoSQL): NoSQL provides higher availability and scalability than traditional databases by using a distributed and fault-tolerant architecture [92]. With these systems, computers can be added easily, data can be replicated, and consistency is supported eventually over time. The various types of NoSQL stores are key-value stores (e.g., Cassandra, Riak), document stores (e.g., CouchDB, MongoDB), tabular data stores (e.g., HBase, BigTable), and graph databases. However, in most of these, creating custom views, normalization, and working with multiple relations is difficult. Though SQL was originally not supported by NoSQL, present observations indicate the need for a highly expressive and declarative query language like SQL. This is observed in systems such as SlamData, Impala, and Presto, which support SQL queries on MongoDB, HBase, and Cassandra [93].
Graph databases are widely explored nowadays to analyze massive online semantic webs, social networks, road networks, and biological networks. Reachability and shortest-path queries are time-consuming to perform in relational databases using SQL, and writing recursive queries and chains of joins is unpleasant. However, since an RDBMS stores many relations that map to graphs in real applications, a combination of the two could produce better results. AgensGraph [94] is a multi-model graph database based on the PostgreSQL RDBMS. It enables developers to integrate the legacy relational data model and the flexible graph data model in one database.
NewSQL: This class of database incorporates both the scalability and availability of NoSQL systems and the ACID properties of the relational model. Systems with SQL engines, such as InfoBright and MySQL Cluster, support the familiar SQL interface with better scalability. Sharding is done transparently in systems like ScaleBase and dbShards, which provide a middle layer that fragments and distributes databases over multiple nodes. The choice among these NewSQL systems depends on the application scenario [60].

Indexing
Indexing in big data is a challenge for several reasons [95], primarily because the volume, variety, and velocity (the 3 Vs of Gartner's definition) of big data are very large; a generated index should be small (a fraction of the original data) and fast relative to the data, because searches must consider billions or even trillions of data values in seconds. As big data models entertain multi-variable queries, indexes are required for all in-operation variables to provide faster access for individual variable searches, whose results are then combined into a collective response. Models should be able to produce smaller indexes when granularity reduction is possible, and the indexes should be easy to partition into sections whenever opportunities for parallel processing arise. As the rate of data generated in the big data era is very high, to process data in place an index should be built at the same rate. Indexing strategies in big data can be classified into an artificial intelligence (AI) approach and a non-artificial-intelligence (NAI) approach. In the AI approach, the indexes are formed based on relationships established by observing similar patterns and traits among data items or objects. The two popular AI indexing approaches are:
a. Latent semantic indexing, which uses singular value decomposition to find patterns in unstructured databases; it consumes more memory space.
b. The hidden Markov model, which is based on the Markov model. It uses states connected by transitions that depend on the present state and are independent of previous states. States depend on the current query and store the results, which also helps predict future queries. It demands high computational performance.
In the NAI approach, the indexes are based on frequently queried data sets. It uses indexes from the relational model such as the B-tree, R-tree, X-tree, LSM tree, and Bloom filter. There are also indexes such as generalized search trees and the generalized inverted index, which is implemented similarly to a B-tree and supports arbitrary indexes.

Query processing
Even for newer platforms based on MapReduce and its variants, which are well known as highly parallel platforms, the network-transmission cost, storage cost, and I/O cost of moving data across nodes are significant where query optimization is concerned. Popular frameworks such as MapReduce hence lag in query processing efficiency [95]. In MapReduce, a query also has to be translated into a set of MapReduce jobs statement by statement, which is time-consuming and difficult. Hence, high-level languages proposed for MapReduce, such as Pig and Hive, resemble SQL in many ways. Google's BigQuery runs in the cloud to overcome these optimization issues. Here, the data is not indexed but stored in columnar format, and a tree architecture is used in which the data is distributed among nodes; it provides a query language similar to SQL. Cassandra employs materialized views for complex query analysis, so expensive table joins and data aggregation are avoided.

Concurrency control
Versioning the database is an important feature to control concurrency in big data models [95]. New changes are assigned a number, known as the version or revision number. This makes it possible to find the latest version of a database table and eases audit trailing. MongoDB, CouchDB, HBase, and Google's BigTable are a few examples that use versioning. With this powerful technique of versioning such voluminous data come potential problems: all the data that makes up a particular version needs to be stored again in the database, which can lead to interdependencies between objects, so tracking the differences becomes a task in itself.

HTAP
HTAP systems are capable of reducing deployment, maintenance, and storage costs by eliminating the ETL cycle and, most significantly, allowing real-time analysis over production data. Terms such as online transactional analytical processing (OLTAP) or online transactional or analytical processing (OLxP) are also used synonymously [96]. The Oracle RDBMS in-memory option (DBIM) [97] is an industry-first distributed dual-format architecture that enables highly optimized columnar-format processing of a database object in main memory, to break performance barriers in analytical query workloads, while retaining transactional compatibility with the corresponding OLTP-optimized row-major format, which remains in storage and is accessed through the database buffer cache.
SAP HANA scale-out extension [98], a modern distributed database architecture, was designed to separate transaction processing from query processing and to use a high-performance distributed shared log as the persistence backbone. It provides full ACID guarantees (atomicity, consistency, isolation, and durability) and is based on a logical timestamp framework to provide MVCC-based snapshot isolation without requiring synchronous replica refreshes; instead, it uses asynchronous update propagation with timestamp validity to ensure consistency. MemSQL [99] is a distributed, memory-optimized SQL database with a shared-nothing architecture that excels at mixed analytical and transactional processing. It exploits in-memory data storage with MVCC and memory-optimized lock-free data structures to permit highly concurrent reading and writing of data, empowering ongoing analytics over an operational database. It ingests and keeps data in row format in memory; however, once data is written to disk, it is translated to columnar format for faster reads. Appuswamy et al. [100] designed a new architecture for hybrid transactional and analytical processing named heterogeneous-HTAP (H2TAP). OLAP queries are data-parallel tasks, and H2TAP uses a kernel-based execution model for executing OLAP queries on the GPU; for OLTP queries, which are task-parallel, it uses message-passing-based parallelism. Kemper et al. [34] proposed a hybrid system, called HyPer, which can handle both transactional and analytical queries simultaneously by maintaining consistent snapshots of the transactional data. HyPer maintains the ACID properties of OLTP transactions and executes multiple OLAP queries on current and consistent snapshots. The system makes use of the operating system's fork() functionality. All OLTP modifications are made to a separate area called the delta, which consists of replicated database pages, while OLAP operations are given a virtual-memory view. A virtual-memory snapshot is created with the fork() operation and is used for executing the OLAP session; periodically, the OLAP view is converged with the delta by forking another process for a fresh, up-to-date OLAP session.
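The fork()-based snapshotting idea can be sketched minimally as below: the parent process keeps applying OLTP updates while a forked child runs an analytical query over a copy-on-write snapshot of the same in-memory data. This is a POSIX-only illustration of the snapshot semantics, assuming Python's os.fork(), and not HyPer's actual engine.

```python
import os

# In-memory "table" owned by the OLTP process.
balances = {f"acct{i}": 100 for i in range(1_000)}

pid = os.fork()
if pid == 0:
    # Child: sees a copy-on-write snapshot of the data as of the fork.
    snapshot_total = sum(balances.values())
    print(f"[OLAP child]  total at snapshot time = {snapshot_total}")
    os._exit(0)   # exit the child without running the parent's cleanup handlers
else:
    # Parent: keeps processing OLTP updates; the child's snapshot is unaffected.
    balances["acct0"] += 500
    os.waitpid(pid, 0)
    print(f"[OLTP parent] total after updates    = {sum(balances.values())}")
```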
The Peloton project [101] intends to create a self-driving, multi-threaded, in-memory HTAP system. It provides versatile organization of data, changing the data layout at run time based on the demand type [102]. Wildfire [103] is an IBM Research project that produces an HTAP engine where both analytical and ongoing requests go through the same columnar data format, i.e., Parquet [104] (a non-proprietary storage format), open to any reader for all data. The Spark ecosystem is also used by Wildfire to allow distributed analytics at large scale. Using a single data format for both data ingestion and analysis allows the latest committed data to be analyzed immediately.
Sadoghi et al. [105] proposed L-Store (lineage-based data store), which combines real-time transactional and analytical workload processing within a single unified engine by introducing a contention-free and lazy staging of data from a write-optimized to a read-optimized layout. Pinot [106] is an open-source, distributed relational data store proposed by LinkedIn. Though it is designed mainly to support OLAP, for applications that require data freshness it is able to directly ingest streaming data from Kafka. Like a traditional database, it consists of records in tables that are further divided into segments. The data in segments is stored in columnar format; hence, data size can be minimized using techniques such as bit packing.

PERFORMANCE FACTORS
Various key factors affect the performance of the database, and learning these variables helps identify performance opportunities and avoid problems [107]. The performance of the database is heavily dependent on disk I/O and memory utilization. The baseline performance of the hardware on which the DBMS is deployed needs to be known in order to set performance expectations accurately. The performance of hardware components such as CPUs, hard drives, disk controllers, RAM, and network interfaces has a significant impact on how quickly the database performs. System resources are, therefore, one of the key factors affecting performance. Overall system performance may also be influenced by DBMS upgrades. The formulation of the query language, database configuration parameters, table design, data distribution, and so on permit the database query optimizer to create the most productive access plans.
The workload is the total of all requests made to the DBMS, and it varies over time. At any given moment, the overall workload is a blend of user queries, applications, batch tasks, transactions, and system commands. For example, during month-end reporting the workload increases, whereas on other days the workload is minimal. Workload strongly influences database performance. Selecting data structures that adapt to changes in workload, and planning for peak demand times, helps prepare for the most productive utilization of system resources and enables the maximum possible workload to be handled.
Transaction throughput generally measures the speed of a database system, expressed as a number of transactions per second. Disk I/O is slow relative to other system resources such as the CPU, which reduces throughput; hence, techniques are used to keep more data in cache or memory. Contention is another condition, in which at least two workload components attempt to use the system in a conflicting way, for instance, multiple queries attempting to update the same piece of data simultaneously, or multiple workloads competing for system resources. Throughput therefore diminishes as contention increases. Latency, also called response time or access time, is a measure of how long it takes the database to respond to a single request. Two such latencies in a database are (i) write latency, the number of milliseconds required to fulfill a single write request, and (ii) read latency, the number of milliseconds required to fulfill a read request [108]. In parallel databases, every lock operation in a processor introduces considerable latency. Overload detection is also desirable, to keep the response time of a server below a certain threshold and retain customers [109]. Nevertheless, if many user transactions execute concurrently, a database system can be overwhelmed; as a consequence, computing resources such as CPU cycles and memory space can be exhausted, and because of data contention most transactions can be blocked or prematurely ended and restarted. Cost models are useful in analyzing cost in terms of physical space and query processing time [110]. Cost models provide the inputs (data distribution statistics and operator execution algorithms in a query plan) needed to estimate the cost of query plans and thereby facilitate the selection of optimal plans [111, 112]. An important concept underlying the execution of a query plan is that processing less data takes fewer resources and/or less time [113]. Which series of operations requires less data and time for request execution is not always clear; thus, the cost model provides real insight into a database system's actual physical design and performance.

CONCLUSION
The study focused on the importance of the relational model's merits to big data. Potential determinants (e.g., data storage layouts and data retrieval techniques) related to the relational model were also highlighted. The paper mainly contributed to the literature on storage layouts, indexing, query processing, and concurrency control techniques in the relational model. The study revealed that there are a few key areas that still need to be worked on in relational models. The identification of gaps in indexing has led to the understanding that it is not straightforward to index data that is distributed and accessed on a large scale so as to allow efficient point lookups; faster point access to these objects is still an open issue. Efficient indexing algorithms have been explored, but as the RUM conjecture states, optimizing any two dimensions leads to the third being degraded. Concurrency in many-core systems is still a matter of research, and in the case of MVCC, the overhead grows with each concurrent update.
The paper also highlighted the ongoing research on hybrid transactional and analytical processing (HTAP). While there are many systems classified as HTAP solutions, none of them supports true HTAP. Current frameworks provide a reasonable platform to support transactional as well as analytical requests when sent separately to the system; nonetheless, none of the existing solutions supports effective transactional and analytical request processing within the same transaction. Today, most HTAP systems use multiple components to provide all the desired capabilities, and different groups of people usually maintain these different components. It is, therefore, a challenging task to keep these components compatible and to give end users the illusion of a single system. Last, the study also described the adoption and implementation of a few significant methodologies of the relational model in big data. Present systems demand that data be both "offensive" (exploratory) and "defensive" (controlled).