Teradata was created by Teradata Corp. in 1976 and is considered one of the leading Relational Database Management Systems available today. It is worth understanding how Teradata has stayed a market leader for so many years. In this blog, we’ll cover Teradata’s unique architecture to learn why so many people have found the tool useful for so long.
What is Teradata?
Teradata is a terabyte-scale Relational Database Management System (RDBMS) built for data warehousing. It offers high speed, high efficiency, and multi-user access by utilizing the concept of ‘Parallelism’. Teradata is based on Massively Parallel Processing (MPP) Architecture. An MPP architecture follows a “Divide and Conquer” approach: to execute a task, Teradata’s processor (read: Teradata’s brain) divides the task evenly across the entire system. We will discuss MPP architecture in detail in the following sections.
Important Features of Teradata
Some of the salient features of Teradata make it a popular Data Warehousing product. The following are the most important features of Teradata:
Shared Nothing Architecture: Teradata’s architecture is called Shared Nothing Architecture. It means that each component in the Teradata architecture works independently and nothing (disk storage, processing unit) is shared between the independent components.
Linear Scalability: Linear scalability means that system performance increases in proportion to the components you add. Adding more hardware lets you keep the same throughput as the workload on your Teradata System grows. Teradata can scale from 100 gigabytes to more than 100 petabytes of data without performance degradation. The graph below shows the linear scalability feature of a Teradata System:
Efficient Data Load and Export Utilities: Teradata offers purpose-built utilities for data loading and unloading use cases. Utilities like TPump, FastLoad, FastExport (FEXP), Teradata Parallel Transporter (TPT), and MultiLoad move data in and out of a Teradata System at very high speed.
Mature Optimizer: Teradata’s Optimizer is one of the most mature optimizers in the market. It has been designed to support parallel computing since its inception, and it can handle up to 64 joins in a single query.
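To make the linear scalability claim concrete, here is a minimal arithmetic sketch. It assumes an idealized system in which every AMP contributes the same throughput; the function names and the figures (rows per second, AMP counts) are made up for illustration and are not Teradata measurements.

```python
def system_throughput(amp_count, rows_per_amp_per_sec):
    """Idealized throughput: each AMP adds the same amount of capacity."""
    return amp_count * rows_per_amp_per_sec


def scan_seconds(total_rows, amp_count, rows_per_amp_per_sec):
    """Time to process total_rows when the work is spread over all AMPs."""
    return total_rows / system_throughput(amp_count, rows_per_amp_per_sec)


# Hypothetical numbers: doubling both the data and the AMPs keeps the time flat.
baseline = scan_seconds(1_000_000, amp_count=4, rows_per_amp_per_sec=50_000)
doubled = scan_seconds(2_000_000, amp_count=8, rows_per_amp_per_sec=50_000)
print(baseline, doubled)
```

In this model the elapsed time stays constant when the workload and the hardware grow together, which is what the linear scalability feature promises.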
One acronym that you will come across many times while reading about Teradata’s Architecture is MPP, i.e., Massively Parallel Processing. It helps to understand an MPP system before diving deep into Teradata’s architecture. MPP is a computing concept where a task is broken up into multiple subtasks, and these subtasks are processed in a coordinated way by multiple processors, with each processor using its own memory and operating system. Because the subtasks run at the same time, MPP systems are generally faster and more efficient than systems that run the same work on a single processor. An MPP system is also cost-effective (especially for Data Warehousing use cases) and easy to deploy.
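The divide-and-conquer idea can be sketched in a few lines of Python. This is only an illustration of splitting a task into subtasks and combining the partial results: the helper names are hypothetical, and the threads used here share one machine's memory, whereas the processors in a real MPP system each have their own memory and operating system.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_subtasks(values, worker_count):
    """Deal values round-robin so every worker gets a near-equal share."""
    return [values[i::worker_count] for i in range(worker_count)]


def partial_sum(chunk):
    """The subtask each worker runs on its own share of the data."""
    return sum(chunk)


def parallel_sum(values, worker_count=4):
    """Split the task, run the subtasks concurrently, then combine the results."""
    chunks = split_into_subtasks(values, worker_count)
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)


print(parallel_sum(list(range(1, 101))))
```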
Now that we know a little bit more about an MPP system, let’s talk about Teradata’s much-celebrated Architecture. Teradata’s Massively Parallel Processing (MPP) Architecture is based on three core components:
- Parsing Engine (PE), which contains the Optimizer
- Access Module Processors (AMPs)
- Message Passing Layer (BYNET)
A logical architecture is shown in the following diagram highlighting these components:
Let’s look into these three cornerstones of Teradata Architecture in more detail.
Parsing Engine (PE): Whenever a user connects to Teradata, the connection is actually established with the Parsing Engine. As the diagram shows, a Parsing Engine is further divided into three components:
- Parser: The parser receives the user query and first checks the syntax of the submitted query. Next, it checks whether the referenced objects are present in the database or whether the query refers to an invalid database object. Thirdly, it checks whether the user who submitted the SQL query has the appropriate privileges; i.e., it does the Privileges Check. Lastly, the parser generates an SQL token and passes control to the optimizer.
- Optimizer: Teradata’s optimizer is a cost-based optimizer. It takes two inputs: the SQL token from the parser and the metadata information from the system. Based on these inputs, the optimizer comes up with different plans, each with its own cost. Now, the obvious question arises: What is a cost? Cost is a summed-up numeric value of the time needed to execute a plan and the resources (CPU cycles, internal resources) that the plan will use. The plan with the lowest cost is selected by the optimizer and sent to the dispatcher.
- Dispatcher: The dispatcher controls the sequence of the execution steps and passes the plan received from the optimizer to the AMPs through the Message Passing Layer.
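The three Parsing Engine stages above can be sketched as a toy pipeline. Everything here is hypothetical: the `Plan` class, the function names, the toy syntax rule, and the assumption that cost is a plain sum of a time estimate and a resource estimate. It illustrates the flow described in the text, not Teradata's actual parser or optimizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    estimated_seconds: float         # time estimate for the plan
    estimated_resource_units: float  # CPU and internal resources, made-up units
    steps: tuple

    @property
    def cost(self):
        # Assumption: cost is a plain sum of the time and resource estimates.
        return self.estimated_seconds + self.estimated_resource_units


def parse(sql, known_tables, readable_tables):
    """Parser: syntax check, object check, privileges check, then tokens."""
    tokens = sql.strip().rstrip(";").split()
    upper = [token.upper() for token in tokens]
    # 1. Syntax check (toy rule: SELECT ... FROM <table>).
    if not upper or upper[0] != "SELECT" or "FROM" not in upper[1:-1]:
        raise ValueError("syntax error")
    table = tokens[upper.index("FROM") + 1]
    # 2. Object check: the referenced table must exist.
    if table not in known_tables:
        raise LookupError(f"unknown object: {table}")
    # 3. Privileges check: the user must be allowed to read the table.
    if table not in readable_tables:
        raise PermissionError(f"no SELECT privilege on {table}")
    return tokens


def choose_plan(plans):
    """Optimizer: pick the candidate plan with the lowest cost."""
    return min(plans, key=lambda plan: plan.cost)


def dispatch(plan):
    """Dispatcher: emit the chosen plan's steps in execution order."""
    return list(enumerate(plan.steps, start=1))


tokens = parse("SELECT * FROM sales;", known_tables={"sales"}, readable_tables={"sales"})
candidates = [
    Plan("full_table_scan", 12.0, 30.0, ("scan all rows", "filter", "return")),
    Plan("index_lookup", 2.0, 5.0, ("probe index", "return")),
]
best = choose_plan(candidates)
ordered_steps = dispatch(best)
print(best.name, ordered_steps)
```

A real optimizer derives its estimates from system metadata rather than hard-coded numbers; the sketch only shows that plan selection reduces to taking the minimum over candidate costs.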
Message Passing Layer (BYNET): As the name suggests, the Message Passing Layer (aka BYNET) is the networking layer in the Teradata architecture and is responsible for communication between the other components. BYNET has both a physical component and a software component. BYNET’s physical component, called Enterprise Serial Bus, acts as a connector, and the software component enables the communication. BYNET supports three kinds of communication: point-to-point, multi-cast, and broadcast.
- Point-to-Point Communication: Message is routed to a single AMP
- Multi-cast Communication: Message is routed to multiple AMPs
- Broadcast Communication: Message is routed to all AMPs
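The three communication modes differ only in which AMPs receive a message. The following sketch models that choice with a hypothetical `route` function and integer AMP ids; it says nothing about how BYNET is actually implemented.

```python
def route(kind, all_amps, targets=()):
    """Return the AMPs that should receive a message of the given kind."""
    if kind == "point-to-point":
        if len(targets) != 1:
            raise ValueError("point-to-point needs exactly one target AMP")
        return list(targets)
    if kind == "multi-cast":
        wanted = set(targets)
        return [amp for amp in all_amps if amp in wanted]
    if kind == "broadcast":
        return list(all_amps)
    raise ValueError(f"unknown communication kind: {kind}")


amps = [0, 1, 2, 3]
print(route("point-to-point", amps, targets=[2]))
print(route("multi-cast", amps, targets=[1, 3]))
print(route("broadcast", amps))
```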
The Parsing Engine dispatches SQL queries to the AMPs (Access Module Processors) through BYNET, and the AMPs return the result to the user through BYNET. A Teradata system always has two BYNETs, called BYNET 0 and BYNET 1. There are two reasons for this:
- If one BYNET fails, the other can ensure the communication between the PE and AMPs
- It increases the efficiency of the Teradata System as communication is faster between the PE and AMPs
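The redundancy argument can be sketched as a simple failover loop. The `Bynet` class and `send_with_failover` function are hypothetical stand-ins, and the sketch covers only the first point (surviving a failure), not the speed benefit of having both BYNETs active.

```python
class Bynet:
    """Toy stand-in for one BYNET; 'healthy' simulates whether it is up."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def send(self, message):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} delivered: {message}"


def send_with_failover(message, bynets):
    """Try each BYNET in turn and return the first successful delivery."""
    for bynet in bynets:
        try:
            return bynet.send(message)
        except ConnectionError:
            continue  # fall through to the next BYNET
    raise ConnectionError("all BYNETs are down")


links = [Bynet("BYNET 0", healthy=False), Bynet("BYNET 1")]
print(send_with_failover("step 1", links))
```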
AMP (Access Module Processor aka VProc): AMP stands for Access Module Processor and is also known as VProc (Virtual Processor). The AMPs are the database execution engine, and a Teradata system has many of them. Each AMP listens to the Parsing Engine through BYNET and returns its result set to the user through BYNET.
Each AMP in a Teradata system has its own disk storage attached to it, called a VDISK, as shown in the architecture diagram referenced above. The VDISKs store the data rows: Teradata splits a big table into multiple small subsets and distributes the rows evenly across all VDISKs. Thus, each AMP manages a portion of the database and a portion of each table. Each AMP is responsible for managing and working with its own set of data, and in this way the highest level of parallelism is obtained. Each AMP works with its own VDISK and nothing (no data, no memory) is shared between the AMPs. This is known as Shared Nothing Architecture.
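One way to picture how rows end up spread across VDISKs is to hash a key column and use the hash to pick an AMP. The sketch below assumes that approach and uses CRC32 purely as a stand-in hash; the function names, column names, and AMP count are hypothetical, and Teradata's own hashing scheme is not reproduced here. It also shows why the choice of key matters: a key with a single value sends every row to one AMP.

```python
import zlib


def amp_for_key(key, amp_count):
    """Map a key value to an AMP number with a stand-in hash (CRC32)."""
    return zlib.crc32(str(key).encode("utf-8")) % amp_count


def distribute(rows, key_column, amp_count):
    """Place each row on the VDISK of the AMP chosen by its key."""
    vdisks = {amp: [] for amp in range(amp_count)}
    for row in rows:
        vdisks[amp_for_key(row[key_column], amp_count)].append(row)
    return vdisks


def rows_per_amp(vdisks):
    """Count how many rows each AMP owns, to spot skew."""
    return {amp: len(rows) for amp, rows in vdisks.items()}


rows = [{"order_id": i, "country": "US"} for i in range(1000)]
by_order_id = rows_per_amp(distribute(rows, "order_id", amp_count=4))
by_country = rows_per_amp(distribute(rows, "country", amp_count=4))
print(by_order_id)  # many distinct keys: rows spread over the AMPs
print(by_country)   # one distinct key: every row lands on a single AMP
```

Because each row's AMP depends only on its key, the same key always maps to the same AMP, and each AMP can work on its own rows without consulting the others.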
As we’ve seen, Teradata’s architecture is what lets it ingest petabyte-scale data and process it at high speed, and it is what has kept Teradata a market leader for so many years. Understanding this architecture is very important for Developers, Application Designers and Solution Architects alike, both to ensure efficient querying and to avoid issues like skewed data distribution and running out of spool space. Remember, to build a cost-effective, highly scalable and efficient application on Teradata, a solid understanding of its unique architecture is a must.