What is Data Ingestion?

 

Businesses today gather large volumes of data, both structured and unstructured, in an effort to use that data to discover real-time or near real-time insights that inform decision making and support digital transformation.

Data ingestion is defined as the process of importing, transferring, loading, and processing this data for later use or storage in a database. It involves connecting to various data sources, extracting the data, and detecting changes in it. Data ingestion subsystems must fetch data from a variety of sources, such as relational databases (RDBMS), web logs, application logs, streaming data, social media sites, etc.

Data can be ingested using three different methods: batch, real-time, and streaming.

Batch ingestion is an efficient way of processing large volumes of data when a large set of transactions accumulates over time. Data is collected, entered, and processed, and the batch results are then produced with tools such as Hadoop.
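
As a rough illustration, the sketch below shows the batch pattern in plain Python: records that have accumulated in a CSV extract are loaded into a local SQLite table in a single pass. The file name, table, and column names are hypothetical stand-ins for a real extract and warehouse.

```python
# A minimal, generic sketch of batch ingestion: records accumulate over time
# and are loaded in one pass. The CSV columns ("id", "amount") and the target
# table are assumptions made for illustration.
import csv
import sqlite3

def ingest_batch(csv_path: str, db_path: str = "warehouse.db") -> int:
    """Load an accumulated CSV extract into a local table in a single batch."""
    with open(csv_path, newline="") as f:
        rows = [(r["id"], r["amount"]) for r in csv.DictReader(f)]

    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS transactions (id TEXT, amount REAL)")
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    conn.commit()
    conn.close()
    return len(rows)

# ingest_batch("daily_transactions.csv")  # hypothetical nightly extract
```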

Real-time ingestion, on the other hand, demands continual input, processing, and output of data. Data must be processed within a short window (in real time or near real time), moving it into big data systems as soon as it arrives.

With streaming data, incoming data flows are processed immediately as they arrive. This type of ingestion is often used for predictive analytics.
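
The contrast with batch is easier to see in code. The following minimal sketch simulates a streaming source and processes each event the moment it arrives; the event generator and sensor fields are stand-ins for a real feed such as a Kafka topic or a log tail.

```python
# A minimal sketch of streaming ingestion: each record is handled the moment it
# arrives rather than being accumulated first. The event source is simulated.
import time
from typing import Iterator

def event_stream() -> Iterator[dict]:
    """Stand-in for a real feed (Kafka topic, socket, log tail, etc.)."""
    for i in range(5):
        yield {"sensor": "temp-1", "reading": 20.0 + i}
        time.sleep(0.1)  # simulate gaps between arrivals

def process(event: dict) -> None:
    # In practice this might score a predictive model or update a dashboard.
    print(f"ingested {event['sensor']} -> {event['reading']}")

for event in event_stream():
    process(event)  # handled immediately, one record at a time
```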

[Image: data ingestion example]

Challenges to Data Ingestion

 

Gathering data from multiple sources and in different forms for business use cases presents a variety of challenges in data ingestion. These challenges can include: multiple source ingestion, managing streaming/real-time data, speed of ingestion, and change detection.

Multiple Source Ingestion – This challenge arises when you have to manage many sources and decide which data to include in your data warehouse. Organizations can generate an enormous amount of data during any given product or service life cycle. These datasets can include customer data, vendor data, product data, and asset information, to name just a few.
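
A hedged sketch of what this can look like in practice: records from two hypothetical sources (a CSV export and a JSON-lines feed) are normalized into one common shape before loading. The field names and file formats are illustrative, not prescriptive.

```python
# A minimal sketch of multi-source ingestion: pull from heterogeneous sources
# and normalize into one schema. All paths and field names are hypothetical.
import csv
import json

def from_crm(path: str):
    """Customer records exported as CSV."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {"source": "crm", "id": row["customer_id"], "name": row["name"]}

def from_vendor_feed(path: str):
    """Vendor records delivered as JSON lines."""
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            yield {"source": "vendor", "id": rec["vendor_id"], "name": rec["vendor_name"]}

def ingest_all(*generators):
    """Merge every source into one list ready for loading into the warehouse."""
    return [record for gen in generators for record in gen]

# records = ingest_all(from_crm("customers.csv"), from_vendor_feed("vendors.jsonl"))
```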

Managing Streaming/Real-time Data – This ingestion challenge occurs when managing data coming from sources such as log files, eCommerce purchases, or information from social networks. Data stream management systems (DSMS) have been developed to help manage continuous data streams. They resemble database management systems (DBMS); however, instead of operating on static data, a DSMS executes a continuous query that is not run just once but is permanently installed. Since most DSMSs are data-driven, a continuous query produces new results for as long as new data arrive at the system.
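
The continuous-query idea can be sketched in a few lines of Python: the query is installed once and keeps emitting updated results as events arrive. The one-minute tumbling window and the simulated purchase timestamps below are assumptions made purely for illustration.

```python
# A minimal sketch of a standing ("continuous") query: a running count of
# purchases per one-minute tumbling window, updated as each event arrives.
from collections import defaultdict

window_counts = defaultdict(int)  # window start (epoch minute) -> purchase count

def continuous_count(event: dict) -> tuple[int, int]:
    """Update the standing query with one event and return the latest result."""
    window = event["timestamp"] // 60  # tumbling one-minute window
    window_counts[window] += 1
    return window, window_counts[window]

# New results keep appearing for as long as new data keeps arriving.
for ts in (0, 12, 47, 63, 70, 125):  # simulated event timestamps (seconds)
    window, count = continuous_count({"timestamp": ts, "type": "purchase"})
    print(f"window {window}: {count} purchases so far")
```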

Speed of Ingestion – Data sources deliver data at varying frequencies. Comment forum posts, for example, add up to large data sets but arrive at a low frequency, whereas tweets are small volumes of data that arrive at a high frequency and require more rapid ingestion. A number of platforms now exist to process big data, including advanced SQL (sometimes called NewSQL) databases that adapt SQL to handle larger volumes of structured data with greater speed, and NoSQL platforms that range from file systems to document or columnar data stores and typically dispense with the need for modeling data.
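
One common way to keep up with high-frequency sources is micro-batching: buffer the small, frequent events and flush them to the store in chunks. The sketch below is a generic illustration of that idea, with a print statement standing in for the real sink and a deliberately tiny batch size.

```python
# A minimal sketch of micro-batching for high-frequency events: buffer small
# records and flush them in chunks so the ingest layer keeps pace.
class MicroBatcher:
    def __init__(self, max_size: int = 500):
        self.max_size = max_size
        self.buffer: list[dict] = []

    def add(self, event: dict) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.max_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            # A real implementation would write to a datastore here.
            print(f"writing {len(self.buffer)} events to the store")
            self.buffer.clear()

batcher = MicroBatcher(max_size=3)
for i in range(7):  # e.g., a burst of tweets
    batcher.add({"id": i, "text": f"tweet {i}"})
batcher.flush()  # drain whatever is left in the buffer
```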

Change Detection – Detecting and capturing changed data is a significant problem in data ingestion. The problem has been studied extensively for scalar and multivariate data streams, but it has been largely neglected in big data settings.
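
A simple, hedged illustration of change detection is snapshot comparison: hash each row and ingest only the rows whose hash differs from the previous snapshot. Production change data capture tools typically read the database's transaction log instead, but the sketch shows the core idea.

```python
# A minimal sketch of change detection via row hashing: compare the previous
# snapshot of a table with the current rows and ingest only inserts and updates.
# Deletes are not handled here; real CDC tools usually read the transaction log.
import hashlib
import json

def row_hash(row: dict) -> str:
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

def detect_changes(previous: dict[str, str], current_rows: list[dict]):
    """Yield (key, row) for every new or modified row since the last snapshot."""
    for row in current_rows:
        key = str(row["id"])
        digest = row_hash(row)
        if previous.get(key) != digest:
            yield key, row
        previous[key] = digest

snapshot: dict[str, str] = {}
rows_v1 = [{"id": 1, "price": 10}, {"id": 2, "price": 20}]
rows_v2 = [{"id": 1, "price": 12}, {"id": 2, "price": 20}, {"id": 3, "price": 5}]
print(list(detect_changes(snapshot, rows_v1)))  # everything is new
print(list(detect_changes(snapshot, rows_v2)))  # only id 1 (changed) and id 3 (new)
```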

Tools for Data Ingestion

 

There are many tools available today for data ingestion, but below we'll preview five of the most popular.

Apache Flume – Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.
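
Flume agents are normally configured through properties files rather than code, so the sketch below is not Flume's API. It only illustrates the source, channel, and sink pattern a Flume agent follows, using an in-memory queue as the channel and a local file standing in for HDFS.

```python
# Not Flume's API: a Python sketch of the source -> channel -> sink pattern a
# Flume agent uses. A queue plays the channel; a local file stands in for HDFS.
from queue import Queue

def tail_source(lines: list[str], channel: Queue) -> None:
    """Source: pushes each incoming log line onto the channel."""
    for line in lines:
        channel.put(line)

def file_sink(channel: Queue, path: str) -> None:
    """Sink: drains the channel and appends events to the target file."""
    with open(path, "a") as out:
        while not channel.empty():
            out.write(channel.get() + "\n")

channel: Queue = Queue()  # the channel buffers events between source and sink
tail_source(["GET /index 200", "GET /cart 500"], channel)
file_sink(channel, "events.log")
```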

Apache Sqoop – Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Netezza, Oracle, Teradata, Postgres, MySQL, and HSQLDB.
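
Sqoop imports are usually run straight from the shell; the snippet below simply drives one such command from Python so the example stays in a single language. The JDBC URL, credentials, table, and HDFS target directory are all hypothetical, and it assumes Sqoop and a Hadoop cluster are available.

```python
# A sketch of launching a Sqoop import from Python. All connection details are
# hypothetical; normally this command is run directly from the shell.
import subprocess

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/shop",   # hypothetical source database
        "--username", "etl",
        "--password-file", "/user/etl/.sqoop_pw",  # keeps the password off the command line
        "--table", "orders",
        "--target-dir", "/data/raw/orders",        # hypothetical HDFS landing directory
    ],
    check=True,
)
```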

DataTorrent – DataTorrent is built on top of Hadoop 2.0 and allows companies to process massive amounts of data in real time. It offers DataTorrent RTS, a production-grade platform that is designed to ingest, transform, and analyze every data type that is generated in real-time to give users the insights they need. The company’s platform can be deployed on-premises or in the cloud.

Amazon Kinesis – Amazon Kinesis is an Amazon Web Services (AWS) offering for processing big data in real time. It is capable of processing hundreds of terabytes per hour of streaming data from sources such as financial transactions, operating logs, and social media feeds.
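
For example, a producer can push records into a Kinesis data stream with the boto3 SDK's put_record call. The stream name, region, and event fields below are assumptions; error handling and batched writes (put_records) are omitted for brevity.

```python
# A minimal sketch (not an official AWS example) of pushing one record into a
# Kinesis data stream with boto3. Requires AWS credentials to be configured.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is an assumption

def ingest_event(event: dict) -> None:
    """Push one event into a hypothetical 'clickstream' Kinesis stream."""
    kinesis.put_record(
        StreamName="clickstream",                       # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),         # records are raw bytes
        PartitionKey=str(event.get("user_id", "anon")), # determines the target shard
    )

ingest_event({"user_id": 42, "action": "purchase", "amount": 19.99})
```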

Gobblin – Gobblin is a distributed big data integration framework (ingestion, replication, compliance, retention) for batch and streaming systems. Gobblin features integrations with Apache Hadoop, Apache Kafka, Salesforce, S3, MySQL, Google and more.

 

At Intersys, our large group of highly skilled consultants is exceptionally qualified in many different data ingestion tools and use cases. With our expertise, we can implement or support all of your big data ingestion requirements and help your organization on its journey toward digital transformation.