What is data ingestion?

Data ingestion is the process of moving data, especially unstructured data, from different sources into a system such as Hadoop where it can be stored and analyzed.

Types of data ingestion:

  • Real-time Streaming
  • Batch Data Ingestion

 

Real-time Streaming

Here, events coming from any GUI can be ingested in the form of Kafka messages. The Kafka producer is where we define the topic to which messages for a configured event are published. Those messages can then be consumed inside a Spark Structured Streaming job using the code below, and after transforming them we can sink the output to either S3 or HDFS.
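As a rough illustration of the producer side, a minimal Kafka producer in Scala could look like the sketch below. The broker address, topic name, and event payload are assumptions for illustration, not values from this post.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Minimal producer sketch: broker address and topic name are placeholders.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Publish one event (here a JSON string) to the configured topic.
producer.send(new ProducerRecord[String, String]("events", """{"eventType":"click","userId":"42"}"""))
producer.close()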

 

Sample code for this Structured Streaming job –

a) Method to generate a Dataset by reading the Kafka topic (messages) –
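A minimal sketch of such a read method, assuming a local broker; the method name and topic parameter are illustrative:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Returns a streaming DataFrame of raw Kafka records for the given topic.
def readKafkaStream(spark: SparkSession, topic: String): DataFrame =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
    .option("subscribe", topic)
    .option("startingOffsets", "latest")
    .load()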

b) Method to write output data –
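A sketch of the sink method, writing Parquet to S3 or HDFS; the output path and checkpoint location are placeholders:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

// Sinks the transformed stream as Parquet files on S3 or HDFS.
def writeOutput(df: DataFrame, outputPath: String, checkpointPath: String): StreamingQuery =
  df.writeStream
    .format("parquet")
    .option("path", outputPath)                   // e.g. s3a://bucket/raw/ or hdfs:///raw/
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .start()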

c) Method to transform the Kafka message –
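A sketch of the transformation, assuming the Kafka payload is a JSON string; the event schema below is an assumption for illustration:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Kafka delivers the payload as binary; cast it to a string and parse the JSON.
val eventSchema = new StructType()
  .add("eventType", StringType)
  .add("userId", StringType)

def transformMessages(raw: DataFrame): DataFrame =
  raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), eventSchema).as("event"))
    .select("event.*")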

d) Run method in full –
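Putting the pieces together, a run method could wire up the sketches above; the topic name and output paths are placeholders:

import org.apache.spark.sql.SparkSession

// Read from Kafka, transform, and sink to S3/HDFS until the query is stopped.
def run(): Unit = {
  val spark = SparkSession.builder().appName("KafkaIngestion").getOrCreate()
  val raw = readKafkaStream(spark, "events")
  val transformed = transformMessages(raw)
  val query = writeOutput(transformed, "s3a://my-bucket/raw/events", "s3a://my-bucket/checkpoints/events")
  query.awaitTermination()
}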

Technology Stack used –

  • Apache Kafka
  • Apache Spark
  • Amazon S3
  • HDFS

Batch Data Ingestion

Batch data ingestion follows a typical ETL process: we pick up different types of files from a specified location and dump them into a raw location on HDFS or S3.

Staging is the next layer, where semi-processed data is stored; for example, any de-duplication happens here. It is essentially a cleaning step that keeps the data in a semi-transformed state.

Finally, in the core layer, we store fully processed data, i.e. staged data to which all the business logic has been applied.

For ingesting data from databases we have Apache Sqoop, which connects to a database and ingests the data into S3 or HDFS. For files, we can schedule a cron job to pick them up from a specified FTP server or location.

 

a) Sqoop command
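A typical Sqoop import, with the connection string, credentials, table name, and target directory shown here as placeholder values, might look like:

sqoop import \
  --connect jdbc:mysql://db-host:3306/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4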

 
