Spark RDD vs Spark SQL Performance comparison using Spark Java APIs

The Resilient Distributed Dataset (RDD) is the main abstraction of the Spark framework, while Spark SQL (a Spark module for structured data processing) gives Spark more information about the structure of both the data and the computation being performed, and uses that information to perform additional optimizations. Up until Spark 1.6, RDDs used to perform …

Driven by Big Data – Governance & Security

My definition of this topic is “a necessary component of any big data solution with a perfect blend of skepticism and confidence by all of those involved”. Think of this not just as securing a file with permissions and capturing its name and glossary, but as designing sustainable architectures. Our recent blog “Make Data your BAE …

Driven by Big Data – Beyond Enterprise Search

Enterprise Search is yet another rapidly growing ecosystem. Combining state-of-the-art technologies has redefined what “search” means to an organization and its global customers. Being one of the prominent themes in the services business, building a mature knowledge management architecture has always …

Observer Design Pattern – Java

When we write data validations, credit card validations, phone number validations, and string validations, we mostly choose third-party common utilities, mainly the libraries from Apache. This is to avoid reinventing the wheel; technically we call it “CODE REUSE” or just “REUSABILITY.” When it comes to Design Patterns, the way I think of it is, it’s still …

Driven by Big Data – IT Transformation Strategy

IT transformation is inevitable, and the technology refresh cycle is becoming more and more aggressive and competitive. Open source has gained the trust not only of public sector enterprises but also of more regulated businesses and organizations. The CIO’s office is constantly pushing for more innovative ideas and cost savings, and auditing its existing systems. Their guiding principles focus on evaluating open …

Driven by Big Data – Blockchain and Device Democracy

The whole concept of a decentralized distributed database, a shared ledger, and a singleton computation framework makes Blockchain one of the most prominent technological discoveries of today. It is easy to relate Blockchain to Bitcoin, as Bitcoin was an early adopter and a starting point for various organizations to conduct proof-of-concepts and build private networks of …

Driven by Big Data – Design Patterns

The Big Data ecosystem is a never-ending list of open source and proprietary solutions, and in my view, nearly all of them share common roots and fundamentals with the good old platforms that we grew up with. With that as the basis, our topic for today is architecture and design patterns in the Big Data …

My First Week in Big Data

Musings of a Java Dude. Written by: Vinodh Thiagarajan, Sr. Java Consultant

I am a developer and spend all of my time on non-Big Data work. That is why I felt somewhat lost when I entered the Intersys premises. My mission is to become a Hadoop Developer within a short time period along with a …

Processing and Serving Data with Apache Spark

Written by: Edward Yeh, Principal Big Data Consultant

So what is Apache Spark, and why do we care? Spark is a fast, general-purpose cluster computing system used for large-scale processing of both structured and unstructured data. The project was initially developed by the AMPLab at UC Berkeley and has now evolved …