Big Data Challenge

Understand Big Data Problems and How to Solve Them

In a previous article we learned about the key characteristics of Big Data. It is now time to understand the challenges or problems that come with a big data set. So in this lesson, let’s take a sample big data challenge, analyze it, and see how we can arrive at a solution together. Ready?

Example of a big data challenge

Imagine you work at one of the major exchanges, like the New York Stock Exchange or Nasdaq, and one morning someone from your risk department stops by your desk. He asks you to calculate the maximum closing price of every stock symbol ever traded on the exchange since its inception. Also assume the size of the data set you were given is one terabyte.

So your data set would look like an Excel sheet with rows of data.

Each line in this data set is information about a stock for a given date. Immediately, the business user who gave you this problem asks for an ETA (estimated time of arrival), that is, when he can expect the results. Wow! There is a lot to think about here. So you ask him to give you some time, and you start to work.
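
To make the layout concrete, a few records might look like this. This is a purely hypothetical format with made-up symbols (symbol, date, open, high, low, close, volume); the real file’s columns may differ:

    ABCSE,2010-01-20,55.39,56.25,54.89,55.93,5279400
    ABCSE,2010-01-21,55.75,56.10,55.02,55.12,6821300
    XYZSE,2010-01-20,121.10,123.45,120.87,122.90,2150700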

What would be your next steps?

You have two things to figure out. The first one is storage and the second one is computation. Let’s talk about storage first. Your workstation has only 2 GB of free space, but the size of the data set is one terabyte.

So, you go to your storage team and ask them to copy the data set onto a server, or even onto a NAS (Network Attached Storage) or SAN (Storage Area Network) device. Once the data set is copied, you ask them to give you its location. Both a NAS and a SAN are attached to your network.

Any computer with access to the network can access the data, provided it has permission to see the data. So far, so good: the data is stored and you have access to it.

Now you set out to solve the next problem which is computation.

You’re a Java programmer, so you write an optimized Java program to parse the data set and perform the computation. Everything looks good, and you are now ready to execute the program against the data set.
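
As a rough sketch, and assuming the hypothetical CSV layout shown earlier (symbol in the first column, closing price in the sixth), the single-machine version of that program might look something like this:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Single-machine sketch: scan the file once and keep the running maximum
    // closing price per symbol in a HashMap. The file path is passed as the first argument.
    public class MaxClosePriceLocal {
        public static void main(String[] args) throws IOException {
            Map<String, Double> maxClose = new HashMap<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(",");            // symbol,date,open,high,low,close,volume
                    double close = Double.parseDouble(fields[5]); // closing price
                    maxClose.merge(fields[0], close, Math::max);  // keep the larger value per symbol
                }
            }
            maxClose.forEach((symbol, max) -> System.out.println(symbol + "\t" + max));
        }
    }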

Before you realize it, the business user who gave you this request stops by asking for an ETA. That’s an interesting question, isn’t it?

You start to think: what is the ETA for this whole operation to complete and come up with the result set from the program?

To work on the data set, it first needs to be copied from storage to working memory, or RAM.
So how long does it take to copy a one terabyte data set from storage? Let’s take a traditional hard disk drive that is connected to a laptop or workstation.

A hard disk drive has magnetic platters on which the data is stored. When you request a read, the head in the hard disk first positions itself over the platter and starts transferring the data from the platter to the head. The speed at which the data is transferred from the platter to the head is called the data access rate.

This is very important, so read carefully. The average read/write data access rate of a hard disk is usually about 122 megabytes per second.

So if you do the math, to read a 1 terabyte file from a hard disk drive, you need about 2 hours and 22 minutes. That is for an HDD connected directly to your workstation. When you transfer a file from a server, or from your SAN, you should also know the transfer rate of the hard disk drives in the NAS or SAN.

For now, we will assume it is the same as a regular HDD, which is 122 megabytes per second, and hence it would take about 2 hours and 22 minutes.
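
As a rough back-of-the-envelope check: treating one terabyte as 1,048,576 megabytes, 1,048,576 MB ÷ 122 MB/s ≈ 8,595 seconds, or roughly 143 minutes, which is in the same ballpark as the figure above (the exact number depends on whether you count a terabyte as 1,000 or 1,024 gigabytes).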

Now what about the computation time?

Since you have not executed the program even once, you cannot say for sure. Plus, your data comes from a storage server that is attached to the network, so you also have to consider the network bandwidth. With all that in mind, you give him an ETA of about 3 hours, but it could easily be over 3 hours since you’re not sure about the computation time.

Your business user is shocked to hear an ETA of three hours. So he asks the next question: can we get it sooner than three hours? Say, maybe in 30 minutes?

You know there is no way you can produce the results in 30 minutes. Of course, the business cannot wait for three hours either, especially in finance, where time is money, right?

So let’s work this problem together.

How can we calculate the result in less than 30 minutes?

Let’s break this down. The majority of the time spent calculating the result set will be attributed to two tasks. The first is transferring the data from storage, the hard disk drive, which is about two and a half hours, and the second is the computation time.

That is, the time for your program to perform the actual calculation. Let’s say it’s going to take about 60 minutes. It could be more, or it could be less.

I have a crazy idea. What if we replace the HDD with an SSD? SSD stands for solid-state drive. SSDs are a very powerful alternative to HDDs. An SSD does not have magnetic platters or heads; it has no moving components and is based on flash memory, so it is extremely fast.

Sounds great right?

So why don’t we use an SSD in place of the HDD? By doing that, we can significantly reduce the time it takes to read the data from storage. But here is the problem.

SSDs come at a price. They are usually five to six times more expensive than a regular HDD, although prices continue to go down. Given the data volumes we are talking about with respect to big data, they are not a viable option right now.

So for now we are stuck with hard drives.

Let’s talk about how we can reduce the computation time. Hypothetically, we think the program will take 60 minutes to complete; also assume your program is already optimized for execution.

So what can be done next? Any ideas?

I have a crazy idea. How about dividing the one terabyte data set into a hundred equal-sized chunks, or blocks, and having a hundred computers, or a hundred nodes, do the computation in parallel? In theory, this means you cut the data access time by a factor of a hundred and also the computation time by a factor of 100.

So with this idea, you can bring the data access time to less than two minutes and the computation time to less than one minute.
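
Roughly speaking, each node would now read only about one hundredth of the data, around 10,486 MB, and 10,486 MB ÷ 122 MB/s ≈ 86 seconds, so well under two minutes of read time per node. Likewise, if the single-machine computation were around 60 minutes, dividing it across 100 nodes would, in theory, bring it to well under a minute (ignoring any coordination overhead).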

That sounds great! It is a promising idea, let’s explore even further.

There are a couple of issues here.

If you have more than one chunk of your data set stored on the same hard drive, you cannot get a true parallel read, because there is only one head in the hard disk doing the actual read. But for the sake of argument, let’s assume you get a true parallel read, which means you have a hundred nodes trying to read data at the same time.

Now, assuming the data can be read in parallel, you will have 100 times 122 megabytes per second of data flowing through the network.

Imagine this: what would happen if every member of your family at home started streaming their favorite TV show or a movie at the same time, using the single internet connection at your home?

It would result in a very poor streaming experience with a lot of buffering. No one in the family could enjoy their show, right?

What you have essentially done is choke up your network. The combined download speed requested by all the devices exceeds the download speed offered by the internet connection, resulting in poor service.

This is exactly what will happen here when a hundred nodes try to transfer data over the network at the same time.
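
To put a number on it: 100 nodes × 122 MB/s is roughly 12,200 MB/s, or about 12 GB/s of aggregate traffic, while even a 10 gigabit-per-second network link tops out around 1.25 GB/s. The exact figures depend on your network, but the order-of-magnitude gap is the point: the shared storage network becomes the bottleneck.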

So how can we solve this?

Why do we have to rely on storage that is attached to the network? Why don’t we bring the data closer to the computation nodes? That is, why don’t we store the data locally on each node’s hard disk?

So you would store block 1 of the data on node 1, block 2 on node 2, and so on.

Something like this, purely as an illustration of the block-to-node assignment:
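
    Node 1   stores Block 1
    Node 2   stores Block 2
    Node 3   stores Block 3
    ...
    Node 100 stores Block 100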

Now we can achieve a true parallel read across our 100 nodes, and we have also eliminated the network bandwidth issue.

Perfect! That’s a significant improvement to our design, right?

Now let’s talk about something that is very important. How many of you have suffered data loss due to a hard disk failure? I myself have suffered it twice. It is not a fun situation, right?

I’m sure most of you have faced a hard drive failure at least once. So how can you protect your data from hard disk failures, data corruption, and so on?

Let’s take an example

Let’s say you have a photo of your loved ones and you treasure that picture. There is no way you can afford to lose it.

How would you protect it? You would keep copies of the picture in different places, right? Maybe one on your personal laptop, one copy in Google cloud, and one copy on your external hard drive.

You get the idea right?

So, if your laptop crashes, you can still get that picture from the cloud or your external hard drive.

So let’s do this. Why don’t we copy each block of data to two more nodes? In other words, we can replicate each block on two more nodes, so in total we have three copies of each block.

Say node one has blocks 1, 7 and 10, node two has blocks 7, 11 and 42, and node three has blocks 1, 7 and 10.

So if block 1 on node one is unavailable due to a hard disk failure or corruption of the block, it can easily be fetched from node three.

This means that nodes one, two and three must have access to one another, so they should be connected over a network.
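
To make the bookkeeping concrete, the cluster effectively needs a block-to-nodes map along these lines (illustrative only; the unnamed replica locations here are made up):

    Block 1  -> node 1, node 3, (one more node)
    Block 7  -> node 1, node 2, node 3
    Block 10 -> node 1, node 3, (one more node)
    Block 11 -> node 2, (two more nodes)
    Block 42 -> node 2, (two more nodes)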

Conceptually this is great, but there are some challenges in implementing it.

Let’s think about this. How does node one know that node three has block 1? Who decides that block 7, for instance, should be stored on nodes one, two and three?

First of all, who will break the one terabyte data set into a hundred blocks?

As you can see, this solution doesn’t look that easy, does it? And that’s just the storage part of it. Computation brings other challenges. Node one can only compute the maximum closing price from block one.

Similarly, node two can only compute the maximum closing price from block two. This brings up a problem because, for example, data for the stock GE can be in block one, can also be in block two, and could also be in block eighty-two, right?

So you have to consolidate the results from all the nodes to compute the final result.
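
To see what that consolidation step looks like, here is a minimal, hypothetical sketch in Java: assume each node has already produced a partial map of symbol to maximum closing price for the blocks it holds, and some coordinator merges them into the final answer.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical coordinator step: merge per-node partial results
    // (symbol -> max closing price seen in that node's blocks) into one final map.
    public class ResultMerger {
        public static Map<String, Double> merge(List<Map<String, Double>> partials) {
            Map<String, Double> finalMax = new HashMap<>();
            for (Map<String, Double> partial : partials) {
                // Keep the larger maximum whenever a symbol appears on several nodes.
                partial.forEach((symbol, max) -> finalMax.merge(symbol, max, Math::max));
            }
            return finalMax;
        }
    }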

So who’s going to coordinate all that?

The solution we are proposing is distributed computing, and as we can see, there are several complexities involved in implementing it, both at the storage layer and at the computation layer.

The answer to all these open questions and complexities is Hadoop. Hadoop offers a framework for distributed computing. Hadoop has two core components: HDFS and MapReduce.

HDFS stands for Hadoop Distributed File System, and it takes care of all your storage-related complexities: splitting your data set into blocks, replicating each block to more than one node, keeping track of which block is stored on which node, and so on.

MapReduce is a programming model; Hadoop implements MapReduce, and it takes care of all the computational complexities.

So the Hadoop framework takes care of bringing together the intermediate results from every single node to offer a consolidated output.
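
To give a feel for the programming model, here is a minimal sketch of what the mapper and reducer for the maximum closing price problem could look like using Hadoop’s Java MapReduce API. The CSV layout (symbol in the first column, closing price in the sixth) is the same hypothetical one used earlier, and the job driver/setup code is omitted.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxClosePrice {

        // Mapper: runs against the locally stored blocks and emits (symbol, close price).
        public static class MaxCloseMapper
                extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Assumed record layout: symbol,date,open,high,low,close,volume
                String[] fields = line.toString().split(",");
                context.write(new Text(fields[0]),
                              new DoubleWritable(Double.parseDouble(fields[5])));
            }
        }

        // Reducer: receives all close prices for one symbol and keeps the maximum.
        public static class MaxCloseReducer
                extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text symbol, Iterable<DoubleWritable> closes, Context context)
                    throws IOException, InterruptedException {
                double max = Double.NEGATIVE_INFINITY;
                for (DoubleWritable close : closes) {
                    max = Math.max(max, close.get());
                }
                context.write(symbol, new DoubleWritable(max));
            }
        }
    }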

So what is Hadoop?

Hadoop is a framework for distributed processing of large data sets across clusters of commodity computers. The last two words in that definition are what make Hadoop even more special: commodity computers. That means none of the hundred nodes we have in the cluster needs any specialized hardware. They are regular enterprise-grade server nodes, each with a processor, hard disk and RAM.

Don’t confuse commodity computers with cheap hardware. Commodity computers means inexpensive hardware, not cheap, low-quality hardware.

Now you know what Hadoop is and how it can offer an efficient solution to your maximum closing price problem against a one terabyte data set.

Now you can go back to the business and propose Hadoop to solve the problem and to achieve the execution time that your users are expecting.

If you propose a hundred-node cluster to your business, expect to get some crazy looks. But that’s the beauty of Hadoop: you don’t need to have a hundred-node cluster. We have seen successful Hadoop production environments ranging from small ten-node clusters all the way to hundred-node and even two-thousand-node clusters.

You can even start with a simple ten-node cluster, and if you want to reduce the execution time further, all you have to do is add more nodes to your cluster. It’s that simple; in other words, Hadoop scales horizontally.