The continuing growth of massive and diverse data volumes, and of
data-intensive applications, has created a need to find
effective means of data management across all sectors. According to a recent report,
businesses face a huge skill gap in the management of big data, with
the gap growing from 400 in 2007 to 4,000 in 2012 in the United Kingdom
alone. In addition to this, there is a general lack of understanding
among students of current data analytics processes, which are becoming
extremely important for future challenges with the growth of the
Internet of Things (IoT) and real-time data.
As a computer scientist, studying and building modeling and
simulation applications, I was initially perplexed as to the attraction
towards the term big data. Business seems to focus on Hadoop-related
software for data analytics, and having Hadoop-related projects on your
resume can be a bonus. As a teacher of cloud computing and software
engineering, I decided to assign two students Hadoop-related projects
for big data management with a "smart cities" focus, and interviewed
them about their learning objectives to see what they thought about the
technologies.
As a prerequisite, the students were given full freedom to examine
the topic of Hadoop big data processing, and asked to explore whichever
tools they wanted to in this area. Hadoop is a set of tools that supports
the running of big data applications with multiple job executions to
allow massive amounts of data to be processed quickly. It is an
environment for running MapReduce
jobs, which usually run in batch mode. Hadoop has become one of the
most important tools in science projects which require analyzing data.
Some of the Hadoop-related tools my students investigated included:
- Apache Ambari: A framework for managing and monitoring Hadoop clusters.
- Apache Pig: A platform for running code that analyzes large data sets.
- Apache Sqoop: A tool used for moving data between Hadoop and other data stores.
- Apache ZooKeeper: A tool used for providing synchronization and maintaining configuration information.
- Apache Spark: A newer tool used to run analysis faster on some types of data.
- Apache Flume: A system that gathers information that is later stored in HDFS.
- Apache Hive: A tool which allows users to use a SQL-like language to analyze data.
- Apache Oozie: A tool to start analysis jobs that have been broken into different parts in the correct sequence.
- Hadoop Distributed File System (HDFS): A framework for dividing data between nodes.
- HCatalog: A tool for uploading tables and managing data, which enables users to analyze data using different processing tools like Pig, Hive, and MapReduce.
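To make the MapReduce model behind several of these tools more concrete, here is a minimal sketch in plain Python. It does not use Hadoop at all; it only imitates the three conceptual steps (map, shuffle, reduce) on a single machine, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample lines are invented for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input lines, not taken from the students' datasets.
    lines = ["smart city data", "city traffic data", "smart offices"]
    print(run_job(lines))
```

In a real Hadoop cluster the map and reduce steps would be distributed across many nodes and the input would be read from HDFS; the sketch only shows the shape of the computation.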
After the students successfully finished their final year
dissertations, I asked them some questions to understand what they
learned from the experience. Here are the responses from both of my
students, Saudamini Sonalker and Rafiat Olubodun Kadiri, who were doing
independent experiments with Hadoop.
Why did you want to learn Hadoop? Just to learn something new, or were you influenced by industry interest in the project?
Saudamini: I was primarily motivated to work on this
topic after having read a book about big data by Viktor
Mayer-Schönberger and Kenneth Cukier: Big Data: A Revolution That Will Transform How We Live, Work, and Think.
The predictive nature of tools that assist big data processing is what
drew me to learning more about it. Concentrating on smart city data was
also an interesting element of this project. I want to learn and
understand more about how city data can be utilized to make cities
efficient, green, and smart.
Rafiat: I chose the topic of Hadoop because it is a
new area; it is a buzzword, and recently it has been dominating the
market. Different businesses make use of it, including social media
websites such as Twitter and Facebook, which use Hadoop to mine data for
different purposes, enabling them to make reasonable business decisions.
What do companies use big data for? What kind of questions are they using it to ask?
Saudamini: Companies use big data for numerous
purposes. Amazon utilizes it for recommendations, Skyscanner and Kayak
for adjusting flight prices by monitoring an individual's past searches,
and Google uses it to determine the order of search results. An
interesting use of big data was Amsterdam's Energy Atlas
project. It used energy consumption data from within the city to
promote renewable energy by making its citizens aware of their own
usage.
Rafiat: Different companies have different uses for
big data. How a company uses big data depends on what type of
service it provides to the public. Businesses like eBay and Amazon use
big data to predict what customers may want, based on
their previous purchase history and similar purchases by other customers.
What problems did you have when installing Hadoop while setting up the sandbox environment? What led you to choose Hortonworks Sandbox for your experiments?
Saudamini: I explored a couple of options before
deciding on Hortonworks Data Platform. The major reason for choosing it
was because it is open source and free. Other competitors like MapR,
Amazon Web Services and Cloudera, however good the platforms, were
expensive. However, there were strict memory requirements to set up the
sandbox. A 64-bit processor was necessary to access the sandbox via
virtual machine, and it required at least 4GB of RAM. This slowed the
process down for me, and the platform had no flexibility in terms of
requirements.
Rafiat:
There are quite a number of public Hadoop clusters that have been
designed for storing and analyzing large amounts of unstructured data in
a computing environment. They are available on cloud infrastructures
such as Heroku, Hortonworks Sandbox, Azure, and others.
After a few searches, I decided to use Hortonworks Data Platform, an
open source Apache Hadoop data platform. The system requirements
included a Windows or Mac operating system, at least 4GB of RAM, a
virtual machine environment, and a 64-bit chip that supports
virtualization.
The first step was to download a virtual machine, then download the
sandbox from the Hortonworks website. After this I connected to the
sandbox with the given IP address.
There were some negative aspects to using the Hortonworks sandbox for
research, some of which I still face. First, I was unable to access the sandbox with
the given IP address for a while, but after multiple attempts, it worked.
Second, the virtual machine slowed down my computer the moment it was
switched on, and queries took a long time to load.
Further, when my machine shuts off by itself before I can
shut down the virtual machine, the virtual machine comes up with
configuration errors the next time I switch it on,
which prevents me from accessing the sandbox. Another issue is that
some of the tools are sometimes inaccessible, which slows
down my research.
How does the Hortonworks Data Platform work?
Saudamini: The platform can be divided into three
layers: the data access layer, cluster resource management, and HDFS.
The data access layer is where the user uploads, catalogues, and manages
data; one uses this layer to enter their Hive/Pig jobs for the system
to perform. Cluster resource management (YARN) is an architectural hub
for data processing engines so multiple applications can be run on the
HDFS. This layer essentially works as a translator for the other two.
Finally, HDFS is where the MapReduce jobs are run in parallel between
the master and slave nodes.
Ambari is a web-based GUI that can speak to the underlying machinery and allows users to set up and manage a Hadoop cluster.
Rafiat: When accessing the sandbox, I was directed
to a page where I had access to different tools like Hive, File browser,
Pig, Job browser, and others. I could upload different types of files
(zip, csv, xml), then create tables with tools like Hive, Pig, and
HCatalog from the file that had been uploaded through the file browser
icon. I could then create queries to produce different types of tables
with different criteria to fit a requirement.
Ambari can be used to monitor and manage Hadoop clusters. It monitors
the outcome of the queries that have been carried out and shows the
effect of the queries on CPU usage, memory usage, network usage,
and so on.
What tools did you explore, and what were the new things you learned in the process?
Saudamini: Initially, I planned on exploring Pig and
Hive, but I had issues running the Pig script on Hortonworks Sandbox
and hence stuck with Hive. Hive Query Language is very similar to SQL,
so anyone proficient in the latter shouldn't
have an issue working with the tool. On Hortonworks Sandbox, Hive has a
graphical user interface called Beeswax. Hive converts queries you write
into MapReduce jobs. Whether or not one needs multiple options to
process data depends on the skill sets of the users working on a large
project. Hive diminishes the need to train or hire external resources in
order to fill in the gap. The flexibility is useful in scenarios like
those.
Rafiat: I used Hive, which uses an SQL-like
scripting language known as HiveQL. It is suitable for users
who are familiar with Structured Query Language. Additionally, I used Pig,
which is a high-level data analysis
layer on Hadoop. Its scripts are written in a language called Pig Latin.
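Both students note how close HiveQL is to SQL. The sketch below shows the kind of aggregate query they describe, but as a stand-in only: it runs against Python's built-in `sqlite3` module rather than Hive, and the table name `journeys`, its columns, and the placeholder rows are assumptions made up for this example. A query of this general shape would be compiled by Hive into MapReduce jobs; here SQLite simply executes it in memory.

```python
import sqlite3


def total_vehicles_by_borough(rows):
    """Sum vehicle counts per borough with a SQL GROUP BY query.

    SQLite is used purely as a local stand-in for a SQL-like engine;
    this is not Hive and does not launch any MapReduce jobs.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE journeys (borough TEXT, vehicles INTEGER)")
        conn.executemany("INSERT INTO journeys VALUES (?, ?)", rows)
        query = """
            SELECT borough, SUM(vehicles) AS total
            FROM journeys
            GROUP BY borough
            ORDER BY total DESC
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Placeholder rows, not real London traffic figures.
    sample = [("Borough A", 120), ("Borough B", 80), ("Borough A", 30)]
    for borough, total in total_vehicles_by_borough(sample):
        print(borough, total)
```

Someone comfortable writing the `SELECT ... GROUP BY` statement above would recognize most of what a comparable HiveQL query looks like, which is the point both students make.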
What kind of files did you process? Smart city datasets?
Saudamini: I concentrated on smart city data, specifically London traffic and social data.
Rafiat: Smart city data was used for these
experiments. Most of the data was retrieved from the ITU data statistics
website and the London data store website.
What were the goals of the experiments? What did you achieve?
Saudamini: The goal was to observe performance of
the underlying machinery and cluster loads. After processing different
big data files I compared results of CPU performance, cluster loads,
memory usage, and network usage.
Transport and social data were processed on the platform to check the
feasibility of implementing smart offices within London to reduce
traffic and save people's time. The hypothesis was that there would be a
correlation between high-traffic boroughs and the boroughs with the most work
destinations. Although that held up in most cases, these boroughs were
not in central London as initially imagined.
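The interview does not say how the correlation was measured. As one possible approach, and purely as an assumption, the sketch below computes a Pearson correlation coefficient between two per-borough series. The function name `pearson` and the placeholder numbers are invented for illustration and are not the student's data or results.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values: one entry per hypothetical borough.
    traffic_volume = [10, 20, 30, 40]
    work_destinations = [12, 18, 33, 41]
    print(round(pearson(traffic_volume, work_destinations), 3))
```

A value close to 1 would be consistent with the hypothesis that busier boroughs are also the ones with more work destinations; a value near 0 would not.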
Rafiat: The goal of the experiment was to analyze
sets of data retrieved from different sources, such as the ITU
(International Telecommunication Union) website, the London data store,
and public data sets on Amazon Web Services. The aim was to use volume
as one of the criteria when analyzing the data. By doing this,
the experiment would show how long it takes for the data
to be processed.
If you were given a project now for big data processing, how would you approach it?
Saudamini: If time is not a concern and price is an
issue, then I would recommend using Hortonworks Sandbox, as its
flexibility towards types of data source, its data processing tool options,
and its Ambari environment give a wholesome data management experience.
However, if time is of the essence and money is not a factor, then it would
be beneficial to look at other options which provide a similar user
experience in the cloud.
Rafiat: I would use Hortonworks Data Platform on a
separate machine dedicated to the platform, as my own machine was not
very high spec.
As a computer science student, do you think for data management we should always use tools like these?
Saudamini: If the dataset you are working with is
large, then I think it is advisable to use big data tools like these.
Their flexibility and quick processing make them ideal to be deployed as
solutions to smart city issues. However, I am not convinced that we
should always use them. We could actually try and avoid using these
tools if the dataset doesn't demand it. A lot of the analytical
functions can be done by other BI tools. Big data tools can have a steep
learning curve, and training users should be factored in while
deploying systems that utilize them.
Rafiat: Data management is a very important topic.
There are different advantages to managing data effectively as a
student, individual or organization. This includes preventing data
duplication, which will allow memory space to be saved. It allows
validation of results if need be. Data management allows proper
understanding of data, the use of queries to provide specific
information needed, so data can be understood easily.
In conclusion, we got mixed results on the use of tools to process big
data applications. An open Hadoop data platform seemed like the obvious
choice at the time. As previously described, MapReduce is at the core
of Hadoop, running on top of the Hadoop Distributed File System. Hortonworks Sandbox is equipped
with YARN, the second generation of MapReduce. It separates the two
important tasks, resource management and job scheduling, which makes the process more efficient. YARN supports
batch as well as real-time processing projects. The Hortonworks Data
Platform has the capability to adapt to the user’s existing data
architecture, which is a huge plus. In addition to the platform being
cost-free, efficient, and adaptable, it also has an extensive list of
tutorials and user-based guides on using the services it provides.
There are a lot of big data processing platforms available because
big data is the current buzzword. Most services, such as Amazon Web
Services, Cloudera, and MapR, charge the user depending
upon the traffic and amount of data they process. Cloudera’s website
claims, "The company’s enterprise data hub (EDH) software platform
empowers organizations to store, process and analyze all enterprise
data, of whatever type, in any volume—creating remarkable
cost-efficiencies as well as enabling business transformation."
The current move towards open data is generating massive amounts of
data, which need real-time processing and intelligent solutions to
handle them. Having more open source tools can fuel further
open data research, impacting not only computing but also the social sciences,
where economists and governments can make use of big data as well.