Computing, in its purest form, has changed hands multiple times. Near the
beginning, mainframes were predicted to be the future of computing. Indeed,
mainframes and large-scale machines were built and used, and in some
circumstances they are still used in much the same way today. The trend,
however, turned from bigger and more expensive machines to smaller and more
affordable commodity PCs and servers.
Today, most of our data is stored on local networks, on servers that may be
clustered and share storage. This approach has had time to mature into a
stable architecture, and it provides decent redundancy when deployed
correctly. A newer technology, cloud computing, has emerged demanding
attention and is quickly changing the direction of the technology landscape.
Whether it is Google's unique and scalable Google File System or Amazon's
robust S3 storage model, it is clear that cloud computing has arrived, with
much to be learned from it.
Hadoop Archives
HDFS stores small files inefficiently, since
each file is stored in a block, and block metadata is held in memory by the
namenode. Thus, a large number of small files can eat up a lot of memory on the
namenode. (Note, however, that small files do not take up any more disk space
than is required to store the raw contents of the file. For example, a 1 MB
file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.)
Hadoop Archives, or HAR files, are a file archiving facility that packs files
into HDFS blocks more efficiently, thereby reducing namenode memory usage while
still allowing transparent access to files. In particular, Hadoop Archives can
be used as input to MapReduce.
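As a rough illustration of how transparent that access is, the sketch below
lists the contents of an archive through the standard Hadoop FileSystem API
using the har:// scheme. The archive name files.har and the paths /user/data
and /user/archives are hypothetical, and the archive is assumed to have been
created beforehand with the hadoop archive command.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHarContents {
  public static void main(String[] args) throws IOException {
    // Assumes an archive was created earlier, e.g. with:
    //   hadoop archive -archiveName files.har -p /user/data /user/archives
    // (archive name and paths are hypothetical)
    Path har = new Path("har:///user/archives/files.har");
    FileSystem fs = har.getFileSystem(new Configuration());
    // The archive is exposed as an ordinary directory tree to clients.
    for (FileStatus status : fs.listStatus(har)) {
      System.out.println(status.getPath());
    }
  }
}

Because a HAR behaves like any other Hadoop filesystem, the same har:// path
can also be passed to FileInputFormat.addInputPath() to feed the archived
files into a MapReduce job.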
Programming Model
The computation takes
a set of input key/value pairs, and produces a set of output key/value pairs.
The user of the MapReduce library expresses the computation as two functions:
Map and Reduce. Map, written by the user, takes an input pair and produces a
set of intermediate key/value pairs. The MapReduce library groups together all
intermediate values associated with the same intermediate key I and passes them
to the Reduce function. The Reduce function, also written by the user, accepts
an intermediate key I and a set of values for that key. It merges together
these values to form a possibly smaller set of values. Typically just zero or
one output value is produced per Reduce invocation. The intermediate values are
supplied to the user's reduce function via an iterator. This allows us to
handle lists of values that are too large to fit in memory.
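The canonical example of this model is word counting: the map function emits
each word with a count of 1, and the reduce function sums the counts received
for each word. Below is a minimal sketch using the Hadoop MapReduce Java API;
the class names and the input/output paths taken from the command line are
illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each word in an input line, emit the pair (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum all counts emitted for a given word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Here the intermediate key is the word itself and the intermediate values are
the individual counts; the framework groups them and hands them to the
reducer as an Iterable, so the list of values for a key never has to fit in
memory at once.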