
GC'12 Workshop: First International Workshop on Management and Security Technologies for Cloud Computing 2012

Incorporating Hardware Trust Mechanisms in Apache Hadoop
To Improve the Integrity and Confidentiality of Data in a Distributed Apache Hadoop File System: An Information Technology Infrastructure and Software Approach

Jason C. Cohen
Towson University / Hewlett-Packard Company
Department of Computer and Information Sciences
Towson, USA
jason.c.cohen@hp.com

Dr. Subrata Acharya
Towson University
Department of Computer and Information Sciences
Towson, USA
sacharya@towson.edu

Pairing Apache Hadoop distributed file storage with hardware based trusted computing mechanisms has the potential to reduce the risk of data compromise. With the growing use of Hadoop to tackle big data analytics involving sensitive data, a Hadoop cluster could be a target for data exfiltration, corruption, or modification. By implementing open standards based Trusted Computing technology at the infrastructure and application levels, a novel and robust security posture can be established. An overview of the technologies involved, a description of the proposed infrastructure, and potential software integrations are discussed.

Keywords: Hadoop; HDFS; Trusted Computing

I. INTRODUCTION

Apache Hadoop, a distributed computation and storage system, provides a very compelling framework for organizations dealing with "Big Data" and computationally intense problems that fit into the MapReduce framework it provides. Hadoop works by distributing data and processing to a network of "commodity" servers, allowing large, computationally difficult data to be stored and analyzed in an efficient way [1]. For certain application domains, the Hadoop framework can provide a more efficient and cost effective approach than other available commercial solutions, particularly with the ability to outsource and distribute processing and storage resources. MapReduce, a programming concept for taking a large problem, breaking it down into smaller problems, and distributing those small units and their associated data, is the core concept of Hadoop [1]. Taking the processing to the data, by executing code where the data actually resides, eliminates the data throughput restrictions of remotely or centrally located data. As such, it provides an efficient and elegant way to deal with operations on large datasets. Unfortunately, security of data within the Hadoop file system (HDFS), security of data in transit, and authentication services were not a core concern during Hadoop's development. There have been recent efforts to increase the security robustness of Hadoop, with much work coming out of Yahoo, a major user of and contributor to Hadoop [2]. Even with these enhancements, it is still recommended that Hadoop be limited to use on a "trusted network" [2]. With the prevalence of cloud outsourcing, external hackers, insider threats, as well as legislation requiring organizations to take efforts to protect certain data, a need arises for a better security framework for protecting data within HDFS. Although securing a Hadoop environment can be done in a number of ways, the authors' research will leverage the Trusted Platform Module (TPM), and technology based on the Trusted Computing Group (TCG) standards, to increase the security posture of a Hadoop instantiation. Using these technologies, as well as traditional defenses, an IT architecture can be formulated that provides increased assurances that data residing within an HDFS cluster, the state of the supporting operating system, and the Hadoop software have not been compromised. Further, by integrating relevant TCG technology into the Hadoop application software layer, a more robust security posture can be achieved. To summarize, we will present a motivation for the integration of trusted computing in Hadoop, a technology background, an example trusted computing enabled IT architecture, possible software integrations of trusted computing in HDFS, and finally a discussion of preliminary results and real world challenges.

II. MOTIVATION

Given a distributed Hadoop cloud spanning datacenters, remote partners, or potentially untrusted sites, how can one be sure that the confidentiality and integrity of the data, as well as the integrity of the Hadoop binaries and underlying operating systems, is preserved? In other words, how can one "trust" the platform to which one is submitting sensitive data? How can one be sure that the said data is not altered when the system is offline?

"In a nutshell, Hadoop has strong support for authentication and authorization (via optional Kerberos, tokens, and passwords). On the other hand privacy and data integrity is optionally supported when Hadoop services are accessed through RPC and HTTP, while the actual HDFS blocks are transferred unencrypted. Hadoop assumes network involved in HDFS block transfer is secure and not publicly accessible for sniffing" [3]. With these limitations, and the potential of Hadoop to address problem sets involving
sensitive data, including protected information such as health informatics, defense, and personal data, there arises a need to address the security of a Hadoop IT infrastructure. This problem escalates when the Hadoop environment is distributed (i.e., between remote datacenters, or outsourced). An example may be a hospital system that shares a data processing environment with other centers, or a group of universities where each manages part of a Hadoop cloud within their respective boundaries. Imagine a Hadoop cloud containing sensitive patient data protected under HIPAA. This cloud might be used to conduct analytics, insurance processing, medical image analysis, etc. There exists an obligation to protect this data beyond the standard mechanisms of the Hadoop framework. Defense systems may contain sensitive or classified files for analysis. Although these systems may have physical security and isolation, insider and advanced persistent threats remain an issue.

Although Hadoop makes an effort to ensure data is safe from corruption and stored in a redundant fashion, like most software it can only be trusted to the extent of the systems and architecture it executes on and the people that have access to it, both officially and unofficially. This means a Hadoop node compromised by a malicious outsider or insider threat could alter data outside of any checks within the software. This is due to the fact that HDFS is a virtual file system, and actual data chunks are distributed across real file systems. Someone with access to data from a Hadoop NameNode, which contains metadata about files, and to Hadoop tools and/or source code could have the ability to hunt down data in the cluster and alter it. Additionally, one could replace the Hadoop Java packages with new packages that relay, alter, or otherwise corrupt data. These attacks can be conducted while a node is offline by a malicious insider, and the system restored without any knowledge of the event. Furthermore, on a distributed network, an attacker may be able to monitor traffic and conduct man-in-the-middle or replay attacks on the data between client and data nodes. The mechanisms provided by commercial trusted computing technologies can go a long way towards mitigating these possibilities.

III. TECHNOLOGY BACKGROUND

A. The Hadoop File System Overview

Securing the data element of a Hadoop infrastructure with hardware rooted trust is the core idea of the proposed IT architecture. The Hadoop Distributed File System is a virtual file system, written in Java, which distributes chunks of files across "DataNodes". The chunks, called "blocks", are stored on the physical file system of each node in a configurable location. HDFS can be configured to keep redundant copies of each block, to avoid loss of data in the event of a node (or multiple node) failure [1]. Hadoop has provisions to be "rack aware" and can be configured to ensure that data is spread between multiple racks (and potentially multiple geographic locations). DataNodes do not know much about the logical files as a whole. Instead, they keep metadata about each chunk, including a CRC value for the chunk and identifier information. HDFS also has another role called the "NameNode". The NameNode is the master of the HDFS cluster, and contains a database of the file system namespace, mapping logical files to physical blocks on the DataNodes. In some versions of Hadoop, this is a single point of failure, and in any case it represents the most sensitive part of the HDFS architecture. This database is called the "FsImage". Whenever an HDFS client requests a file or performs other operations, the NameNode orchestrates these requests by providing the appropriate mappings, which the DataNodes then service by providing the block and CRC to the client. The NameNode also manages block replication. The NameNode manages changes to this image via a transaction log called the EditLog. This EditLog is periodically merged into the FsImage. Also, to keep records up to date, a DataNode sends a report of all the data blocks it contains to the NameNode on each start up [4].
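To make the per-chunk CRC idea concrete, here is a small sketch that computes a CRC-32 over each fixed-size chunk of a local block file. This is illustrative only: the 512-byte chunk size mirrors HDFS's default bytes-per-checksum setting, but the ChunkChecksummer class is ours, and Hadoop's actual checksum code differs in detail.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Illustrative only: computes a CRC-32 per fixed-size chunk of a local block file,
// mimicking the kind of per-chunk integrity metadata a DataNode maintains.
public class ChunkChecksummer {
    private static final int BYTES_PER_CHECKSUM = 512; // assumed chunk size

    public static List<Long> checksumBlockFile(String path) throws IOException {
        List<Long> checksums = new ArrayList<>();
        byte[] chunk = new byte[BYTES_PER_CHECKSUM];
        try (FileInputStream in = new FileInputStream(path)) {
            int read;
            while ((read = in.read(chunk)) > 0) {
                CRC32 crc = new CRC32();
                crc.update(chunk, 0, read); // checksum covers only the bytes read
                checksums.add(crc.getValue());
            }
        }
        return checksums;
    }

    public static void main(String[] args) throws IOException {
        for (long c : checksumBlockFile(args[0])) {
            System.out.printf("%08x%n", c);
        }
    }
}
```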
B. The Trusted Computing Group

The Trusted Computing Group is a not-for-profit industry consortium that creates open standards around hardware enabled trusted computing. The hardware enabling trusted computing is referred to as a Trusted Platform Module (TPM), and is designed as a commodity chip that is integrated into motherboards from Intel and AMD, as well as appliances such as network switches, firewalls, and embedded devices [5].

In short, the TPM has the ability to enable several key functions. It can securely protect and store keys and data (via binding, sealing, or key storage). Binding is encrypting to a key that is tied to the TPM's Storage Root Key. Data sealing requires Platform Configuration Registers (PCRs) to contain certain values for decryption to take place. The TPM can conduct asymmetric key operations on the chip (outside of the eyes of even a kernel root-kit or other OS level attacks); it provides a "smart-card" like ability to store and protect private keys used in user applications (instead of storing these keys on a disk that could be compromised by the OS); it stores hash values in built-in registers called PCRs that represent the state of software on the system (reset at reboot); and it can attest to the state of the system using these values [6]. Although it provides cryptographic functions, it is not a cryptographic accelerator, and as a commodity chip it is not fast enough to be used as such, as commands can take 100 ms or more to return a result. This fact can limit the extent to which we can rely on its cryptographic functions from within a time sensitive application. The TPM 1.2 specification also provides a number of monotonically increasing counters, primarily used to prevent replay attacks within TPM software, and a secure timer that can also be used for this purpose. It also provides additional NVRAM for the storage of a limited amount of user created data within the chip [8].
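As a rough illustration of the sealing semantics just described (and not of the real TPM protocol, which performs the key protection and PCR comparison inside the chip), the following sketch only returns plaintext when the caller's current PCR values match the values recorded at seal time; the SealedBlob class and its use of a software-held AES key are assumptions made purely for illustration.

```java
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Conceptual model of TPM sealing: decryption succeeds only if the platform's
// current PCR values match those captured at seal time. A real TPM keeps the
// key and the PCR comparison inside the chip; here both are simulated in software.
public class SealedBlob {
    private final byte[] ciphertext;
    private final byte[] iv;
    private final Map<Integer, byte[]> expectedPcrs; // PCR index -> digest at seal time
    private final SecretKey key;                     // stands in for a TPM-protected key

    private SealedBlob(byte[] ct, byte[] iv, Map<Integer, byte[]> pcrs, SecretKey key) {
        this.ciphertext = ct; this.iv = iv; this.expectedPcrs = pcrs; this.key = key;
    }

    public static SealedBlob seal(byte[] plaintext, Map<Integer, byte[]> currentPcrs) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return new SealedBlob(c.doFinal(plaintext), iv, currentPcrs, key);
    }

    public byte[] unseal(Map<Integer, byte[]> currentPcrs) throws Exception {
        for (Map.Entry<Integer, byte[]> e : expectedPcrs.entrySet()) {
            if (!Arrays.equals(e.getValue(), currentPcrs.get(e.getKey()))) {
                throw new SecurityException("PCR " + e.getKey() + " does not match sealed state");
            }
        }
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return c.doFinal(ciphertext);
    }
}
```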
IV. METHODS FOR UTILIZATION OF TRUST COMPONENTS IN HDFS

There are many possible solutions to protection of the data in a Hadoop storage and processing environment. One such approach in a layered security model is implementing the
idea of "Trust" by utilizing a hardware, software, and infrastructure stack based on the Trusted Platform Module and related software stacks. Using these technologies, a two-part trust-based security model can be implemented: an IT framework and a software integration. Part one is a general IT framework using existing technologies to provide certain assurances about each member participating in a Hadoop storage cloud. This framework is not unique to the Hadoop problem set; however, it will address this problem specifically. Part two is software changes in Hadoop to utilize Trusted Computing functions. The two-part model of trust in a Hadoop IT infrastructure can be implemented independently, with each layer providing an increase in assurances about the state of the platform.

A. Layer 1 - Trusted Hadoop IT Infrastructure

Trust in the TCG framework can be built from the hardware up, with each verified item becoming a building block in a greater trusted infrastructure. Being able to start at the lowest layer, while increasing infrastructure complexity, will also provide better assurances at the software level. The idea is to ensure that each component of the operating environment that is relevant to the security and integrity of the system is measured and verified against known values [7]. Without this "root of trust", one cannot be sure that higher-level elements (such as an application, file storage, etc.) have not been compromised at a lower level (such as by an operating system root kit). By implementing a measured and verified system launch, when Hadoop code is executed it will be possible to know if the host operating system, key system files, and the Hadoop software itself are in a known, good state. This idea of a ground-up chain of trust is not a new concept; however, implementation into the mainstream could be described as slow at best, but increasing.

Figure 1. Hadoop trusted computing enabled high-level IT architecture

Figure 1 illustrates the components in a Trusted Computing enabled IT architecture for Hadoop, through a combination of node-level services, network services, and software integration. Two optional components are shown, the first being a traditional runtime host integrity monitor, which is not related to the trusted computing components. Although we provide validation of integrity during system initialization, this component can give us an additional level of insight into runtime changes. This software would be part of the measured trusted computing base. The second optional component is the Trusted Network Connect (TNC) component. This is due to the fact that the TNC protocol and supporting software is still in the developmental stage at this time. It would not be practical to assume that a TNC enabled switch is present, particularly due to a mix of vendor support.

Figure 2. HDFS Node Trust Chain

A Chain of Trust is one of the core concepts of Trusted Computing and a key enabler to providing assurances that the platform is in a known good state and has not been subverted [7]. The general idea is that an entity (e.g., the boot loader) measures the next piece of code in the execution chain (e.g., the kernel); this entity then measures the next piece of code to be executed (an application), and so on. This chain of trust needs an initialization point that is inherently trusted. This is ideally the firmware boot block. This boot block code would be considered the "Core Root of Trust for Measurement". This type of chain of trust is considered a Static Root of Trust [6].

There are two options for establishment of the initial chain of trust. The first is implementing a complete Static Root of Trust Measurement (SRTM), which, through a combination of components, is essentially what we will define. TPM enabled firmware is measured as part of a root of trust, and is perhaps an under-considered place to start while examining the security posture of a system. This firmware is software as well and represents another possible intrusion point into a system, particularly with the BIOS code base becoming more complex in UEFI based systems. Industry has recently taken notice of the potential for firmware compromises, particularly in response to a NIST publication (SP 800-147), published in 2011, which defines standards for BIOS protections. To reduce the complexity of the SRTM model, an alternative approach has been developed called Dynamic Root of Trust Measurement (DRTM), using special hardware features such as Intel Trusted Execution Technology (TXT) [8]. Instead of relying on a static root of trust from BIOS, this technology makes use of protected CPU operations and dedicated, resettable PCRs while executing certain protected code [8]. The Linux TBOOT project makes use of this technology, and it has the promise of reducing the complexity of establishing a Trusted Computing Base (TCB); however, there has been a demonstrated attack against TXT using System Management Mode (SMM) [9].

We will consider the OS loader as the first practical step in our chain of trust in this example. TrustedGRUB is a Linux boot loader that implements a Static Root of Trust by measuring key OS files [10]. Once the system firmware starts, PCRs are extended with the cryptographic hashes of BIOS components, option ROMs, and the bootloader [10]. The TPM 1.2 specification calls for a minimum of 24 160-bit PCRs [5]. PCRs
can only be reset at reboot and are extended as follows: PCR = SHA1(PCR + measurement) [5]. Since the PCR is always extended, it essentially maintains a chain of each item it was extended with [6]. Once the final application is being executed, the PCR should be of a known good value each time, demonstrating that the previously loaded components are in a good state. PCRs 0-15 are reserved for Static Root of Trust measurements [5]. After the BIOS passes control to the boot loader, the second stage of our measured and verified boot process begins. TrustedGRUB validates each stage of the boot loader and extends PCRs, including measurements of the system kernel and the initrd (initial RAM disk), before passing control to the kernel [10]. TrustedGRUB can also be configured to validate other arbitrary files, such as the password file. At a minimum, we will consider the additional measurement of the Hadoop configuration files, the Hadoop software, and the Java software as targets to check with the checkfile. Note that TrustedGRUB does not protect the known good values of the files to be checked, which means that if we only use TrustedGRUB, we will need to tie something to the values of the PCRs (so we know the files were checked and in a known good state) and/or remotely attest the PCR values before letting the node join the cluster [10]. Also, TrustedGRUB does not do anything itself to secure the remaining boot process (a weakness that is being addressed in TBOOT). The next step is to use these measurements as part of a decryption process to decrypt sensitive data that was sealed with the TPM's Storage Root Key and bound to these values.
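The extend operation itself is compact enough to show directly. The sketch below is a software mock of a PCR rather than an interaction with a real TPM; it demonstrates how a 160-bit register accumulates a chain of SHA-1 measurements so that a change in any measured component produces a different final value.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Software mock of the TPM extend operation: PCR_new = SHA1(PCR_old || measurement).
public class PcrMock {
    private byte[] value = new byte[20]; // PCRs reset to all zeros at reboot

    public void extend(byte[] measurement) throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(value);
        sha1.update(measurement);
        value = sha1.digest();
    }

    public byte[] read() {
        return value.clone();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        PcrMock pcr = new PcrMock();
        // Each boot stage would extend the PCR with a hash of the next component.
        for (String component : new String[] {"bootloader", "kernel", "initrd", "hadoop-config"}) {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            pcr.extend(sha1.digest(component.getBytes(StandardCharsets.UTF_8)));
        }
        System.out.printf("final PCR value: %040x%n", new java.math.BigInteger(1, pcr.read()));
    }
}
```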
It is possible that measuring the boot loader, kernel, and key configuration files may be considered an adequate check when combined with traditional security practices. Our Hadoop HDFS file system integrity files could be sealed and bound to the PCRs generated in this state, which would prevent that physical file system from being decrypted if something has changed. However, to get more granular control and remote reporting, additional layers are required. This is where our next node-level component comes in: the Linux integrity subsystem's Integrity Measurement Architecture and Extended Verification Module (IMA/EVM). Part of the TCG requirement is that all Trusted Computing Base (TCB) files be measured, and re-measured if the file has changed, before the file is read or executed [11]. The TCB is comprised of all files critical in establishing a trusted environment [11]. The IMA subsystem enables this by storing and maintaining an integrity measurement log, extending a PCR (PCR 10 by default), attesting (by signing the PCR with the TPM endorsement key), and storing an integrity checksum as an extended file attribute [11]. It can also enable local validation of files. The IMA security extended attributes can then be protected by the Extended Verification Module. EVM provides a framework and two methods for detecting offline tampering of the security extended attributes [11]. The initial method maintains an HMAC-SHA1 hash across a set of security extended attributes, storing the HMAC as the extended attribute 'security.evm' [12]. The other method is based on a digital signature of the hash of the security extended attributes. To verify the integrity of an extended attribute, EVM re-calculates either the HMAC or the hash, and compares it with the version stored in 'security.evm' [12].
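To illustrate EVM's first method, the sketch below computes an HMAC-SHA1 across a set of security extended attribute values, in the spirit of what is stored in 'security.evm'. The attribute values, the demo key, and the HmacOverXattrs class are illustrative assumptions; the kernel implementation also mixes in inode metadata and protects the HMAC key (for example, by sealing it with the TPM).

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Illustration of EVM's HMAC method: an HMAC-SHA1 across a file's security
// extended attributes. The kernel also mixes in inode metadata and keeps the
// HMAC key protected (e.g., sealed by the TPM); this sketch omits those details.
public class HmacOverXattrs {
    public static byte[] hmacOfAttributes(byte[] key, Map<String, byte[]> securityXattrs) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(key, "HmacSHA1"));
        for (Map.Entry<String, byte[]> attr : securityXattrs.entrySet()) {
            mac.update(attr.getKey().getBytes(StandardCharsets.UTF_8));
            mac.update(attr.getValue());
        }
        return mac.doFinal(); // would be stored as the 'security.evm' xattr
    }

    public static void main(String[] args) throws Exception {
        Map<String, byte[]> xattrs = new LinkedHashMap<>();
        xattrs.put("security.ima", "sha1:0123...".getBytes(StandardCharsets.UTF_8));
        xattrs.put("security.selinux", "system_u:object_r:etc_t:s0".getBytes(StandardCharsets.UTF_8));
        byte[] hmac = hmacOfAttributes("demo-key-not-a-real-evm-key".getBytes(StandardCharsets.UTF_8), xattrs);
        System.out.printf("security.evm = %040x%n", new java.math.BigInteger(1, hmac));
    }
}
```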
With the ability to maintain a locally validated base environment, the combination of a trusted boot loader and IMA/EVM will provide a good foundation for a trusted environment in which to launch a critical application such as Hadoop. Furthermore, IMA provides a more dynamic run-time verification environment than TrustedGRUB alone, as individual files do not have to be called out for verification. Still, it would not be practical to apply this type of measurement directly to Hadoop data blocks, as it would be normal for these to change and the extra overhead would be undesirable. The next level in the establishment of a trusted infrastructure for Hadoop will involve remote attestation of the node to a remote verifier. The Trusted Computing Group has defined an interface called Platform Trust Services to define the parameters for this exchange between a client and a remote verifier [13]. OpenPTS is an implementation of this standard. Platform Trust Services are intended to work in conjunction with other TCG efforts, particularly in the way of trusted network access via TNC, and OpenPTS has experimental ability to provide this connection [13].

In Platform Attestation, a reference manifest is created on a trusted system, essentially a database of known good values for a system, and stored on the remote verifier. An untrusted client will submit an integrity report to the remote verifier, which the verifier will check against the known values in the reference manifest [14]. This process is conducted over an encrypted channel. In and of itself, this process does not necessarily provide any additional security. The next step is to build integration glue, such as TNC. In developing a trusted infrastructure for Hadoop, we will need to integrate these results into our Hadoop infrastructure. This means we will need to block the execution of Hadoop on a node that fails verification and inform a management server that this node is potentially compromised.
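On the verifier side, the comparison against the reference manifest can be pictured as the short sketch below, which assumes a node's PCR quote has already been received over the encrypted channel and its signature verified; the ManifestVerifier class is our own illustration, not part of OpenPTS.

```java
import java.util.Map;

// Verifier-side sketch: compare a node's reported PCR values (from a signed quote,
// assumed already verified) against the reference manifest of known-good values.
// On mismatch, the node should be blocked from the Hadoop cluster and an alert raised.
public class ManifestVerifier {
    private final Map<Integer, String> referenceManifest; // PCR index -> expected hex digest

    public ManifestVerifier(Map<Integer, String> referenceManifest) {
        this.referenceManifest = referenceManifest;
    }

    public boolean nodeIsTrusted(Map<Integer, String> reportedPcrs) {
        for (Map.Entry<Integer, String> expected : referenceManifest.entrySet()) {
            String reported = reportedPcrs.get(expected.getKey());
            if (reported == null || !reported.equalsIgnoreCase(expected.getValue())) {
                System.err.println("PCR " + expected.getKey() + " mismatch; blocking node and alerting");
                return false; // caller would prevent the Hadoop daemons from starting/joining
            }
        }
        return true;
    }
}
```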
Although not part of the hardware-based trusted computing components, SELinux can also be used to provide a policy-based Mandatory Access Control framework that will work in conjunction with the integrity monitoring and validation controls to protect key files from runtime modification [11]. This can, and should, include the actual data block files. SELinux can provide fine-grained access control, and can be configured to only allow the Hadoop Java process to access these files and to block other processes and users.

We can now execute our Hadoop software and HDFS environment with an assurance that the software and configuration are in a known state, and that the underlying system has not been compromised. We can add value to our application in a few ways. With HDFS, the obvious choices are protecting the NameNode's metadata with binding and sealing via TPM encryption keys and platform state as reported by the PCRs. On the DataNodes, we can ensure that the data has not been altered while a node was offline by sealing and binding the block integrity metadata. We can protect data in transit between the nodes by configuring IPsec
(or SSL, SSH) tunnels using keys stored in, or protected by, the TPM. Hadoop web interfaces could be protected via SSL, with the private key stored on the TPM using openssl-tpm-engine.

B. Layer 2 – HDFS Software Components

The following are some possible areas for consideration for Hadoop application layer integration with the Trusted Software Stack.

Data Integrity

The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data it received matches the checksum stored in the associated checksum file [4].

Hidden integrity files could be encrypted and sealed with a TPM key within the application, to increase the tamper evidence of the file system. This could be achieved within the HDFS code, or by altering the code to store the checksum files in a separate space that would be bound and sealed at the OS level. If data is altered outside of HDFS, the checksum file could not be altered without accessing the TPM. There is a possible issue with TPM speed in data retrieval and concurrency, but this may be mitigated by the fact that the file chunks are generally large and the unsealing of the integrity file could occur at the same time as retrieval. Another mitigation could be the creation of both a secure checksum and a standard one. The secure checksum would not necessarily be checked on each access, but at random, by request, or during an integrity verification operation when the node is not busy. Another alternative would be to utilize a secondary process to periodically verify and encrypt file hashes and maintain a database of these hashes as a single file, encrypted by a key unique to each DataNode and protected by the TPM. A client application could request validation of the integrity information if it is handling sensitive data, or on a random basis. Periodic tamper checks could be conducted on the system against this database of checksums.
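One way to picture the sealed integrity file idea is the sketch below, in which the hidden checksum file is encrypted with a key that is only released while the platform is in a trusted state. The TpmKeyStore interface is hypothetical and stands in for whatever TSS call (e.g., through jTSS) would actually unseal the key; the file layout (IV prepended to ciphertext) is likewise an assumption.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

// Sketch: encrypt/decrypt an HDFS hidden checksum file with a key that is only
// released while the platform PCRs match a trusted state. TpmKeyStore is a
// hypothetical stand-in for whatever TSS call would unseal the key from the TPM.
public class SealedChecksumFile {

    public interface TpmKeyStore {
        SecretKey unsealKey(String keyName); // expected to fail if the PCR state is wrong
    }

    public static void protect(Path checksumFile, TpmKeyStore tpm) throws Exception {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, tpm.unsealKey("hdfs-checksum-key"), new IvParameterSpec(iv));
        byte[] ciphertext = c.doFinal(Files.readAllBytes(checksumFile));
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        Files.write(checksumFile, out); // IV is stored with the ciphertext
    }

    public static byte[] readProtected(Path checksumFile, TpmKeyStore tpm) throws Exception {
        byte[] all = Files.readAllBytes(checksumFile);
        IvParameterSpec iv = new IvParameterSpec(Arrays.copyOfRange(all, 0, 16));
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, tpm.unsealKey("hdfs-checksum-key"), iv);
        return c.doFinal(Arrays.copyOfRange(all, 16, all.length));
    }
}
```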
Metadata

The FsImage and the EditLog are central data structures of HDFS on a NameNode. A corruption of these files can cause the HDFS instance to be non-functional [4].

The FsImage and EditLog could be secured by binding and sealing them to a TPM protected key and platform state. To accommodate backup NameNodes, this would have to use migratable keys. A hash of the FsImage could be taken periodically, and on a clean shutdown, and sealed with the TPM (then compared on system initialization). A TPM monotonic counter could potentially be used in the EditLog entries to mark the order of commits. To verify that extra data was not added while a system was offline, an encrypted chain of hashes, one for each line in the log, could be maintained.
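The encrypted chain of hashes over EditLog entries could be structured as in the following sketch: each entry's chain value depends on the previous one, so removing, reordering, or inserting entries while the system is offline breaks verification. Sealing the chain head with the TPM, and the EditLogHashChain class itself, are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Sketch of a hash chain over EditLog entries: chain[i] = SHA-256(chain[i-1] || entry[i]).
// Truncating, reordering, or inserting entries offline changes the final chain value,
// which would itself be sealed with the TPM and checked at startup (not shown here).
public class EditLogHashChain {
    private byte[] head = new byte[32]; // genesis value (all zeros)
    private final List<byte[]> chain = new ArrayList<>();

    public byte[] append(String logEntry) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        sha256.update(head);
        sha256.update(logEntry.getBytes(StandardCharsets.UTF_8));
        head = sha256.digest();
        chain.add(head);
        return head;
    }

    public boolean verify(List<String> entries) throws Exception {
        EditLogHashChain replay = new EditLogHashChain();
        for (String e : entries) {
            replay.append(e);
        }
        return MessageDigest.isEqual(replay.head, this.head);
    }
}
```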
Data Blocks

Typical HDFS applications write their data only once, but they read it one or more times and require these reads to be satisfied at streaming speeds. If possible, each chunk will reside on a different DataNode [4].

The Java framework could manage the encryption/decryption of the data block files, making calls to the TPM directly via a Java Trusted Software Stack (jTSS) interface to release a storage key. This would have the advantage of working outside of the operating system configuration. The potential issue becomes software decryption overhead. As a result, a user could mark more sensitive data as encrypted instead of encrypting everything. Also, not all data is necessarily needed at streaming speeds, depending on how it is used.
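A sketch of the selective approach is shown below: only blocks marked sensitive are decrypted, with the AES storage key assumed to have been released by the TPM (e.g., through jTSS) elsewhere in the application; the per-block IV layout and the SelectiveBlockReader class are illustrative assumptions, not Hadoop code.

```java
import java.io.DataInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

// Sketch: only blocks marked sensitive are stored encrypted, with the AES storage key
// assumed to be released by the TPM elsewhere; other blocks stay in the clear so
// ordinary reads keep streaming speed. Encrypted block files are assumed to carry a
// per-block IV in their first 16 bytes.
public class SelectiveBlockReader {
    private final SecretKey storageKey; // assumed unsealed via the TPM when the platform is trusted

    public SelectiveBlockReader(SecretKey storageKey) {
        this.storageKey = storageKey;
    }

    public InputStream openBlock(Path blockFile, boolean markedSensitive) throws Exception {
        InputStream raw = Files.newInputStream(blockFile);
        if (!markedSensitive) {
            return raw; // unencrypted block, no decryption overhead
        }
        byte[] iv = new byte[16];
        new DataInputStream(raw).readFully(iv); // per-block IV stored ahead of the ciphertext
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding"); // no padding; decryption streams
        c.init(Cipher.DECRYPT_MODE, storageKey, new IvParameterSpec(iv));
        return new CipherInputStream(raw, c);
    }
}
```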
Data Authorization and Protections

HDFS uses UNIX style permissions on files. A client authenticates with a username and password. There is optional support for Kerberos for stronger authentication services [4]. This could be extended by incorporating a user identity based on PKI, with interactions with the system over TLS. The user could also be required to access the cluster from an authorized system, based on a TPM identity. The TPM could be used to store and restrict access to PKI material.
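For the Kerberos option, a client-side configuration could look like the following sketch, assuming a KDC, a Hadoop principal, and a keytab already exist; the principal name and keytab path are placeholders.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch of enabling Kerberos ("strong") authentication for a Hadoop client.
// Assumes the cluster is configured for Kerberos and that the principal/keytab exist.
public class KerberosLogin {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos"); // default is "simple"
        conf.set("hadoop.security.authorization", "true");
        UserGroupInformation.setConfiguration(conf);
        // Placeholder principal and keytab path; the keytab itself could live on a
        // TPM-protected (e.g., sealed or dm-crypt) partition as discussed in the text.
        UserGroupInformation.loginUserFromKeytab("hdfs-client@EXAMPLE.COM", "/etc/security/keytabs/client.keytab");
        System.out.println("Logged in as: " + UserGroupInformation.getCurrentUser());
    }
}
```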
V. PRELIMINARY IT ARCHITECTURE IMPLEMENTATION AND EVALUATION

To gain a preliminary idea of the effectiveness of a Hadoop HDFS cluster with trusted computing components, an initial testing environment was created. This environment consisted of a two-node cluster implementing TrustedGRUB, IMA/EVM, and OpenPTS, as well as encryption and IPsec key protection using the TPM. Although architectures implementing TCG Trusted Computing components harbor the promise of increased security through a strong hardware based root of trust, there are still issues impeding the full realization of this promise. Complexity of configuration, and the lack of unity and support among the various packages implementing TCG capability, stand in the way of mainstream adoption. Our implementation of Trusted Computing components to secure a Hadoop HDFS architecture illustrated how these issues will be roadblocks to an institution looking to implement this solution. For instance, supporting EVM/IMA in Fedora 16 required rebuilding the kernel from source. In fact, most components required manipulation of the build environment and intense tweaking to use.

Despite the implementation challenges, it was possible to secure data within HDFS in several ways. First, we can gain insight into the state of the OS platform via the PCRs and can compare this, with OpenPTS, against a known good set. This gives us the ability to make an assumption of trust of the base platform, at least as of boot time, and is the primary benefit of these components. For instance, changing software that was included in the measurements, such as a Hadoop Java package, was easily detectable. Next, we can apply some HDFS specific protections. Without these components in place, it is possible to extract information about HDFS file system blocks and alter these manually via the local file system within an HDFS node. To avoid a checksum failure, this would have to be done on any copies of the file within the
cluster. Checksums are stored as binary files by HDFS nodes along with the data chunks; however, these can be updated with the HDFS API by an attacker who has modified the associated chunks. Although this solution does not prevent runtime modification of these files on its own, it is possible to prevent this from occurring when the system is offline, due to the checksums being in an encrypted directory. The desire was to seal and bind the checksum data separately from the actual data blocks. However, this proved difficult, as the data blocks are stored in the same directory. As a result, from a system level, a custom script would be needed to replace the checksum files with symlinks and move the actual files to a directory encrypted with a TPM key and bound to a trusted PCR state. With this, we were not able to tamper with data blocks on an offline system without being detected by a checksum failure, as these checksums were encrypted at rest. To reduce the risk of runtime modification, it would be advantageous for the checksums to be individually encrypted and decrypted by the software itself as needed; this way an attacker would not have access to the checksums simply from having access to the file system, as in the case of a directory that is unlocked at boot time. Instead, an attacker would have to either manipulate the software into encrypting replacement checksums or gain access to the ability to encrypt a file with the TPM stored key.
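A minimal version of such a script, written here in Java with only standard file APIs, might move each block's checksum (.meta) file into a directory mounted from the TPM/dm-crypt protected volume and leave a symlink behind. The directory layout and the *.meta pattern are assumptions about a typical DataNode data directory, and the class is illustrative only.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: relocate DataNode checksum (.meta) files into an encrypted directory
// (e.g., a dm-crypt volume whose key is sealed to the TPM) and leave symlinks in
// place so HDFS still finds them. Paths and the *.meta pattern are illustrative.
public class RelocateChecksums {
    public static void relocate(Path blockDir, Path encryptedDir) throws IOException {
        Files.createDirectories(encryptedDir);
        try (DirectoryStream<Path> metaFiles = Files.newDirectoryStream(blockDir, "*.meta")) {
            for (Path meta : metaFiles) {
                if (Files.isSymbolicLink(meta)) {
                    continue; // already relocated
                }
                Path target = encryptedDir.resolve(meta.getFileName());
                Files.move(meta, target);               // move checksum into the protected volume
                Files.createSymbolicLink(meta, target); // leave a symlink at the original location
            }
        }
    }

    public static void main(String[] args) throws IOException {
        relocate(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```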
The NameNode partition was also protected with a TPM key and tied to a platform PCR state via dm-crypt [15]. As a result, offline tampering with file metadata was not possible, as the data was encrypted and not accessible outside of the running system. Again, runtime manipulation by a super user would still be possible. Data in transit protections were possible by generating key pairs for use in IPsec; however, a more automated method of setting up the tunnels with the keys would be desirable. One solution is to have the private key stored on an encrypted partition that is only unlocked when the system is in a trusted state. Again, this leaves the key open to runtime compromise. Being able to keep the key on the TPM (such as with the openssl-tpm-engine) without human interaction would be desirable.

VI. FUTURE WORK AND CONCLUSION

Future work will include examining the proposed Hadoop software integrations that would make use of the TPM to complement the IT components. OpenPTS attestation integration into a system that blocks a node from entering a Hadoop cluster and provides an alert would also be desirable. In addition, a more complete experimental evaluation of the relative security provided by the hardware trust enabled architecture is planned, as well as a performance test to determine the trade-offs.

Data stored on distributed cloud computing systems such as Apache Hadoop provides an attractive target for hackers, insiders, competitors, and nation-state adversaries. Implementing trusted computing concepts and utilizing technologies like the TPM in an IT architecture, and even in a software design, can provide a system with a good security core, and provide a compelling boundary in a layered security design. The TCG Trusted Computing concepts discussed do not provide a silver bullet. Although this technology has been maturing for several years, the adoption and maturity level of software based on these standards is still evolving, and potentially cumbersome for an organization to deploy in an agile manner. Our research into a real-world trusted architecture for Hadoop illustrated this point. However, it is clear that organizations with real security concerns regarding their Hadoop infrastructure (and IT in general) should begin considering what can be done with TCG trusted computing concepts.

VII. REFERENCES

[1] White, Tom. Hadoop: The Definitive Guide. Sebastopol: O'Reilly, 2010.
[2] O'Malley, Owen. Hadoop Security. [Online] 2010. [Cited: Jun 15, 2012.] http://www.slideshare.net/hadoopusergroup/1-hadoop-securityindetailshadoopsummit2010.
[3] Jain, Nitin. Hadoop - How it manages security. [Online] Nov 28, 2011. [Cited: Jun 15, 2012.] http://clustermania.blogspot.com/2011/11/hadoop-how-it-manages-security.html.
[4] Apache. HDFS Architecture Guide. [Online] May 8, 2012. [Cited: Jun 15, 2012.] http://hadoop.apache.org/common/docs/stable/hdfs_design.html.
[5] Trusted Computing Group. TCG Specification Architecture Overview, Version 1.4. [Online] Aug 2, 2007. [Cited: Jun 15, 2012.] http://www.trustedcomputinggroup.org/files/resource_files/AC652DE1-1D09-3519-ADA026A0C05CFAC2/TCG_1_4_Architecture_Overview.pdf.
[6] Trusted Computing Group. TPM Main Part 1 Design Principles, Specification Version 1.2, Revision 116. [Online] Mar 1, 2011. [Cited: Jun 15, 2012.] http://www.trustedcomputinggroup.org/files/static_page_files/72C26AB5-1A4B-B294-D002BC0B8C062FF6/TPM%20Main-Part%201%20Design%20Principles_v1.2_rev116_01032011.pdf.
[7] Ryan, Mark and Dinh Tien Tuan Anh. Trusted Computing: TCG proposals. Computer Security lecture notes. [Online] Nov 4, 2006. [Cited: Jun 15, 2012.] http://www.cs.bham.ac.uk/~mdr/teaching/modules/security/lectures/TrustedComputingTCG.html.
[8] Nie, Cong. Dynamic Root of Trust in Trusted Computing. Telecommunications Software and Multimedia Laboratory. [Online] Oct 2007. [Cited: Jun 15, 2012.]
[9] Wojtczuk, Rafal and Rutkowska, Joanna. "Press Cheat Sheet" for Attacking Intel Trusted Execution Technology. Invisiblethingslab.com. [Online] Feb 2009. [Cited: Jun 15, 2012.] http://invisiblethingslab.com/press/itl-press-2009-02.pdf.
[10] TrustedGRUB Documentation. [Online] Sirrix AG security technologies. [Cited: Jun 15, 2012.] http://projects.sirrix.com/trac/trustedgrub/wiki/Documentation.
[11] Linux IMA Wiki. SourceForge IMA Project. [Online] May 18, 2012. [Cited: Jun 15, 2012.] http://sourceforge.net/apps/mediawiki/linux-ima/index.php?title=Main_Page.
[12] Zohar, Mimi. EVM. LWN.net. [Online] Jun 24, 2010. [Cited: Jun 15, 2012.] http://lwn.net/Articles/393673/.
[13] Munetoh, Seiji. Open Platform Trust Services. OpenPTS SourceForge Project. [Online] May 16, 2011. [Cited: Jun 15, 2012.] http://iij.dl.sourceforge.jp/openpts/51879/userguide-0.2.4.pdf.
[14] The return of EVM. LWN.net. [Online] Jun 30, 2010. [Cited: Jun 15, 2012.] http://lwn.net/Articles/394170/.
[15] IBM. Securing sensitive files with TPM keys. Linux Information Center. [Online] IBM. [Cited: Jun 15, 2012.] http://publib.boulder.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=%2Fliaai%2Ftpm%2Fliaaitpmstart.htm.
