Storage and Information Management - Unit 1 - Management Philosophies


Information lifecycle management:
Information lifecycle management (ILM) is a process for managing information through its lifecycle, from conception until disposal, in a manner that optimizes storage and access at the lowest cost. ILM is not just hardware or software; it includes the processes and policies used to manage the information. It is built on the recognition that different types of information can have different values at different points in their lifecycle. Predicting storage needs and controlling costs can be especially challenging as the business grows. The overall objectives of managing information with ILM are to help reduce the total cost of ownership (TCO) and to help implement data retention and compliance policies. To implement ILM effectively, the owners of the data need to determine how information is created, how it ages, how it is modified, and if/when it can safely be deleted. ILM segments data according to value, which helps create an economical balance and a sustainable strategy that aligns storage costs with business objectives and information value.

ILM elements
To manage the data lifecycle and make your business ready for on demand, four main elements address your business in an ILM structured environment:
1) Tiered storage management
2) Long-term data retention
3) Data lifecycle management
4) Policy-based archive management

Tiered storage management
Most organizations today seek a storage solution that can help them manage data more efficiently. They want to reduce the cost of storing large and growing amounts of data and files while maintaining business continuity. Tiered storage can reduce overall disk-storage costs by providing benefits such as:
1) Reducing overall disk-storage costs by allocating the most recent and most critical business data to higher-performance disk storage, while moving older and less critical business data to lower-cost disk storage.
2) Speeding business processes by providing high-performance access to the most recent and most frequently accessed data.
3) Reducing administrative tasks and human errors, because older data can be moved to lower-cost disk storage automatically and transparently.

Typical storage environment
Storage environments typically have multiple tiers of data value, such as application data that is needed daily and archive data that is accessed infrequently. However, typical storage configurations offer only a single tier of storage, which limits the ability to optimize cost and performance.

Multi-tiered storage environment
A tiered storage environment that utilizes the SAN infrastructure affords the flexibility to align storage cost with the changing value of information. The tiers are related to data value: the most critical data is allocated to higher-performance disk storage, while less critical business data is allocated to lower-cost disk storage.
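To make the tiering idea concrete, here is a minimal Python sketch (not any vendor's product behavior) that assigns data sets to hypothetical tiers based on criticality and recency of use; the tier names, the 30-day and one-year thresholds, and the sample data sets are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical tiers, ordered from most expensive/fastest to cheapest/slowest.
TIERS = ["high-performance disk", "low-cost disk (SATA)", "tape archive"]

@dataclass
class DataSet:
    name: str
    business_critical: bool
    last_accessed: date

def choose_tier(ds: DataSet, today: date) -> str:
    """Pick a tier from the data set's value and how recently it was used."""
    age = today - ds.last_accessed
    if ds.business_critical and age <= timedelta(days=30):
        return TIERS[0]          # recent, critical data stays on fast disk
    if age <= timedelta(days=365):
        return TIERS[1]          # older or non-critical data moves to cheap disk
    return TIERS[2]              # rarely used data goes to the archive tier

if __name__ == "__main__":
    today = date(2024, 1, 1)
    sample = [
        DataSet("orders-db", True, date(2023, 12, 28)),
        DataSet("2022-logs", False, date(2023, 6, 1)),
        DataSet("2015-project-files", False, date(2016, 2, 1)),
    ]
    for ds in sample:
        print(ds.name, "->", choose_tier(ds, today))
```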


An IBM ILM solution in a tiered storage environment is designed to:
1) Reduce the total cost of ownership (TCO) of managing information. It can help optimize data costs and management, freeing expensive disk storage for the most valuable information.
2) Segment data according to value. This can help create an economical balance and a sustainable strategy to align storage costs with business objectives and information value.
3) Help make decisions about moving, retaining, and deleting data, because ILM solutions are closely tied to applications.
4) Manage information and determine how it should be handled based on content, rather than migrating data based on technical specifications. This approach can result in more responsive management, and offers the ability to retain or delete information in accordance with business rules.
5) Provide the framework for a comprehensive enterprise content management strategy.

Long-term data retention
There is a rapidly growing class of data that is best described by the way in which it is managed rather than by the arrangement of its bits. The most important attribute of this kind of data is its retention period, hence it is called retention-managed data, and it is typically kept in an archive or a repository. In the past it has been variously known as archive data, fixed content data, reference data, unstructured data, and other terms implying its read-only nature. It is often measured in terabytes and is kept for long periods of time, sometimes forever.

Data lifecycle management
At its core, the process of ILM moves data up and down a path of tiered storage resources, including high-performance, high-capacity disk arrays, lower-cost disk arrays such as serial ATA (SATA), tape libraries, and permanent archival media where appropriate. Yet ILM involves more than just data movement; it encompasses scheduled deletion and regulatory compliance as well. Because decisions about moving, retaining, and deleting data are closely tied to how applications use the data, ILM solutions are usually closely tied to applications. By migrating unused data off of more costly, high-performance disks, ILM is designed to help:
1) Reduce the costs to manage and retain data.
2) Improve application performance.
3) Reduce backup windows and ease system upgrades.
4) Streamline data management.
5) Allow the enterprise to respond to demand in real time.
6) Support a sustainable storage management strategy.
7) Scale as the business grows.

Policy-based archive management
As businesses of all sizes migrate to e-business solutions and a new way of doing business, they already have mountains of data and content that have been captured, stored, and distributed across the enterprise. This wealth of information provides a unique opportunity. By incorporating these assets into e-business solutions, and at the same time delivering newly generated information media to their employees and clients, a business can reduce costs and information redundancy and leverage the potential profit-making aspects of its information assets.
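The retention-period and scheduled-deletion decisions described under "Long-term data retention" and "Data lifecycle management" above can be pictured with a small sketch; the data classes and retention periods below are invented placeholders, not real policy values.

```python
from datetime import date, timedelta

# Hypothetical retention periods per data class; real values come from
# business and regulatory policy, not from this example.
RETENTION = {
    "financial-record": timedelta(days=7 * 365),
    "email-archive":    timedelta(days=3 * 365),
    "application-log":  timedelta(days=90),
}

def disposition(data_class: str, created: date, today: date) -> str:
    """Decide whether an item must still be retained or may be deleted."""
    period = RETENTION.get(data_class)
    if period is None:
        return "retain (no policy defined, keep by default)"
    if today - created < period:
        return "retain until " + str(created + period)
    return "eligible for deletion"

print(disposition("application-log", date(2023, 1, 10), date(2024, 1, 1)))
print(disposition("financial-record", date(2023, 1, 10), date(2024, 1, 1)))
```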

Five Pillars of Technology:
Technologies are not tied to one language; in fact, the internet breaks down language barriers. Technology is not about expressions of the divine, but it is all about idioms, idiomatic expressions of what people claim as important (even sacred) in their lives; that is the blogosphere in a nutshell. Blogging, Facebook, and Twitter are deeply biographical; they are all about biography and community. Clearly, there is no single center of the net, and that is what gives it enormous power. This last point is tricky, because technology does not really lose its cohesiveness when met with a new environment, but it does become co-opted and gains a stronger cohesiveness when used well in the new setting.

Data proliferation:
Data proliferation refers to the unprecedented amount of data, structured and unstructured, that businesses and governments continue to generate at an unprecedented rate, and the usability problems that result from attempting to store and manage that data. While the term originally referred to problems associated with paper documentation, data proliferation has become a major problem in primary and secondary data storage on computers.

Problems caused by data proliferation:
1) Difficulty in finding and retrieving information.
2) Data loss and legal liability when data is disorganized, not properly replicated, or cannot be found in a timely manner.
3) Increased manpower requirements to manage increasingly chaotic data storage resources.
4) Slower network and application performance due to excess traffic as users search and search again for the material they need.
5) High cost in terms of the energy resources required to operate storage hardware. A 100 terabyte system can cost up to $35,040 a year to run.

Proposed solutions:
1) Applications that better utilize modern technology.
2) Reductions in duplicate data, especially duplication caused by data movement; a deduplication sketch follows this list.
3) Improvement of metadata structures.
4) Improvement of file and storage transfer structures.
5) The implementation of information lifecycle management solutions to eliminate low-value information as early as possible, before putting the rest into actively managed long-term storage where it can be quickly and cheaply accessed.
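Solution 2 above (reducing duplicate data) is often approached by comparing content hashes. The sketch below is a simplified illustration: it hashes every file under a directory and reports byte-identical duplicates. Production deduplication systems work at the block or chunk level and keep their indexes inside the storage system itself.

```python
import hashlib
import os

def find_duplicates(root: str) -> dict[str, list[str]]:
    """Group files under `root` by the SHA-256 hash of their contents.

    Any group with more than one path is a set of byte-identical copies
    that could be collapsed to a single stored instance. This is only a
    whole-file illustration of the idea, not a block-level deduplicator.
    """
    groups: dict[str, list[str]] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            groups.setdefault(digest.hexdigest(), []).append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_duplicates(".").items():
        print(digest[:12], paths)
```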


Data Center:
A data center is a facility used to house computer systems and associated components, such as telecommunications and storage systems. It generally includes redundant or backup power supplies, redundant data communications connections, environmental controls (e.g., air conditioning and fire suppression), and special security devices.

Data centers have their roots in the huge computer rooms of the early days of the computing industry. Early computer systems were complex to operate and maintain, and needed a special environment to keep working. Many cables were necessary to connect all the parts. Old computers also required a great deal of power and had to be cooled to avoid overheating. Security was important; computers were expensive and were often used for military purposes. For these reasons, engineering practices were developed from the start of the computing industry. Elements such as standard racks to mount equipment, elevated floors, and cable trays (installed overhead or under the elevated floor) were introduced in this early age, and have changed relatively little compared to the computer systems themselves.

A data center can occupy one room of a building, one or more floors, or an entire building. Most of the equipment is often in the form of servers racked up into 19-inch rack cabinets, which are usually placed in single rows forming corridors between them. This allows people access to the front and rear of each cabinet. The physical environment of the data center is usually under strict control:
1) Air conditioning is used to keep the room cool; it may also be used for humidity control. Generally, the temperature is kept around 20-22 degrees Celsius (about 68-72 degrees Fahrenheit). The primary goal of data center air conditioning systems is to keep the server components at the board level within the manufacturer's specified temperature/humidity range. This is crucial since electronic equipment in a confined space generates much excess heat and tends to malfunction if not adequately cooled. Air conditioning systems also help keep humidity within acceptable parameters, typically between 35% and 65% relative humidity. With too much humidity, water may begin to condense on internal components; with too little, static electricity may damage components. (A small monitoring sketch follows this list.)
2) Data centers often have elaborate fire prevention and fire extinguishing systems. Modern data centers tend to have two kinds of fire alarm systems: a first system designed to spot the slightest sign of particles being given off by hot components, so a potential fire can be investigated and extinguished locally before it takes hold (sometimes just by turning smoldering equipment off), and a second system designed to take full-scale action if the fire takes hold. Fire prevention and detection systems are also typically zoned, and high-quality fire doors and other physical fire-breaks are used, so that even if a fire does break out it can be contained and extinguished within a small part of the facility.
3) Backup power is provided via one or more uninterruptible power supplies and/or diesel generators.
4) To prevent single points of failure, all elements of the electrical systems, including the backup systems, are typically fully duplicated, and critical servers are connected to both the "A-side" and "B-side" power feeds.
5) Older data centers typically have raised flooring made up of 60 cm (2 ft) removable square tiles. The trend is towards an 80-100 cm void to provide better and more uniform air distribution. The raised floor provides a plenum for air to circulate as part of the air conditioning system, as well as space for power cabling.
6) Using conventional water sprinkler systems on operational electrical equipment can do just as much damage as a fire. Originally Halon gas, a halogenated organic compound that chemically stops combustion, was used to extinguish flames. However, the use of Halon has been banned by the Montreal Protocol because of the danger Halon poses to the ozone layer. More environmentally friendly alternatives include Argonite and HFC-227.
7) Physical security also plays a large role in data centers. Physical access to the site is usually restricted to selected personnel. Video camera surveillance and permanent security guards are almost always present if the data center is large or contains sensitive information on any of the systems within.
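As a small illustration of the environmental controls in item 1 of the list above, the following sketch checks made-up sensor readings against the 20-22 degree Celsius and 35-65% relative-humidity ranges quoted there; real monitoring systems poll actual sensors and feed a building-management system.

```python
# Threshold values taken from the ranges quoted above; the sensor names and
# readings themselves are made-up sample data.
TEMP_RANGE_C = (20.0, 22.0)
HUMIDITY_RANGE_PCT = (35.0, 65.0)

def check_reading(sensor: str, temp_c: float, humidity_pct: float) -> list[str]:
    """Return a list of alarm messages for one sensor reading."""
    alarms = []
    if not (TEMP_RANGE_C[0] <= temp_c <= TEMP_RANGE_C[1]):
        alarms.append(f"{sensor}: temperature {temp_c} C outside {TEMP_RANGE_C}")
    if not (HUMIDITY_RANGE_PCT[0] <= humidity_pct <= HUMIDITY_RANGE_PCT[1]):
        alarms.append(f"{sensor}: humidity {humidity_pct}% outside {HUMIDITY_RANGE_PCT}")
    return alarms

for reading in [("rack-A1", 21.0, 48.0), ("rack-B3", 27.5, 30.0)]:
    for alarm in check_reading(*reading):
        print(alarm)
```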


The main purpose of a data center is to run the applications that handle the core business and operational data of the organization. Communications in data centers today are most often based on networks running the IP protocol suite. Data centers contain a set of routers and switches that transport traffic between the servers and to the outside world. Network security elements are also usually deployed: firewalls, VPN gateways, intrusion detection systems, and so on. Data centers are also used for off-site backups. Companies may subscribe to backup services provided by a data center. Encrypted backups can be sent over the Internet to a data center, where they can be stored securely.

Evolution of Storage System:
Storage systems have become an important component of information technology. Storage systems are built by taking the basic capability of a storage device, such as the hard disk drive, and adding layers of hardware and software to obtain a highly reliable, high-performance, and easily managed system. The first data storage device was introduced by IBM in 1956. Since then there has been remarkable progress in hard disk drive (HDD) technology, and this has provided the fertile ground on which the entire industry of storage systems has been built. It has long been recognized that the disk drive alone cannot provide the range of storage capabilities required by enterprise systems.

The first storage devices were directly controlled by the CPU. The key advantage of a control unit (or controller) was that the I/O commands from the CPU (sometimes called the host) were independently translated into the specific commands necessary to operate the HDD (sometimes called the direct access storage device, or DASD), so the HDD device itself could be managed independently and asynchronously from the CPU.

Storage systems leapt further ahead in the early 1990s when RAID (redundant array of independent disks) technology was introduced. RAID allowed the coordination of multiple HDD devices so as to provide higher levels of reliability and performance than could be provided by a single drive. The classical concept of parity was used to design reliable storage systems that continue to operate despite drive failures, and parallelism was used to provide higher levels of performance. RAID technology was delivered in low-cost hardware and by the mid 1990s became standard on servers that could be purchased for a few thousand dollars. Many variations on RAID technology have been developed and are used in large external storage systems that provide significant additional function, including redundancy (no single point of failure in the storage system) and copy services (copying of data to a second storage system for availability).
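The parity idea behind classical RAID can be shown in a few lines: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors. This is only the arithmetic at the heart of parity-based RAID levels such as RAID 5, not a description of any real controller.

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equally sized blocks byte by byte (the classical RAID parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Data blocks striped across three imaginary drives, plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing drive 1: its block is rebuilt from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print("rebuilt block:", rebuilt)
```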
Disaster recovery became a requirement for all IT systems and influenced the design of storage systems. Several techniques were developed to protect data in case of disaster. For example, a point-in-time copy (offered by IBM under the name FlashCopy) is the making of a consistent virtual copy of data as it appeared at a single point in time. This copy is then kept up to date by following pointers as changes are made. If desired, this virtual copy can, over time, be made into a real copy through physical copying. A second technique, mirroring or continuous copy (offered by IBM under the name Peer-to-Peer Remote Copy), involves two mirror copies of data, one at a primary (local) site and one at a secondary (recovery) site. The process is called synchronous when data must be successfully written at the secondary system before the write issued by the primary system is acknowledged as complete. Although synchronous operation is desirable, it is practical only over limited distances (on the order of 100 km). Furthermore, the requirements for data availability were not completely satisfied by reliable storage systems, because data could still be accidentally erased (through human error or software corruption), so additional copies were also needed for backup purposes. Backup systems were then developed that allowed users to make a complete backup of selected files or entire file systems.
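The point-in-time copy described above can be illustrated with a toy copy-on-write snapshot: reads of the snapshot follow pointers back to the live volume unless a block has changed since the snapshot was taken. The class below is a deliberately simplified sketch, not how FlashCopy or any real product is implemented.

```python
class Volume:
    """A toy block 'volume' with copy-on-write point-in-time snapshots."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []          # each snapshot: {block_index: old_data}

    def snapshot(self):
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, data):
        # Preserve the original block for every snapshot that has not yet
        # saved its own copy, then overwrite in place.
        for snap in self.snapshots:
            snap.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def read_snapshot(self, snap_id, index):
        # A snapshot read follows the pointer back to the live volume unless
        # the block was changed after the snapshot was taken.
        return self.snapshots[snap_id].get(index, self.blocks[index])

vol = Volume([b"a", b"b", b"c"])
snap = vol.snapshot()
vol.write(1, b"B")
print(vol.read_snapshot(snap, 1))   # b'b', the data as it was at snapshot time
print(vol.blocks[1])                # b'B', the current data
```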


The traditional method of backup was to make a backup copy on tape or, in the case of a personal computer, on a set of floppy disks or a small tape cartridge. However, as systems became networked together, LAN-based backup systems replaced media-oriented approaches; these ran automatically and unattended, often backing up from HDD to HDD. File-differential backup was subsequently introduced, in which only the changed bytes within a file are sent to and managed at the backup server.

Hierarchical Storage Management (HSM) is a data storage technique which automatically moves data between high-cost and low-cost storage media. HSM systems exist because high-speed storage devices, such as hard disk drive arrays, are more expensive (per byte stored) than slower devices, such as optical discs and magnetic tape drives. While it would be ideal to have all data available on high-speed devices all the time, this is prohibitively expensive for many organizations. Instead, HSM systems store the bulk of the enterprise's data on slower devices and then copy data to faster disk drives when needed. In effect, HSM turns the fast disk drives into caches for the slower mass storage devices. The HSM system monitors the way data is used and makes best guesses as to which data can safely be moved to slower devices and which data should stay on the fast devices. In a typical HSM scenario, data files which are frequently used are stored on disk drives but are eventually migrated to tape if they are not used for a certain period of time, typically a few months. If a user reuses a file which is on tape, it is automatically moved back to disk storage. The advantage is that the total amount of stored data can be much larger than the capacity of the available disk storage, but since only rarely used files are on tape, most users will usually not notice any slowdown.
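A toy sketch of the HSM migrate/recall cycle described above follows; the catalogue structure, the 90-day threshold, and the tier names are assumptions made for illustration, not the behavior of a specific HSM product.

```python
from datetime import datetime, timedelta

# Assumed migration threshold: files idle longer than this move to tape.
MIGRATE_AFTER = timedelta(days=90)

# A toy catalogue: file name -> [tier, last access time] (invented sample data).
catalogue = {
    "report-2023.doc":  ["disk", datetime(2023, 12, 20)],
    "scan-archive.tif": ["disk", datetime(2023, 1, 5)],
}

def migrate_idle_files(now: datetime) -> None:
    """Move files that have not been touched recently from disk to tape."""
    for name, entry in catalogue.items():
        tier, last_access = entry
        if tier == "disk" and now - last_access > MIGRATE_AFTER:
            entry[0] = "tape"
            print(f"migrated {name} to tape")

def open_file(name: str, now: datetime) -> None:
    """On access, recall the file to disk if it was migrated, then use it."""
    entry = catalogue[name]
    if entry[0] == "tape":
        entry[0] = "disk"
        print(f"recalled {name} from tape")
    entry[1] = now

now = datetime(2024, 1, 1)
migrate_idle_files(now)               # scan-archive.tif goes to tape
open_file("scan-archive.tif", now)    # and comes back transparently on use
```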


Storage Management Challenges:
1) Variety of information: information technology holds the promise of bringing a variety of new types of information to the people who need it.
2) Volume of data: data is growing exponentially.
3) Velocity of change: IT organizations are under tremendous pressure to deliver the right IT services. 85% of problems are caused by changes made by the IT staff, and 80% of problems are not detected by the IT staff until reported by an end user.
4) Leverage information: capitalize on data sharing for collaboration, along with storage investments and informational value. This can be addressed through reporting and data classification. Questions that may be asked include: a) How much storage do I have available for my applications? b) Which applications, users, and databases are the primary consumers of my storage? c) When do I need to buy more storage? d) How reliable is my SAN? e) How is my storage being used? (A capacity-reporting sketch follows this list.)
5) Optimize IT: automate and simplify IT operations, and optimize performance and functionality. This can be addressed by centralizing management and by storage virtualization. Questions asked include: a) How do I simplify and centralize my storage infrastructure? b) How do I know that storage is not the bottleneck for user response-time issues? c) Is the storage infrastructure available and performing as needed?
6) Mitigate risks: comply with regulatory and security requirements and keep the business running continuously. This can be addressed through tiered storage and ILM. Questions asked include: a) How do I monitor and centrally manage my replication services? b) How do I maintain storage service levels? c) Which files must be backed up, archived, and retained for compliance?
7) Enable business flexibility: provide a flexible, on-demand IT infrastructure and protect IT investments. This can be addressed through service management. Questions include: a) How can I automate the provisioning of my storage systems, databases, file systems, and SAN? b) How do I maintain storage service levels? c) How do I monitor and centrally manage my replication services?
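Questions such as "How much storage do I have available?" and "When do I need to buy more?" (items a and c under Leverage information above) are typically answered by capacity reporting. The sketch below uses Python's standard shutil.disk_usage for a single file system plus an assumed growth rate to project when free space runs out; a real storage resource management tool would aggregate many systems and derive the growth rate from historical measurements.

```python
import shutil

def capacity_report(path: str, daily_growth_gb: float) -> None:
    """Report free space for one file system and a naive 'buy more by' estimate.

    daily_growth_gb is an assumed average growth rate; a real tool would
    derive it from historical capacity measurements.
    """
    usage = shutil.disk_usage(path)
    free_gb = usage.free / 1e9
    used_pct = 100 * usage.used / usage.total
    print(f"{path}: {free_gb:.1f} GB free ({used_pct:.0f}% used)")
    if daily_growth_gb > 0:
        days_left = free_gb / daily_growth_gb
        print(f"  at {daily_growth_gb} GB/day, capacity runs out in ~{days_left:.0f} days")

capacity_report("/", daily_growth_gb=2.0)
```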


Storage Resource Management:
What needs to be managed?
1) Servers: applications, databases, file systems, volume managers, host bus adapters, and multi-path drivers.
2) Network components: switches, hubs, routers, and intelligent switch replication.
3) Storage components: volume mapping/virtualization, storage array provisioning, NAS filers, and tape libraries.
4) Discovery: topology views and asset management.
5) Configuration management: provisioning, optimization, and problem determination.
6) Performance management: bottleneck analysis and load balancing.
7) Reporting: asset/capacity/utilization, accounting/chargeback, performance/trending, and problem reports.


Few Issues Related to DATA
Data identity: Persistent unique identifiers (or an alternative means to achieve this functionality) will enable global cross-referencing between data objects. Such identifiers will be used not only for data and software but also for other resources such as people, equipment, and organizations. On the other hand, any scheme of identification is likely to undergo evolution, so preservation, and in particular the integration of archival and current data, is likely to require active management of identifiers.
Data objects: Data will be made available with all the necessary metadata to enable reuse. From its creation, and throughout its lifecycle, data will be packaged with its metadata, which will progressively accrue information about its history throughout its evolution.
Data agents: Data will be intelligent in that it maintains for itself knowledge of where it has been used as well as what it uses. (This can be achieved by bidirectional links between data and its uses, or by making the associations between data stand-alone entities themselves. In either case, active maintenance of the associations is required.)
Software: Software will join simulations, data, multimedia, and text as a core research output. It will therefore require similar treatment in terms of metadata schemas, metadata creation, and propagation (including the context of software development and use, resilience, versioning, and rights).
Data Forge: Rather like SourceForge for software, we imagine a global (probably distributed) self-service repository of data which is made available under a variety of access agreements. This places a requirement on data management technology for greater ease in data collection, interoperation, aggregation, and access. Technology and tooling must be developed to meet these requirements, both in the manifestation of the data itself and in the software that manages it. Critical to this will be the collection, management, and propagation of metadata along with the data itself.
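As a small illustration of the data identity and data objects points above, the sketch below wraps a payload with a persistent identifier and minimal metadata that can accrue history over time; the UUID scheme and the field names are assumptions for the example, and real archives use richer schemas and managed identifiers such as DOIs.

```python
import json
import uuid
from datetime import datetime, timezone

def package_with_metadata(payload: bytes, creator: str, description: str) -> dict:
    """Wrap a data object with an identifier and minimal, growable metadata.

    The identifier scheme (a UUID) and the metadata fields are illustrative
    assumptions, not a standard archival schema.
    """
    return {
        "identifier": str(uuid.uuid4()),
        "created": datetime.now(timezone.utc).isoformat(),
        "creator": creator,
        "description": description,
        "history": [],                 # provenance events accrue here over time
        "payload_size": len(payload),
    }

record = package_with_metadata(b"simulation output ...", "lab-42", "test run")
record["history"].append({"event": "used-by", "target": "paper-2024-draft"})
print(json.dumps(record, indent=2))
```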
Data source
A data source is any of the following types of sources for (mostly) digitized data: a database, a computer file, or a data stream.

A database is a structured collection of records or data that is stored in a computer system. The structure is achieved by organizing the data according to a database model. The model in most common use today is the relational model. Other models, such as the hierarchical model and the network model, use a more explicit representation of relationships.

A computer file is a block of arbitrary information, or a resource for storing information, which is available to a computer program and is usually based on some kind of durable storage. A file is durable in the sense that it remains available for programs to use after the current program has finished. Computer files can be considered the modern counterpart of paper documents, which traditionally were kept in offices' and libraries' files; this is the source of the term.

In telecommunications and computing, a data stream is a sequence of digitally encoded coherent signals (packets of data) used to transmit or receive information that is in transmission. In electronics and computer architecture, a data stream determines which data item is scheduled to enter or leave which port of a systolic array, a reconfigurable data path array or similar pipe network, or other processing unit or block, at which time. The data stream is often seen as the counterpart of an instruction stream, since the von Neumann machine is instruction-stream-driven, whereas its counterpart, the anti machine, is data-stream-driven. The term "data stream" has other meanings as well, such as the definition used in the context of systolic arrays. Formally, a data stream can be defined as an ordered pair (s, Δ) where s = (s_1, s_2, ...) is a sequence of tuples and Δ = (Δ_1, Δ_2, ...) is a sequence of time intervals (rational or real numbers) with each Δ_i > 0.
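The three kinds of data source listed above can be illustrated side by side in a short sketch: an in-memory SQLite database, an in-memory stand-in for a computer file, and a generator standing in for a data stream whose items are processed as they arrive. The table name and sample values are invented.

```python
import io
import sqlite3

# 1) A database as a data source (an in-memory SQLite database here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
db.execute("INSERT INTO readings VALUES ('t1', 21.5)")
print(db.execute("SELECT * FROM readings").fetchall())

# 2) A computer file as a data source (io.StringIO stands in for a real file).
fake_file = io.StringIO("line 1\nline 2\n")
print(fake_file.read().splitlines())

# 3) A data stream as a data source: values arrive one at a time and are
#    processed as they arrive rather than read in bulk.
def stream():
    for packet in (10, 20, 30):
        yield packet

print(sum(stream()))
```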

DATA CLASSIFICATION:
Data classification is the determination of class intervals and class boundaries in the data to be mapped, and it depends in part on the number of observations. Most maps are designed with 4-6 classes; with more observations a larger number of classes may be chosen, but too many classes are also undesirable, since they make the map difficult to interpret. There are four classification methods for making a graduated color or graduated symbol map, and each produces a different pattern in the map display:
1) Natural Breaks Classification
2) Quantile Classification
3) Equal Interval Classification
4) Standard Deviation Classification

Natural Breaks Classification


It is a manual data classification method that divides data into classes based on the natural groups in the data distribution. It uses a statistical formula (Jenks optimization) that calculates groupings of data values based on data distribution, and also seeks to reduce variance within groups and maximize variance between groups.


This method involves some subjective judgment, and it is the best choice for grouping similar values. However, because the class ranges are specific to the individual dataset, it is difficult to compare one map with another, and it can be hard to choose the optimum number of classes, especially if the data is evenly distributed.
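The objective behind the Jenks optimization can be illustrated with a brute-force sketch that tries every possible set of class breaks on a tiny invented data set and keeps the one with the smallest total within-class variance. This only illustrates the idea; the real Jenks algorithm is more efficient, and this exhaustive search is far too slow for real data.

```python
from itertools import combinations
from statistics import pvariance

def natural_breaks(values, n_classes):
    """Exhaustively choose class breaks that minimize within-class variance.

    A brute-force illustration of the Jenks objective, usable only for
    very small data sets.
    """
    data = sorted(values)
    best_classes, best_score = None, float("inf")
    # Choose n_classes - 1 cut positions between consecutive sorted values.
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        bounds = (0, *cuts, len(data))
        classes = [data[bounds[i]:bounds[i + 1]] for i in range(n_classes)]
        score = sum(pvariance(c) * len(c) for c in classes)
        if score < best_score:
            best_classes, best_score = classes, score
    return best_classes

print(natural_breaks([1, 2, 3, 10, 11, 12, 30, 31], 3))
```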

Quantile Classification
The quantile classification method distributes a set of values into groups that contain an equal number of values. Because it places the same number of data values in each class, it never produces empty classes or classes with too few or too many values. It is attractive in that it always produces distinct map patterns.
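A minimal sketch of quantile classification using Python's standard statistics module follows; the sample values are invented, and the class counts may differ slightly depending on ties and on where the breaks fall.

```python
from statistics import quantiles

values = [2, 4, 4, 5, 7, 8, 12, 15, 21, 30, 44, 58]   # invented sample values

# Break points that put roughly the same number of observations in each of
# the four classes.
breaks = quantiles(values, n=4, method="inclusive")
print("class breaks:", breaks)

classes = [[], [], [], []]
for v in values:
    index = sum(v > b for b in breaks)      # how many breaks the value exceeds
    classes[index].append(v)
print("classes:", classes)
```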
Equal Interval Classification
The equal interval classification method divides a set of attribute values into groups that each span an equal range of values. This method communicates well for continuous data, and a map designed using equal interval classification is easy to produce and to read. It is, however, not a good choice for clustered data, because clustering can leave many features in one or two classes and other classes with no features at all.
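Equal interval breaks are straightforward to compute, as the short sketch below shows; the sample values are invented.

```python
def equal_interval_breaks(values, n_classes):
    """Upper class boundaries for classes that each span the same value range."""
    low, high = min(values), max(values)
    width = (high - low) / n_classes
    return [low + width * i for i in range(1, n_classes + 1)]

values = [2, 4, 5, 7, 8, 12, 15, 21, 30, 44, 58]   # invented sample values
print(equal_interval_breaks(values, 4))            # [16.0, 30.0, 44.0, 58.0]
```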

Standard Deviation Classification


The standard deviation classification method finds the mean value and then places class breaks above and below the mean at intervals of 0.25, 0.5, or one standard deviation until all the data values are contained within the classes. Values that are more than three standard deviations from the mean are aggregated into two classes: greater than three standard deviations above the mean, and less than three standard deviations below the mean.
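A short sketch of standard deviation classification follows, placing breaks one standard deviation apart around the mean of an invented sample; the step could equally be 0.25 or 0.5 as described above.

```python
from statistics import mean, pstdev

def std_dev_breaks(values, step=1.0, limit=3.0):
    """Class breaks placed `step` standard deviations either side of the mean.

    Values more than `limit` standard deviations from the mean would fall
    into the two open-ended outer classes described above.
    """
    m, s = mean(values), pstdev(values)
    offsets, k = [], step
    while k <= limit:
        offsets.extend([-k, k])
        k += step
    return sorted(m + o * s for o in offsets)

values = [2, 4, 5, 7, 8, 12, 15, 21, 30, 44, 58]   # invented sample values
print([round(b, 1) for b in std_dev_breaks(values, step=1.0)])
```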
