Organizing Files for Performance

Objectives
• Study different methods for data compression.
• Study storage compaction as a simple way to reuse space in a file.
• Study procedures for deleting fixed-length records that allow the freed space to be reassigned dynamically.
• Use linked lists and stacks to manage lists of available space in files.
• Study different methods for solving the problem of deleting variable-length records in a file.
Contents
6.1 Data compression
6.2 Reclaiming space in files
6.3 Finding things quickly: An Introduction to internal sorting and binary searching
6.4 Keysorting
Data Compression(1)
• Reasons for data compression
– files use less storage
– files transmit faster, decreasing access time
– files can be processed faster sequentially
Data Compression(2)
:Using a different notation
• Fixed-length fields are good candidates
• Decrease the number of bits by finding a more compact notation
ex) a state field stored as two ASCII characters takes 16 bits, but there are only 50 states, so a 6-bit code is enough (2^6 = 64 >= 50)
• Cons.
– unreadable by humans
– cost in encoding time
– decoding modules increase the complexity of the software
=> used only for particular applications
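The compact-notation idea above can be sketched as a table lookup. This is only an illustration: the table order, the function names encodeState and decodeState, and the use of a code's table position as its value are assumptions, not something the slides specify.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Two-letter postal abbreviations of the 50 U.S. states (illustrative table).
// A code's position in the table, 0..49, is its compact value and fits in 6 bits.
static const std::array<const char*, 50> kStates = {{
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"}};

// Returns the compact code for an abbreviation, or -1 if it is not in the table.
int encodeState(const std::string& abbr) {
    for (std::size_t i = 0; i < kStates.size(); ++i)
        if (abbr == kStates[i]) return static_cast<int>(i);
    return -1;
}

// Returns the abbreviation for a compact code, or "" if the code is out of range.
std::string decodeState(int code) {
    if (code < 0 || code >= static_cast<int>(kStates.size())) return "";
    return kStates[static_cast<std::size_t>(code)];
}
```

The linear search in encodeState is the "cost in encoding time" listed above, and the table itself is the extra decoding module every reader of the file needs.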
Data Compression(3)
:Suppressing repeating sequences
• Run-length encoding algorithm
– read through the pixels, copying each pixel value to the file in sequence, except where the same pixel value occurs more than once in succession
– where the same value occurs more than once in succession, substitute the following three bytes:
  1. a special run-length code indicator (ex: ff)
  2. the pixel value that is repeated
  3. the number of times that value is repeated
• ex) original: 22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
      encoded:  22 23 ff 24 07 25 ff 26 06 25 24
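A minimal sketch of the three-byte scheme, treating the values in the example as hexadecimal bytes. Some details are assumptions the slides do not state: runs are capped at 255 so the count fits in one byte, and a literal ff in the data is always written as a run so the decoder cannot confuse it with the indicator. The names rleEncode and rleDecode are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode: single values are copied; a run of two or more becomes
// (indicator, value, count), as described on the slide.
std::vector<std::uint8_t> rleEncode(const std::vector<std::uint8_t>& in,
                                    std::uint8_t indicator = 0xff) {
    std::vector<std::uint8_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
        if (run > 1 || in[i] == indicator) {
            out.push_back(indicator);
            out.push_back(in[i]);
            out.push_back(static_cast<std::uint8_t>(run));
        } else {
            out.push_back(in[i]);
        }
        i += run;
    }
    return out;
}

// Decode: expand every (indicator, value, count) triple, copy everything else.
std::vector<std::uint8_t> rleDecode(const std::vector<std::uint8_t>& in,
                                    std::uint8_t indicator = 0xff) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == indicator && i + 2 < in.size()) {
            out.insert(out.end(), static_cast<std::size_t>(in[i + 2]), in[i + 1]);
            i += 2;
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

With this rule a run of exactly two values costs three bytes, so an input such as 01 01 02 02 grows from four bytes to six.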
Data Compression(3)
:Suppressing repeating sequences
• Run-length encoding (cont’d)
– an example of redundancy reduction
– cons.
• does not guarantee any particular amount of space savings
• under some circumstances, the compressed image is larger than the original image
– Why? Can you prevent this?
Data Compression(4)
:Assigning variable-length codes
• Morse code: the oldest and most common variable-length code
• Some values occur more frequently than others
– those values should take the least amount of space
• Huffman coding
– based on probability of occurrence
• determine the probability of each value occurring
• build a binary tree with a search path for each value
• more frequently occurring values are given shorter search paths in the tree
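The three Huffman steps above can be sketched as follows. This is one possible implementation under stated assumptions: probabilities are passed as integer weights (0.4 becomes 4, 0.1 becomes 1), the tree is stored as an index-based vector, and HuffNode and buildHuffmanCodes are hypothetical names. Ties between equal weights may be broken differently from a hand-drawn tree, so individual codes can differ while the average length stays the same.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Tree node stored by index; leaves have left == right == -1.
struct HuffNode {
    long weight;
    int left;
    int right;
    char symbol;
};

// Build a code table from (symbol, weight) pairs.
std::map<char, std::string> buildHuffmanCodes(
        const std::vector<std::pair<char, long>>& weights) {
    std::vector<HuffNode> nodes;
    using Item = std::pair<long, int>;  // (weight, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (const auto& w : weights) {
        nodes.push_back({w.second, -1, -1, w.first});
        heap.push({w.second, static_cast<int>(nodes.size()) - 1});
    }
    // Repeatedly join the two lightest subtrees under a new parent.
    while (heap.size() > 1) {
        Item a = heap.top(); heap.pop();
        Item b = heap.top(); heap.pop();
        nodes.push_back({a.first + b.first, a.second, b.second, '\0'});
        heap.push({a.first + b.first, static_cast<int>(nodes.size()) - 1});
    }
    std::map<char, std::string> codes;
    if (heap.empty()) return codes;
    // Walk the tree: going left appends '0', going right appends '1'.
    std::vector<std::pair<int, std::string>> pending;
    pending.push_back(std::make_pair(heap.top().second, std::string()));
    while (!pending.empty()) {
        std::pair<int, std::string> cur = pending.back();
        pending.pop_back();
        const HuffNode& n = nodes[static_cast<std::size_t>(cur.first)];
        if (n.left == -1) {
            // A lone symbol still needs one bit.
            codes[n.symbol] = cur.second.empty() ? "0" : cur.second;
        } else {
            pending.push_back(std::make_pair(n.left, cur.second + "0"));
            pending.push_back(std::make_pair(n.right, cur.second + "1"));
        }
    }
    return codes;
}
```

For one letter of weight 4 and six letters of weight 1, any Huffman tree gives a total weighted code length of 26, that is 2.6 bits per letter on average.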
Data Compression(5)
:Assigning variable-length codes
• Huffman coding
 Letter:  a     b     c     d     e     f     g
 Prob:    0.4   0.1   0.1   0.1   0.1   0.1   0.1
 Code:    1     010   011   0000  0001  0010  0011
 ex) the string “abde” is encoded as 1 010 0000 0001 = 101000000001
Huffman Tree (each level down adds one bit, 0 or 1)
root
  1: a(1)
  0
    00
      000
        d(0000)  e(0001)
      001
        f(0010)  g(0011)
    01
      b(010)  c(011)
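A short sketch of encoding and decoding with the code table above. The function names huffEncode and huffDecode are hypothetical, and skipping letters outside the table is an assumption made to keep the example small.

```cpp
#include <map>
#include <string>

// Code table copied from the slide.
static const std::map<char, std::string> kCodes = {
    {'a', "1"},    {'b', "010"},  {'c', "011"}, {'d', "0000"},
    {'e', "0001"}, {'f', "0010"}, {'g', "0011"}};

// Concatenate the code of each letter; letters outside the table are skipped.
std::string huffEncode(const std::string& text) {
    std::string bits;
    for (char ch : text) {
        auto it = kCodes.find(ch);
        if (it != kCodes.end()) bits += it->second;
    }
    return bits;
}

// No code is a prefix of another, so the decoder can emit a letter as soon as
// the bits read so far match a table entry.
std::string huffDecode(const std::string& bits) {
    std::string text;
    std::string current;
    for (char bit : bits) {
        current += bit;
        for (const auto& entry : kCodes) {
            if (entry.second == current) {
                text += entry.first;
                current.clear();
                break;
            }
        }
    }
    return text;
}
```

Encoding "abde" this way yields 101000000001, the bit string shown in the example, with no separators needed between letters.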

Data Compression(6)
:Irreversible compression techniques
• Some information can be sacrificed
• Less common in data files
• Shrinking raster image
– 400-by-400 pixels to 100-by-100 pixels
– 1 pixel for every 16 pixels
• Speech compression
– voice coding (the lost information is of little or no value)
Compression in UNIX
• System V
– pack & unpack use Huffman codes
– after compressing a file, pack appends “.z” to the end of the packed file's name
• Berkeley UNIX
– compress & uncompress use the Lempel-Ziv method
– after compressing a file, compress appends “.Z” to the end of the compressed file's name
Record Deletion and Storage Compaction
• Storage compaction
– record deletion: just mark each deleted record
– compaction: reclaim the space of all deleted records at once
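A minimal sketch of mark-then-compact, holding the file in memory as a vector of records. The deletion mark (an asterisk in the first byte) and the names isDeleted and compact are assumptions for illustration.

```cpp
#include <string>
#include <vector>

// Assumption: a record is deleted by overwriting its first byte with '*'.
bool isDeleted(const std::string& record) {
    return !record.empty() && record[0] == '*';
}

// Storage compaction: copy every live record to a new file image,
// skipping the records that carry the deletion mark.
std::vector<std::string> compact(const std::vector<std::string>& file) {
    std::vector<std::string> out;
    for (const auto& record : file)
        if (!isDeleted(record)) out.push_back(record);
    return out;
}
```

Deletion itself stays cheap (one byte is overwritten), while the space is only recovered when the compaction pass runs.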
Deleting Fixed-length Records for Reclaiming Space Dynamically(1)
• Reuse the space from deleted records as soon as possible
– deleted records must be marked in a special way
– we must be able to find the deleted space
• To reuse records quickly, we need
– a way to know immediately if there are empty slots in the file
– a way to jump directly to one of those slots if they exist
=> Linked lists or stacks for the avail list
* avail list: a list made up of the deleted records
Deleting Fixed-length Records for Reclaiming Space Dynamically(2)
[Figure: the avail list organized as a linked list and as a stack]
Deleting Fixed-length Records for Reclaiming Space Dynamically(3)
• Linking and stacking deleted records
– links are arranged and rearranged so that each available record slot points to the next
– the second field of a deleted record points to the next available record
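Sample file showing linked list of deleted records

(a) After deleting record 3: List head (first available record) => 3
RRN     0               1               2               3               4               5               6
Record  Edwards...      Bates...        Wills...        *-1             Maters...       Browns...       Chavez...

(b) After deleting record 5: List head (first available record) => 5
RRN     0               1               2               3               4               5               6
Record  Edwards...      Bates...        Wills...        *-1             Maters...       *3              Chavez...

(c) After deleting record 1: List head (first available record) => 1
RRN     0               1               2               3               4               5               6
Record  Edwards...      *5              Wills...        *-1             Maters...       *3              Chavez...

(d) After adding three new records: List head (first available record) => -1
RRN     0               1               2               3               4               5               6
Record  Edwards...      1st new rec...  Wills...        3rd new rec...  Maters...       2nd new rec...  Chavez...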
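The sample file above can be reproduced with a small in-memory sketch of the avail list used as a stack. The struct FixedFile and its methods are hypothetical names, and a real implementation would seek to RRN times the record size on disk rather than index a vector.

```cpp
#include <string>
#include <vector>

// Fixed-length record file held in memory for illustration.
struct FixedFile {
    std::vector<std::string> slots;  // slot contents, indexed by RRN
    int availHead = -1;              // RRN of the first available slot, -1 if none

    // Mark the slot as deleted and push it on the avail stack: the slot
    // stores '*' followed by the RRN of the previous list head.
    void remove(int rrn) {
        slots[static_cast<std::size_t>(rrn)] = "*" + std::to_string(availHead);
        availHead = rrn;
    }

    // Pop a slot from the avail stack if there is one; otherwise append.
    // Returns the RRN where the record was stored.
    int add(const std::string& record) {
        if (availHead == -1) {
            slots.push_back(record);
            return static_cast<int>(slots.size()) - 1;
        }
        int rrn = availHead;
        availHead = std::stoi(slots[static_cast<std::size_t>(rrn)].substr(1));
        slots[static_cast<std::size_t>(rrn)] = record;
        return rrn;
    }
};
```

Deleting records 3, 5 and 1 in that order and then adding three records fills slots 1, 5 and 3, the order shown in the last panel of the sample file.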
Deleting Variable-length Records
• Avail list of variable-length records
– each record has its byte count at the beginning
– links use byte offsets instead of RRNs
• Adding and removing records
– when adding a record, search through the avail list for a slot of the right size (=> big enough)
Removal of a record from an avail list with variable-length records
(a) Before removal
head -> [Size 47] -> [Size 38] -> [Size 72] -> [Size 68] -> -1
(b) After removal
head -> [Size 47] -> [Size 38] -> [Size 68] -> -1   (new link from the size-38 slot to the size-68 slot)
Removed record: [Size 72]
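The removal shown in the figure can be sketched as unlinking a node from a singly linked list. Here the links live in an in-memory map keyed by byte offset, standing in for links stored inside the file; the offsets used with it, the struct names, and takeFirstFit are hypothetical.

```cpp
#include <map>

// One deleted slot on the avail list.
struct AvailSlot {
    int size;   // byte count of the slot
    long next;  // byte offset of the next available slot, -1 at the end
};

struct AvailList {
    long head = -1;
    std::map<long, AvailSlot> slots;  // keyed by byte offset in the file

    // Walk the list and unlink the first slot that is big enough.
    // Returns its byte offset, or -1 if none fits (the caller would then
    // append the record at the end of the file).
    long takeFirstFit(int needed) {
        long prev = -1;
        for (long cur = head; cur != -1; prev = cur, cur = slots[cur].next) {
            if (slots[cur].size >= needed) {
                long next = slots[cur].next;
                if (prev == -1) head = next;
                else slots[prev].next = next;
                slots.erase(cur);
                return cur;
            }
        }
        return -1;
    }
};
```

With slots of sizes 47, 38, 72 and 68 on the list, a request for 55 bytes skips the first two slots, takes the 72-byte slot, and relinks the 38-byte slot to the 68-byte slot, as in panel (b).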
Storage Fragmentation
• Internal fragmentation (in fixed-length records)
– wasted space within a record
– variable-length records minimize wasted space by doing away with internal fragmentation
• External fragmentation (in variable-length records)
– unused space outside or between individual records
– three possible solutions
  1. storage compaction
  2. coalescing the holes: combine adjacent free slots into a single, larger record slot
  3. minimizing fragmentation by adopting a placement strategy
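Coalescing the holes can be sketched as sorting the free slots by byte offset and merging neighbours that touch. The Hole struct and the coalesce function are hypothetical names, and the avail list is represented as a plain vector for brevity.

```cpp
#include <algorithm>
#include <vector>

// A free region of the file: starting byte offset and length in bytes.
struct Hole {
    long offset;
    long size;
};

// Merge holes that are physically adjacent (one ends where the next begins).
std::vector<Hole> coalesce(std::vector<Hole> holes) {
    std::sort(holes.begin(), holes.end(),
              [](const Hole& a, const Hole& b) { return a.offset < b.offset; });
    std::vector<Hole> merged;
    for (const Hole& h : holes) {
        if (!merged.empty() &&
            merged.back().offset + merged.back().size == h.offset) {
            merged.back().size += h.size;  // extend the previous hole
        } else {
            merged.push_back(h);
        }
    }
    return merged;
}
```

The sort is needed because an avail list ordered by deletion time or by size says nothing about which holes are neighbours in the file.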
Internal Fragmentation in Fixed-length Records
64-byte fixed-length records; the dots mark unused space (internal fragmentation)
Ames | John | 123 Maple | Stillwater | OK | 740751 |...................................
Morrison | Sebastian | 9035 South Hillcrest | Forest Village | OK | 74820 |
Brown | Martha | 625 Kimbark | Des Moines | IA | 50311 | .........................
External Fragmentation in Variable-length Records
Each record is preceded by its record length:
Record[1]: 40 Ames | John | 123 Maple | Stillwater | OK | 740751 |
Record[2]: 64 Morrison | Sebastian | 9035 South Hillcrest | Forest Village | OK | 74820 |
Record[3]: 45 Brown | Martha | 625 Kimbark | Des Moines | IA | 50311 |
ex) Delete Record[2] (64 bytes) and insert the new 52-byte Record[i] in its place: 12 bytes of unused space remain (external fragmentation)
Record[i]: 52 Adams | Kits | 3301 Washington D.C | Forest Village | IA | 43563 |
Placement Strategies
• First-fit
– select the first available record slot that is big enough
– suitable when lost space is due to internal fragmentation
• Best-fit
– select the available record slot closest in size
– keep the avail list in ascending order of size
– suitable when lost space is due to internal fragmentation
• Worst-fit
– select the largest record slot
– keep the avail list in descending order of size
– suitable when lost space is due to external fragmentation
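The three strategies differ only in which slot they pick. The sketch below scans an unsorted vector of slot sizes so the choices are easy to compare; the function names are hypothetical, and with the list kept in ascending or descending order as noted above, best-fit and worst-fit would simply take the first slot that fits.

```cpp
#include <cstddef>
#include <vector>

// Each function returns the index of the chosen slot in `sizes`
// (the slot sizes on the avail list, in list order), or -1 if none fits.

int firstFit(const std::vector<int>& sizes, int needed) {
    for (std::size_t i = 0; i < sizes.size(); ++i)
        if (sizes[i] >= needed) return static_cast<int>(i);
    return -1;
}

int bestFit(const std::vector<int>& sizes, int needed) {
    int best = -1;
    for (std::size_t i = 0; i < sizes.size(); ++i)
        if (sizes[i] >= needed &&
            (best == -1 || sizes[i] < sizes[static_cast<std::size_t>(best)]))
            best = static_cast<int>(i);  // smallest slot that still fits
    return best;
}

int worstFit(const std::vector<int>& sizes, int needed) {
    int worst = -1;
    for (std::size_t i = 0; i < sizes.size(); ++i)
        if (sizes[i] >= needed &&
            (worst == -1 || sizes[i] > sizes[static_cast<std::size_t>(worst)]))
            worst = static_cast<int>(i);  // largest slot overall
    return worst;
}
```

For slots of 47, 38, 72 and 68 bytes and a 60-byte record, first-fit and worst-fit both pick the 72-byte slot, while best-fit picks the 68-byte slot and leaves only 8 bytes over.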
Finding Things Quickly(1)
• Goal: Minimize the number of disk accesses
• Finding things in simple field and record files may require many seeks
• Binary search algorithm for fixed-size records
int BinarySearch(FixedRecordFile &file, RecType &obj, KeyType &key)
// binary search for key; on success obj holds the matching record
{
    int low = 0; int high = file.NumRecs() - 1;
    while (low <= high) {
        int guess = low + (high - low) / 2; // midpoint of the remaining range
        file.ReadByRRN(obj, guess);
        if (obj.Key() == key) return 1; // record found
        if (obj.Key() < key) low = guess + 1; // search after guess
        else high = guess - 1; // search before guess
    }
    return 0; // loop ended without finding key
}
Classes and Methods for Binary Search
class KeyType {public:
    int operator == (KeyType &);
    int operator < (KeyType &);
};
class RecType {public: KeyType Key();};
class FixedRecordFile {public:
    int NumRecs();
    int ReadByRRN (RecType & Record, int RRN);
};
• Binary search vs. sequential search
– binary search
• O(log n)
• requires the list to be sorted by key
– sequential search
• O(n)
• Sorting a disk file in RAM
– read the entire file from disk into memory
– use an internal sort (= sort in memory)
• the UNIX sort utility uses an internal sort
• Limitations of binary search & internal sort
– binary search requires more than one or two accesses (cf. a single access when reading by RRN)
– keeping a file sorted is very expensive
– an internal sort works only on small files
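To make the O(n) versus O(log n) comparison concrete, the sketch below counts how many records each search reads. It works on an in-memory vector of sorted keys as a stand-in for a sorted file, with each vector access counted as one record read; sequentialReads and binaryReads are hypothetical names.

```cpp
#include <string>
#include <vector>

// Sequential search: reads records from the start until the key is found.
int sequentialReads(const std::vector<std::string>& sortedKeys,
                    const std::string& key) {
    int reads = 0;
    for (const auto& k : sortedKeys) {
        ++reads;
        if (k == key) break;
    }
    return reads;
}

// Binary search: reads the middle record of the remaining range each time.
int binaryReads(const std::vector<std::string>& sortedKeys,
                const std::string& key) {
    int reads = 0;
    int low = 0;
    int high = static_cast<int>(sortedKeys.size()) - 1;
    while (low <= high) {
        int guess = low + (high - low) / 2;
        ++reads;
        const std::string& current = sortedKeys[static_cast<std::size_t>(guess)];
        if (current == key) break;
        if (current < key) low = guess + 1;
        else high = guess - 1;
    }
    return reads;
}
```

On a file of 1000 records the sequential search may read all 1000, while the binary search never reads more than 10. Each of those reads can still be a separate seek, which is the limitation noted above.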
Internal Sort
[Figure: the entire unsorted file is read from disk into memory, then sorted in memory to give a sorted file]
Key Sorting & Its Limitations
• So-called “tag sort”: only the keys are sorted
• Sorting procedure
 1. Read only the keys into memory
 2. Sort the keys
 3. Rearrange the records in the file according to the sorted keys
• Advantage
– needs less RAM than an internal sort
• Disadvantages (= limitations)
– the records on disk must be read twice
– a lot of seeking is needed to fetch the records when constructing the new (sorted) file
KEYNODES array
Conceptual view before sorting
KEYNODES array (in RAM)    Records (on secondary storage)
KEY         RRN            RRN   Record
HARRISON    1              1     Harrison|Susan|387 Eastern....
KELLOG      2              2     Kellog|Bill|17 Maple....
HARRIS      3              3     Harris|Margaret|4343 West....
.           .              .     .
BELL        k              k     Bell|Robert|8912 Hill....

Conceptual view after sorting keys in RAM
KEYNODES array (in RAM)    Records (on secondary storage, unchanged)
KEY         RRN            RRN   Record
BELL        k              1     Harrison|Susan|387 Eastern....
HARRIS      3              2     Kellog|Bill|17 Maple....
HARRISON    1              3     Harris|Margaret|4343 West....
.           .              .     .
KELLOG      2              k     Bell|Robert|8912 Hill....
Pseudocode for keysort(1)
 Program: keysort
 open input file as IN_FILE
 create output file as OUT_FILE
 read header record from IN_FILE and write a copy to OUT_FILE
 REC_COUNT := record count from header record
 /* read in records; set up KEYNODES array */
 for i := 1 to REC_COUNT
     read record from IN_FILE into BUFFER
     extract canonical key and place it in KEYNODES[i].KEY
     KEYNODES[i].RRN := i
 (continued....)
Pseudocode for keysort(2)
 /* sort KEYNODES[].KEY, thereby ordering RRNs correspondingly */
 sort(KEYNODES, REC_COUNT)
 /* read in records according to sorted order, and write them out in this order */
 for i := 1 to REC_COUNT
     seek in IN_FILE to record with RRN of KEYNODES[i].RRN
     read record from IN_FILE into BUFFER
     write BUFFER contents to OUT_FILE
 close IN_FILE and OUT_FILE
 end PROGRAM
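The keysort pseudocode can be sketched in C++ with the input file held as a vector of records. Several details are assumptions: RRNs are 0-based here, there is no header record, the canonical key is taken to be the first '|'-delimited field in upper case, and KeyNode, canonicalKey and keysort are hypothetical names.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct KeyNode {
    std::string key;  // canonical key
    int rrn;          // position of the record in the input file
};

// Assumption: the key is the first '|'-delimited field, upper-cased.
std::string canonicalKey(const std::string& record) {
    std::string key = record.substr(0, record.find('|'));
    for (char& c : key)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return key;
}

// Returns the records in key order; inFile itself is left untouched.
std::vector<std::string> keysort(const std::vector<std::string>& inFile) {
    // First pass: read every record, keep only (key, RRN).
    std::vector<KeyNode> keynodes;
    for (std::size_t i = 0; i < inFile.size(); ++i)
        keynodes.push_back({canonicalKey(inFile[i]), static_cast<int>(i)});

    // Sort the small key nodes instead of the full records.
    std::sort(keynodes.begin(), keynodes.end(),
              [](const KeyNode& a, const KeyNode& b) { return a.key < b.key; });

    // Second pass: one access ("seek") per record, in sorted key order.
    std::vector<std::string> outFile;
    for (const KeyNode& node : keynodes)
        outFile.push_back(inFile[static_cast<std::size_t>(node.rrn)]);
    return outFile;
}
```

The second loop is where the cost lies on a real disk: the RRNs arrive in key order, not file order, so each record read can require its own seek.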
Two Solutions
:why bother to write the file back?
• Write out the sorted KEYNODES[] array without writing the records back in sorted order
• Use the KEYNODES[] array as an index file

Relationship between the index file and the data file
Index file                 Original file
KEY         RRN            RRN   Record
BELL        k              1     Harrison|Susan|387 Eastern....
HARRIS      3              2     Kellog|Bill|17 Maple....
HARRISON    1              3     Harris|Margaret|4343 West....
.           .              .     .
KELLOG      2              k     Bell|Robert|8912 Hill....
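Using the sorted key array as an index might look like the sketch below: a binary search over the index in memory yields the RRN, and the data file is then read once at that RRN. IndexEntry and lookupRRN are hypothetical names, and the index is assumed to be fully loaded in memory.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct IndexEntry {
    std::string key;
    int rrn;  // where the record lives in the unsorted data file
};

// Binary-search the sorted index for `key`.
// Returns the RRN to read from the data file, or -1 if the key is absent.
int lookupRRN(const std::vector<IndexEntry>& index, const std::string& key) {
    auto it = std::lower_bound(
        index.begin(), index.end(), key,
        [](const IndexEntry& entry, const std::string& k) { return entry.key < k; });
    if (it != index.end() && it->key == key) return it->rrn;
    return -1;
}
```

The data file keeps its original order, so records never move and only the index has to be rewritten when keys are added.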

Pinned records(1)
• Records that other records reference by their physical location
• Their physical location cannot be changed freely, or the references to them would be left dangling
• Pinned records make sorting more difficult and sometimes impossible
– solution: use an index file, while keeping the actual data file in its original order
Pinned records(2)
[Figure: file with pinned records. Record(i) points to the pinned Record(i+1); deleting the pinned record leaves Record(i) with a dangling pointer]
Let’s Review !!!
6.1 Data compression
6.2 Reclaiming space in files
6.3 Finding things quickly: An Introduction to internal sorting and binary searching
6.4 Keysorting
