Text Processing with Ruby (Preview)


Part I

Extract: Acquiring Text

The first part of our text processing journey is concerned with getting text into our
program. This text might reside in files, might be entered by the user, or might come
from other processes; wherever it comes from, we’ll learn how to read it.

We’ll also look at taking structure from the text that we read, learning how to parse
CSV files and even scrape information from web pages.
CHAPTER 1

Reading from Files


Our first concern when processing text is to get the text into our program,
and perhaps the most common place to source text is from the humble file.
Whether it’s log files from a server, exports from a database, or text you’ve
written yourself, there’s lots of information that lives on the filesystem.
Learning to read from files effectively opens up a world of text to process.

Throughout the course of this chapter, we’ll look at how we can use Ruby to
reach text that resides in files. We’ll look at the basics you might expect, with
some methods to straightforwardly read files in one go. We’ll then look at a
technique that will allow us to read even the biggest files in a memory-efficient
way, by treating files as streams, and look at how this can give us random
access into even the largest files. Let’s take a look.

Opening a File
Before we can do something with a file, we need to open it. This signals our
intent to read from or write to the file, allowing Ruby to do the low-level work that
makes that intention actually happen on the filesystem. Once it’s done those
things, Ruby gives us a File object that we can use to manipulate the file.

Once we have this File object, we can do all sorts of things with it: read from
the file, write to it, inspect its permissions, find out its path on the filesystem,
check when it was last modified, and much more.

To open a file in Ruby, we use the open method of the File class, telling it the
path to the file we’d like to open. We pass a block to the open method, in which
we can do whatever we like with the file. Here’s an example:
File.open("file.txt") do |file|
  # ...
end

report erratum • discuss


Chapter 1. Reading from Files •4

Because we passed a block to open, Ruby will automatically close the file for
us after the block finishes, freeing us from doing that cleanup work ourselves.
The argument that open passes to our block, which in this example I’ve called
file, is a File object that points to the file we’ve requested access to (in this case,
file.txt). Unless we tell Ruby otherwise, it will open files in read-only mode, so
we can’t write to them accidentally—a safe default.
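
Both behaviours can be checked with a minimal sketch. It assumes a throwaway file created with Ruby's Tempfile library; the sample text and the variable names are illustrative and not part of the original example.

```ruby
require "tempfile"

# Create a scratch file so the example is self-contained.
scratch = Tempfile.new("example")
scratch.write("hello\n")
scratch.close

handle = nil
write_error = nil

File.open(scratch.path) do |file|
  handle = file          # keep a reference so we can inspect it afterwards
  begin
    file.write("oops")   # the file was opened read-only by default
  rescue IOError => e
    write_error = e
  end
end

puts handle.closed?      # true: the block form closed the file for us
puts write_error.class   # IOError: the write was refused

scratch.unlink
```

Keeping a reference to the File object outside the block is done here only to show that it has been closed; in ordinary code the object is used inside the block and then forgotten.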

Kernel#open
In the real world, it’s common to see people using the global open method rather than
explicitly using File.open:

open("file.txt") do |file|
  # ...
end

As well as being shorter, which is always nice, this convenient method is actually a
wrapper for a number of different types of IO objects, not just files. You can use it to
open other processes and, with the open-uri library on Ruby versions before 3.0, URLs
(newer versions use URI.open for that). We’ll cover some more uses of open later; for
now, use either File.open or regular open as you prefer.
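
As a quick check that the two are interchangeable for ordinary files, here is a small sketch. It assumes a scratch file created with Tempfile, which is not part of the original example.

```ruby
require "tempfile"

# Create a scratch file so the example is self-contained.
scratch = Tempfile.new("example")
scratch.write("hello\n")
scratch.close

# For an ordinary path, the global open yields the same kind of object
# that File.open does, and returns the value of the block.
yielded_class = open(scratch.path) { |file| file.class }
puts yielded_class   # File

scratch.unlink
```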

There’s nothing in our block yet, so this code isn’t very useful; it doesn’t
actually do anything with the file once it’s opened. Let’s take a look at how
we can read content from the file.

Reading from a File


Once we’ve opened a file, the next step is to read its contents. We’ll start with
the simplest way to do this—reading the whole file into a string, allowing us
to perform many kinds of processing with the text contained in the file. We’ll
then look at how we can break the file’s content up into lines and loop through
them, a task that’s frequently necessary when processing log files, when
processing text written by people, and in many other situations.

Reading a Whole File at Once


The easiest way to access the contents of a file in Ruby is to read the entire
file in one go. It’s not always the right solution, especially when working with
bigger files, but it makes sense in many cases.

We can achieve this by using the read method on our File object:
File.open("file.txt") do |file|
  contents = file.read
end


The read method returns a string containing the file’s entire contents,
however large the file might be.

Alternatively, if all we’re doing is reading the file and we have no further use
for the File object once we’ve done so, Ruby offers us a shortcut. There’s a read
method on the File class itself, and if we pass it the name of a file, then it will
open the file, read it, and close it for us, returning the contents:
contents = File.read("file.txt")
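
The two approaches return the same string. The sketch below illustrates this, assuming a scratch file created with Tempfile; the file's contents and the variable names are made up for the illustration.

```ruby
require "tempfile"

# Create a scratch file so the example is self-contained.
scratch = Tempfile.new("example")
scratch.write("first line\nsecond line\n")
scratch.close

# Block form: open the file, read it, and let Ruby close it.
via_block = File.open(scratch.path) { |file| file.read }

# Shortcut: File.read opens, reads, and closes in a single call.
via_shortcut = File.read(scratch.path)

puts via_block == via_shortcut   # true
puts via_shortcut.class          # String

scratch.unlink
```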

Whichever method we use, the result is that we have the entire contents of
the file stored in a string. This is useful if we want to blindly pass those con-
tents over to something else for processing—to a Markdown parser, for
example, or to insert it into a database, or to parse it as JSON. These are all
very common things to want to do, so read is a widely used method.

For example, if our file contained some JSON data, we could parse it using
Ruby’s built-in JSON library:
require "json"

json = File.read("file.json")
data = JSON.parse(json)
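
To make the result concrete, here is a self-contained variant of the same idea. The JSON document is hypothetical sample data, since the original does not show what file.json contains, and the scratch file is created with Tempfile.

```ruby
require "json"
require "tempfile"

# Hypothetical sample data written to a scratch file.
scratch = Tempfile.new(["example", ".json"])
scratch.write('{"name": "logo.png", "size": 23260}')
scratch.close

data = JSON.parse(File.read(scratch.path))

puts data.class     # Hash
puts data["name"]   # logo.png
puts data["size"]   # 23260

scratch.unlink
```

By default JSON.parse returns hashes with string keys, which is why the values are looked up with "name" rather than :name.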

Often, though, we want to do something with the contents ourselves. The
most common task we’re likely to face is to split the file into lines and do
something with each line. Let’s look at a simple way to achieve this.

Line-by-line Processing
Lots of plain-text formats—log files, for instance—use the lines of a file as a
way of structuring the content within them. In files like this, each line repre-
sents a distinct item or record. It’s about the simplest way to separate data,
but this kind of structure is more than enough for many use cases, so it’s
something you’ll run into frequently when processing text.

One example of this sort of log file that you might have encountered before
is from the popular web server Apache. For each request made to it, Apache
will log some information: things like the IP address the request came from,
the date and time that the request was made, the URL that was requested,
and so on. The end result looks like this:
127.0.0.1 - [10/Oct/2014:13:55:36] "GET / HTTP/1.1" 200 561
127.0.0.1 - [10/Oct/2014:13:55:36] "GET /images/logo.png HTTP/1.1" 200 23260
192.168.0.42 - [10/Oct/2014:14:10:21] "GET / HTTP/1.1" 200 561
192.168.0.91 - [10/Oct/2014:14:20:51] "GET /person.jpg HTTP/1.1" 200 46780
192.168.0.42 - [10/Oct/2014:14:20:54] "GET /about.html HTTP/1.1" 200 483


Let’s imagine we wanted to process this log file so that we could see all the
requests made by a certain IP address. Because each line in the file represents
one request, we need some way to loop over the lines in the file and check
whether each one matches our conditions—that is, whether the IP address
at the start of the line is the one we’re interested in.

One way to do this would be to use the readlines method on our File object. This
method reads the file in its entirety, breaking the content up into individual
lines and returning an array:
File.open("access_log") do |log_file|
  requests = log_file.readlines
end
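
One detail worth seeing is that each element returned by readlines keeps its trailing newline. The sketch below shows this, assuming a two-line scratch file created with Tempfile; the chomp: true option is assumed to be available, which requires Ruby 2.4 or later.

```ruby
require "tempfile"

# Create a small scratch file standing in for a log.
scratch = Tempfile.new("access_log")
scratch.write("first request\nsecond request\n")
scratch.close

with_newlines    = File.open(scratch.path) { |f| f.readlines }
without_newlines = File.open(scratch.path) { |f| f.readlines(chomp: true) }

p with_newlines      # ["first request\n", "second request\n"]
p without_newlines   # ["first request", "second request"]

scratch.unlink
```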

At this point, we’ve got an array—requests—that contains every line in the file.
The next step is to loop over those lines and only output the ones that match
our conditions:
File.open("access_log") do |log_file|
  requests = log_file.readlines

  requests.each do |request|
    if request.start_with?("127.0.0.1 ")
      puts request
    end
  end
end

Using each, we loop over each request. We then ask the request if it starts
with 127.0.0.1 followed by a space (so that an address such as 127.0.0.10
isn’t matched by accident), and if the response is true, we output it. Lines
that don’t start with that prefix will simply be ignored.
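
The same filter can be wrapped in a method and run against sample data. In this sketch, requests_from is a hypothetical helper name, the log is a scratch file created with Tempfile, and the 127.0.0.10 entry is an invented line added to exercise the trailing space in the prefix.

```ruby
require "tempfile"

# Hypothetical helper: returns the lines of the log at `path` that start
# with `ip` followed by a space.
def requests_from(path, ip)
  File.open(path) do |log_file|
    log_file.readlines.select { |request| request.start_with?("#{ip} ") }
  end
end

# Sample log; the third line is invented for this illustration.
log = Tempfile.new("access_log")
log.write(<<~LOG)
  127.0.0.1 - [10/Oct/2014:13:55:36] "GET / HTTP/1.1" 200 561
  192.168.0.42 - [10/Oct/2014:14:10:21] "GET / HTTP/1.1" 200 561
  127.0.0.10 - [10/Oct/2014:14:20:54] "GET /about.html HTTP/1.1" 200 483
LOG
log.close

matches = requests_from(log.path, "127.0.0.1")
puts matches   # prints only the first line of the sample log

log.unlink
```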

While this solution works, it has a problem. Because it reads the whole file
at once, it consumes an amount of memory at least equal to the size of the
file. This will hold up okay for small files, but as our log file grows, so will the
memory consumed by our script.

If you think about it, though, we don’t actually need to have the whole file in
memory to solve our problem. We’re only ever dealing with one line of the file
at any given moment, so we only really need to have that particular line in
memory. For some problems it’s necessary to read the whole file at once, but
this isn’t one of them. Let’s look at how we can rework this example so that
we only read one line at a time.
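
One possible shape for such a rework is sketched below using File.foreach, which yields a file's lines one at a time. This is an assumption about how the problem could be approached, not necessarily the technique the text goes on to use; the sample log is a scratch file created with Tempfile.

```ruby
require "tempfile"

# Sample log written to a scratch file.
log = Tempfile.new("access_log")
log.write(<<~LOG)
  127.0.0.1 - [10/Oct/2014:13:55:36] "GET / HTTP/1.1" 200 561
  192.168.0.42 - [10/Oct/2014:14:10:21] "GET / HTTP/1.1" 200 561
LOG
log.close

matches = []

# File.foreach passes each line to the block in turn instead of first
# building an array that holds every line of the file.
File.foreach(log.path) do |request|
  matches << request if request.start_with?("127.0.0.1 ")
end

puts matches

log.unlink
```

The matching lines are collected in an array here only so that they can be inspected afterwards; printing each one inside the block would avoid holding even those in memory.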
