How I Used Nokogiri like a Boss at My Day Job

Last week my manager emailed me a link to a web page with over 900 business addresses. It’s always bad when a request starts with “I know it’s a pain, but can you…” He asked me to manually add all of these addresses into an Excel spreadsheet for the company to use for marketing. I figured I had about four to five hours of mind-numbing cutting and pasting ahead of me when a lightbulb went off: Nokogiri.

The HTML page was pretty simple and formatted as follows:

<ul class="list-unstyled">
    <li class="lpr-location-name">
        <strong>One of Many Business Names</strong>
    </li>
    <li class="lpr-location-address">
        "1234 Main Street"
        <br>
        "Suite 3456"
        <br>
        "Brooklyn, NY 11215"
    </li>
    <li class="lpr-location-phone">Phone: Work 555-555-5555</li>
</ul>

The main problem I ran into was the <br> tags used to separate the street address, the suite number, and the city/state/zip line inside the address item. First I grabbed each address list using:

def get_addresses
    self.get_page.css("ul.list-unstyled li ul")
end

Then I drilled down to grab the street/suite/city/state/zip section:

base = post.css("li.lpr-location-address").text

I tried a few things to separate out this section, without success. First I tried splitting the section at the <br> tags, but they were already gone by the time I had a string to split: calling .text returns only the text content, so the tags are dropped. Next I tried to use doc.search to find and replace the tags in my ‘doc’ object, which didn’t seem to work. In the end I had to get a little hack-y: I downloaded the HTML page and replaced all the <br> tags with ^ using find and replace, so I could easily split the addresses up.
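The find-and-replace step could probably also be scripted rather than done by hand in an editor. Here is a minimal sketch, assuming the downloaded page has been read into a string. The names `raw_html` and `mark_line_breaks` are hypothetical, not part of my original script, and the crude tag strip at the end only stands in for what Nokogiri's .text does after parsing:

```ruby
# Hypothetical stand-in for the downloaded page; only one address <li> is shown.
raw_html = <<~HTML
  <li class="lpr-location-address">
    1234 Main Street
    <br>
    Suite 3456
    <br/>
    Brooklyn, NY 11215
  </li>
HTML

# Replace <br>, <br/>, and <br /> (in any letter case) with the ^ delimiter.
def mark_line_breaks(html, delimiter = "^")
  html.gsub(%r{<br\s*/?>}i, delimiter)
end

marked = mark_line_breaks(raw_html)

# Crude tag strip for demonstration only; in the real script the marked-up
# HTML would be parsed by Nokogiri and .text would do this job.
parts = marked.gsub(/<[^>]+>/, "").split("^").map(&:strip)
p parts
```

The regex is deliberately loose about whitespace and the trailing slash, since hand-written pages often mix `<br>` and `<br />`.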

After creating my local copy (since I knew the data wasn’t going to change), I grabbed the parts of each address by splitting up the data and searching through it with regex. The operation was pretty tedious, but using rubular.com helped a lot. For example, I had to pull the city out of the last element of the array while ignoring the state and zip code.

def make_addresses
  get_addresses.each do |post|
    address = Address.new
    address.suite = nil
    address.title = post.css("li.lpr-location-name").text.strip

    # The local copy has ^ in place of each <br>, so splitting gives
    # [street, suite, "City, ST zip"] or [street, "City, ST zip"].
    base = post.css("li.lpr-location-address").text
    add_array = base.split("^").map(&:strip)

    if add_array.length == 3
      address.street = add_array[0]
      address.suite = add_array[1]
    elsif add_array.length <= 2
      address.street = add_array[0]
    end

    # The last element holds the city, state, and zip code.
    address.city = add_array.last[/^[A-Z][a-z]+\s??[A-Z]?[a-z]+/]
    address.state = add_array.last[/[A-Z]{2}/]
    address.zip = add_array.last[/\d{5}/]
    address.phone = post.css("li.lpr-location-phone").text.gsub("Phone: Work ", "").strip
  end
end
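The three separate regexes above could likely be collapsed into one pattern with named captures. This is only a sketch, assuming every last line looks like "City, ST 12345" with a comma after the city. The constant `CITY_STATE_ZIP` and the helper `parse_city_line` are hypothetical names, not part of my script:

```ruby
# Pattern for a "City, ST 12345" line; assumes a comma after the city,
# a two-letter state, and a five-digit zip (optionally ZIP+4).
CITY_STATE_ZIP = /\A(?<city>[^,]+),\s*(?<state>[A-Z]{2})\s+(?<zip>\d{5}(?:-\d{4})?)\z/

# Returns a hash with :city, :state, and :zip, or nil when the line
# does not match the assumed format.
def parse_city_line(line)
  match = CITY_STATE_ZIP.match(line.strip)
  return nil unless match

  { city: match[:city].strip, state: match[:state], zip: match[:zip] }
end

p parse_city_line("Brooklyn, NY 11215")
p parse_city_line("  New York, NY 10001 ")
p parse_city_line("not an address")
```

Because the city is "everything before the comma", this version does not care how many words the city name has, and a nil result makes malformed lines easy to spot.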

Finally, I used Ruby’s CSV class (require 'csv') and wrote the addresses to a CSV file:

def make_csv
  make_addresses
  CSV.open("lib/address-list.csv", "wb") do |csv|
    Address.all.each do |address|
      # Append the suite to the street column when there is one.
      street = address.suite.nil? ? address.street : "#{address.street} - #{address.suite}"
      csv << [address.title, street, address.city, address.state, address.zip, address.phone]
    end
  end
end
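To try the CSV step without the scraper, here is a self-contained sketch. The `Address` Struct is a hypothetical stand-in for my Address class (whose definition isn't shown in this post), the header row is an addition I did not have in the original, and `CSV.generate` builds a string where `CSV.open` would write a file:

```ruby
require "csv"

# Hypothetical stand-in for the post's Address class.
Address = Struct.new(:title, :street, :suite, :city, :state, :zip, :phone,
                     keyword_init: true)

# Assumed column names; the original file had no header row.
HEADERS = %w[Name Street City State Zip Phone].freeze

# Joins street and suite with " - " when a suite is present.
def street_line(address)
  [address.street, address.suite].compact.join(" - ")
end

# Builds the CSV as a string, one row per address.
def addresses_to_csv(addresses)
  CSV.generate do |csv|
    csv << HEADERS
    addresses.each do |a|
      csv << [a.title, street_line(a), a.city, a.state, a.zip, a.phone]
    end
  end
end

sample = [
  Address.new(title: "One of Many Business Names", street: "1234 Main Street",
              suite: "Suite 3456", city: "Brooklyn", state: "NY",
              zip: "11215", phone: "555-555-5555"),
  Address.new(title: "Another Business", street: "99 Side Street",
              city: "Brooklyn", state: "NY", zip: "11215", phone: "555-555-0000")
]

puts addresses_to_csv(sample)
```

Using `compact.join` removes the need for two nearly identical `csv <<` branches, since a nil suite simply drops out of the street column.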

In the end, everything worked out. I saved myself a lot of tedious cutting and pasting, and I learned a lot. I’m almost certain there is a better and faster way to write this code, but the end result gave me exactly what I wanted.
