Kevin Ladenheim: Parse XML with Nokogiri then remove tags

Galloping Ghost of the Japanese Coast

Let's say you have an XML document containing authors.

The Nokogiri tutorial tells you that you can do this:

authors = doc.xpath("//author")

And it shows that you will get output like this:

<author>Kernighan</author>
<author>Ritchie</author>
<author>Matsumoto</author>

How do you get rid of all those tags?

Instead of reading the Nokogiri documentation like I should have, I tried to further process this output.

A regex worked fine, but you have to worry about an exception if there is no match: match returns nil, and calling [1] on nil raises NoMethodError. And it's ugly.

doc = Nokogiri::XML(body)

auths = []
authors = doc.xpath("//author")

for i in 0..authors.length - 1
  # authors (plural) is the NodeSet; to_s serializes each Node back to tagged XML
  auths[i] = /.*<author>(.*)<\/author>.*/.match(authors[i].to_s)[1]
end
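
The no-match worry can be sidestepped without leaving regex territory. Here is a minimal sketch using plain Ruby strings in place of Nokogiri nodes; the helper name author_name and the sample fragments are made up for illustration:

```ruby
# Hypothetical helper: pull the text out of a serialized <author> element.
# String#[] with a regex and a capture index returns nil when nothing
# matches, instead of raising the way calling [1] on nil would.
def author_name(xml_fragment)
  xml_fragment[/<author>(.*)<\/author>/, 1]
end

fragments = ["<author>Kernighan</author>", "<editor>Nobody</editor>"]

# compact drops the nil produced by the fragment that did not match.
names = fragments.map { |f| author_name(f) }.compact
puts names.inspect
```

It is still parsing already parsed data, but at least a missing tag no longer blows up the loop.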

I also tried string substitution, which worked fine too. I didn't test the no-match case.

doc = Nokogiri::XML(body)

auths = []
authors = doc.xpath("//author")

for i in 0..authors.length - 1
  # authors (plural) is the NodeSet; strip the opening tag, then the closing tag
  auths[i] = authors[i].to_s.sub("<author>","").sub("</author>","")
end
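
For what it's worth, the no-match case looks harmless with this approach, because String#sub returns the string unchanged when the pattern isn't found. A quick check with plain strings (no Nokogiri needed; the variable names are just for illustration):

```ruby
# sub replaces only the first occurrence and leaves the string alone
# when the pattern is absent, so nothing is raised.
tagged   = "<author>Matsumoto</author>"
untagged = "Matsumoto"

stripped_tagged   = tagged.sub("<author>", "").sub("</author>", "")
stripped_untagged = untagged.sub("<author>", "").sub("</author>", "")

puts stripped_tagged
puts stripped_untagged
```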

I knew I was parsing already parsed data and thought there should be an option to suppress the tags. I received some good advice to look more closely at Nokogiri and I came up with this approach of popping Nodes off of the NodeSet.

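If you need to know how many Nodes there were originally, you have to save a copy before the first pop. Also note that pop takes Nodes from the end of the NodeSet, so auths ends up in reverse document order.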

doc = Nokogiri::XML(body)

auths = []
authors = doc.xpath("//author")

# pop removes and returns the last Node, so the NodeSet shrinks on every pass
while authors.length > 0
  auths << authors.pop.inner_text
end

I think there might be an approach where you can iterate over the NodeSet without worrying about length and without using pop() but I haven't figured it out yet.

Kevin Ladenheim

Tuesday, May 22, 2012
