Tuesday, May 22, 2012

Parse XML with Nokogiri then remove tags

Galloping Ghost of the Japanese Coast


Let's say you have an XML document containing authors.

The Nokogiri tutorial tells you that you can do this:

   authors = doc.xpath("//author")

And it shows you will get output like this:

   <author>Kernighan</author>
   <author>Ritchie</author>
   <author>Matsumoto</author>

How do you get rid of all those tags?

Instead of reading the Nokogiri documentation like I should have, I tried to further process this output.

A regex worked fine, but match returns nil when nothing matches, so you have to worry about an exception in that case. And it's ugly.

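require 'nokogiri'

doc = Nokogiri::XML(body)

auths = []
authors = doc.xpath("//author")

for i in 0..authors.length - 1
  # Serialize each Node back to a string, then capture the text between the tags.
  # match returns nil if the tags are missing, so [1] would raise NoMethodError.
  auths[i] = /.*<author>(.*)<\/author>.*/.match(authors[i].to_s)[1]
end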

I also tried string substitution, which also worked fine. I didn't test the no-match case.

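require 'nokogiri'

doc = Nokogiri::XML(body)

auths = []
authors = doc.xpath("//author")

for i in 0..authors.length - 1
  # Serialize each Node back to a string, then strip the opening and closing tags.
  auths[i] = authors[i].to_s.sub("<author>","").sub("</author>","")
end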

I knew I was parsing already parsed data and thought there should be an option to suppress the tags. I received some good advice to look more closely at Nokogiri and I came up with this approach of popping Nodes off of the NodeSet.

If you need to know how many Nodes there were originally, you have to save a copy (or just the length) before the first pop.

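require 'nokogiri'

doc = Nokogiri::XML(body)

auths = []
authors = doc.xpath("//author")

# pop removes the last Node from the NodeSet, so auths ends up
# in reverse document order and authors is empty afterwards.
while authors.length() > 0
  auths << authors.pop().inner_text()
end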

I think there might be an approach where you can iterate over the NodeSet without worrying about length and without using pop() but I haven't figured it out yet.

