Convert XML to CSV in Perl

It is clearly stated that the “file isn’t fully XML” and that XML libraries can’t be used. Bummer 🙁

Then this has to be parsed using regex. Keep in mind that one then always has to keep an eye on the input files to see whether their format has changed; even the smallest change can easily throw off a regex, causing the program to break down at best or, much worse, introducing a quiet bug.

The shown format is easy to parse. Here’s a basic take, parsing the XML-like component section for any tags and their values, then printing for a given set of actual tags in needed order.

use warnings;
use strict;
use feature 'say';

my $section_name = 'component';   # XML-like section to parse
my @tags = qw(name age country);  # given tags and their order

my (%record, $in_XML);

while (<>) {
    if    (/^\s*<$section_name>\s*$/)   { $in_XML = 1 }
    elsif (/^\s*<\/$section_name>\s*$/) { $in_XML = 0 } 
    if ( $in_XML and m{<([^<]+)> ([^<]+) </\g{1}>}x ) {
        push @{$record{$1}}, $2;
    }
}

# Print out CSV-style output, with given tags
say join ',', @tags;
for my $i (0..$#{$record{$tags[0]}}) {
    say join ',', map { $record{$_}->[$i] } @tags;
}

A few assumptions are made about tags. Some important ones: each tag-pair is on one line; all tag names are unique. If these don’t hold, the code needs to be adjusted, which can be done but would take some work.
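
As one possible adjustment, here is a minimal sketch that keeps a separate hash for each section, so that a tag missing from one section yields an empty field instead of shifting the later rows. The helper names parse_components and csv_rows, and the sample input, are made up for this illustration; each tag-pair is still assumed to be on one line.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one hashref per XML-like section found in $text
sub parse_components {
    my ($text, $section_name) = @_;
    my (@records, $current);
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*<\Q$section_name\E>\s*$/) {
            $current = {};                      # a new section starts
        }
        elsif ($line =~ /^\s*<\/\Q$section_name\E>\s*$/) {
            push @records, $current if $current;
            undef $current;                     # outside of any section again
        }
        elsif ($current and $line =~ m{<([^<>]+)> ([^<]*) </\g{1}>}x) {
            $current->{$1} = $2;                # a repeated tag overwrites the earlier value
        }
    }
    return @records;
}

# Hypothetical helper: one CSV-style line per record, empty field for a missing tag
sub csv_rows {
    my ($records, @tags) = @_;
    return map { my $r = $_; join ',', map { $r->{$_} // '' } @tags } @$records;
}

# Made-up sample input; the second section has no <age>
my $input = <<'END';
some header text
<component>
  <name>Ann</name>
  <age>31</age>
  <country>NZ</country>
</component>
other text
<component>
  <name>Bob</name>
  <country>FI</country>
</component>
END

my @records = parse_components($input, 'component');
print "$_\n" for 'name,age,country', csv_rows(\@records, qw(name age country));
```

With this sample the second data line has an empty age field. Values that themselves contain commas or quotes would still need proper CSV quoting, which this sketch does not attempt.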

On top of matching the XML-like opening and closing tag-pair, <tagname>...</tagname>, I’ve also added a flag for when the processing is inside a component section. Testing the flag inside the if condition allows for other processing outside of the XML-like section; if there is none, we could instead have next if not $in_XML; before the if condition. This whole business may be unnecessary if there is no chance of an accidental XML-like tag-pair elsewhere in the text.
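
For comparison, a sketch of the next if not $in_XML; variant, for the case where nothing needs to be done with lines outside of the section. The sub name collect_tags and the sample lines are hypothetical, and the loop is wrapped in a sub only so that it can be tried on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: collect tag values, skipping everything outside the section
sub collect_tags {
    my ($section_name, @lines) = @_;
    my (%record, $in_XML);
    for (@lines) {
        if    (/^\s*<\Q$section_name\E>\s*$/)   { $in_XML = 1; next }
        elsif (/^\s*<\/\Q$section_name\E>\s*$/) { $in_XML = 0; next }
        next if not $in_XML;    # lines outside of the section are ignored
        push @{$record{$1}}, $2 if m{<([^<]+)> ([^<]+) </\g{1}>}x;
    }
    return \%record;
}

my $rec = collect_tags('component',
    '<name>ignored</name>',     # tag-pair outside of a section, skipped
    '<component>', '<name>Ann</name>', '</component>');
print join(',', @{ $rec->{name} }), "\n";
```

Only the value found inside the section is collected here; the stray tag-pair on the first line is dropped.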

Note that one doesn’t have to specify and use @tags but can print for the tags as found in the file, obtained as my @tags = keys %record, if that is acceptable and if order doesn’t matter.
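
A small sketch of printing for the tags as found, assuming %record has already been filled by the parsing loop; the data below is made up, and so is the sub name. Since keys returns the tags in no particular order, they are sorted here only so that the column order is the same from run to run.

```perl
use strict;
use warnings;

# Made-up sample of what %record might hold after parsing
my %record = (
    name    => ['Ann', 'Bob'],
    age     => [31, 45],
    country => ['NZ', 'FI'],
);

# Hypothetical helper: header line plus one line per row, for whatever tags exist
sub csv_lines_for_found_tags {
    my ($record) = @_;
    my @tags = sort keys %$record;      # sorted only for a repeatable column order
    return unless @tags;                # nothing was parsed
    my $last = $#{ $record->{$tags[0]} };
    return (
        join(',', @tags),
        map { my $i = $_; join ',', map { $record->{$_}[$i] } @tags } 0 .. $last,
    );
}

print "$_\n" for csv_lines_for_found_tags(\%record);
```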

Please add tests of whether those tags and their values are indeed what one expects. Realistic input files tend to occasionally have missing or unexpected parts.
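
As a starting point, a sketch of two such checks on the parsed data: tags that weren't asked for, and tags with a different number of values than the first one (which would misalign the rows). The sub name check_record and the sample data are hypothetical; what counts as a valid value depends on the actual files, so checks on the values themselves are left out.

```perl
use strict;
use warnings;

# Hypothetical checker: returns a list of problem descriptions, empty if none found
sub check_record {
    my ($record, @tags) = @_;
    my @problems;

    my %wanted;
    $wanted{$_} = 1 for @tags;
    for my $tag (sort keys %$record) {
        push @problems, "unexpected tag '$tag'" unless $wanted{$tag};
    }

    # Every wanted tag should have as many values as the first one
    my $expected = @tags ? scalar @{ $record->{$tags[0]} // [] } : 0;
    for my $tag (@tags) {
        my $count = scalar @{ $record->{$tag} // [] };
        push @problems, "tag '$tag' has $count values, expected $expected"
            if $count != $expected;
    }
    return @problems;
}

# Made-up sample with two problems
my %record = (
    name    => ['Ann', 'Bob'],
    age     => [31],                  # one value missing
    country => ['NZ', 'FI'],
    email   => ['a@example.com'],     # not in the list of tags
);
print "$_\n" for check_record(\%record, qw(name age country));
```

Whether a problem should stop the program or only be logged is a choice for the actual use case.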

It’d be far better to remedy the “isn’t fully XML” (make it XML) and use a library, if possible.
