NAME
    HTML::AsText::Fix - extends HTML::Element::as_text() to render text
    properly

VERSION
    version 0.003

SYNOPSIS
        # fix individual objects
        my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
        my $guard = HTML::AsText::Fix::object($tree);

        # fix deeply nested objects
        use URI;
        use Web::Scraper;

        # First, create your scraper block
        my $tweets = scraper {
            process "li.status", "tweets[]" => scraper {
                process ".entry-content", body => 'TEXT';
                process ".entry-date", when => 'TEXT';
                process 'a[rel="bookmark"]', link => '@href';
            };
        };

        my $res;
        {
            my $guard = HTML::AsText::Fix::global();
            $res = $tweets->scrape( URI->new("http://twitter.com/creaktive") );
        }

DESCRIPTION
    Consider the following HTML sample:
AAA BBB
Apple
    In that case, there really shouldn't be a space between "A" and "pple".
    To handle inline nodes properly, only block nodes are separated by a
    line break. The following nodes are currently assumed to be blocks:

    * p
    * h1 h2 h3 h4 h5 h6
    * dl dt dd
    * ol ul li
    * dir
    * address
    * blockquote
    * center
    * del
    * div
    * hr
    * ins
    * noscript script
    * pre
    * br (just to make sense)

    (source: http://en.wikipedia.org/wiki/HTML_element#Block_elements)

FUNCTIONS
  as_text
    The replacement function. Not to be used separately. It is injected
    into HTML::Element.

  global
    Hook into every HTML::Element within the lexical scope. Returns a guard
    object; destroying it unhooks safely. Accepts the following options:

    * lf_char: character inserted between block nodes (by default, $/);
    * zwsp_char: character inserted between inline nodes (by default,
      "\x{200b}", the Unicode zero-width space);
    * trim: trim leading/trailing spaces (considers "\x{A0}" as space!);
    * extra_chars: extra characters to trim;
    * skip_dels: if true, text content under "del" nodes is not included
      in what's returned.

    For example, to completely get rid of separation between inline nodes:

        my $guard = HTML::AsText::Fix::global(zwsp_char => '');

  object
    Hook a single object instance. Accepts the same options as "global":

        my $guard = HTML::AsText::Fix::object($tree, zwsp_char => '');

SEE ALSO
    * HTML::Element
    * HTML::Tree
    * HTML::FormatText
    * Monkey::Patch

ACKNOWLEDGEMENTS
    * Αριστοτέλης Παγκαλτζής
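
EXAMPLE
    The guard objects returned by "global" and "object" undo the hook when
    they are destroyed. The following is a minimal sketch of that general
    scope-guard idea in core Perl only. It is not this module's actual
    implementation: My::Node (a stand-in for HTML::Element), My::Guard and
    hook_as_text are hypothetical names invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTML::Element; not the real class.
package My::Node;
sub new     { my ($class, $text) = @_; return bless { text => $text }, $class }
sub as_text { my ($self) = @_; return $self->{text} }

# Hypothetical guard: runs a restore callback when it is destroyed.
package My::Guard;
sub new {
    my ($class, $restore) = @_;
    return bless { restore => $restore }, $class;
}
sub DESTROY { my ($self) = @_; $self->{restore}->() }

package main;

# Replace My::Node::as_text and return a guard that undoes the change.
sub hook_as_text {
    my ($replacement) = @_;
    my $original = \&My::Node::as_text;    # keep the original method
    no warnings 'redefine';
    *My::Node::as_text = $replacement;
    return My::Guard->new(sub {
        no warnings 'redefine';
        *My::Node::as_text = $original;    # put the original back
    });
}

my $node = My::Node->new('hello');
my ($inside, $outside);
{
    my $guard = hook_as_text(sub { uc $_[0]{text} });
    $inside = $node->as_text;    # replacement is active in this block
}
$outside = $node->as_text;       # guard destroyed, original restored

print "$inside $outside\n";      # prints "HELLO hello"
```

    Keeping the guard in a lexical variable ties the lifetime of the hook
    to the enclosing block, which is why the SYNOPSIS wraps the scrape call
    in its own braces.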