DTA::TokWrap::Processor::mkbx - DTA tokenizer wrappers: (bx0doc,tx) -> bxdata
 use DTA::TokWrap::Processor::mkbx;
 
 $mbx = DTA::TokWrap::Processor::mkbx->new(%opts);
 $doc_or_undef = $mbx->mkbx($doc);
DTA::TokWrap::Processor::mkbx provides an object-oriented DTA::TokWrap::Processor wrapper for the creation of in-memory serialized text-block-indices.
Most users should use the high-level DTA::TokWrap wrapper class instead of using this module directly.
DTA::TokWrap::Processor::mkbx inherits from DTA::TokWrap::Processor.
 $obj = $CLASS_OR_OBJECT->new(%args);
Constructor.
%args, %$obj:
 ##-- Block-sorting: hints
 wbStr => $wbStr,                   ##-- word-break hint text
 sbStr => $sbStr,                   ##-- sentence-break hint text
 sortkey_attr => $attr,             ##-- sort-key attribute (default='dta.tw.key'; should agree with mkbx0)
 
 ##-- Block-sorting: low-level data
 xp    => $xml_parser,              ##-- XML::Parser object for parsing $doc->{bx0doc}

 %defaults = CLASS->defaults();
Static class-dependent defaults.
 $mbx = $mbx->init();
Dynamic object-dependent defaults.
 $xp = $mbx->initXmlParser();
Create & initialize $mbx->{xp}, an XML::Parser object used to parse $doc->{bx0data}.
 $doc_or_undef = $CLASS_OR_OBJECT->mkbx($doc);
Creates the serialized text-block-index $doc->{bxdata} for the DTA::TokWrap::Document object $doc.
Relevant %$doc keys:
 bx0doc  => $bx0doc,  ##-- (input) preliminary block-index data (XML::LibXML::Document)
 txfile  => $txfile,  ##-- (input) raw text index filename
 bxdata  => \@blocks, ##-- (output) serialized block index
 ##
 mkbx_stamp0 => $f,   ##-- (output) timestamp of operation begin
 mkbx_stamp  => $f,   ##-- (output) timestamp of operation end
 bxdata_stamp => $f,  ##-- (output) timestamp of operation end
Block data: @{$doc->{bxdata}} = @blocks = ($blk0, ..., $blkN); %$blk =
 key    => $sortkey, ##-- (inherited) sort key
 elt    => $eltname, ##-- element name which created this block
 xoff   => $xoff,    ##-- XML byte offset where this block run begins
 xlen   => $xlen,    ##-- XML byte length of this block (0 for hints)
 toff   => $toff,    ##-- raw-text (.tx) byte offset where this block run begins
 tlen   => $tlen,    ##-- raw-text (.tx) byte length of this block (0 for hints)
 otext  => $otext,   ##-- output text (.txt) for this block
 otoff  => $otoff,   ##-- output text (.txt) byte offset where this block run begins
 otlen  => $otlen,   ##-- output text (.txt) length (bytes)

 \@blocks = $mbx->prune_empty_blocks(\@blocks);
 \@blocks = $mbx->prune_empty_blocks();
Low-level utility.
Removes empty 'c'-type blocks from @blocks (default=$mbx->{blocks}).
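The pruning step can be illustrated with a minimal sketch. It is not the module's own code: the function name prune_empty_c_blocks is hypothetical, the sample blocks are made up, and "empty" is assumed here to mean a 'c'-type block whose raw-text length (tlen) is zero.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $mbx->prune_empty_blocks(); not the module's code.
# Assumption: an "empty" block is a 'c'-type block with tlen == 0.
sub prune_empty_c_blocks {
  my ($blocks) = @_;
  # Filter the array in place so the caller's reference stays valid.
  @$blocks = grep { !($_->{elt} eq 'c' && $_->{tlen} == 0) } @$blocks;
  return $blocks;
}

# Made-up sample data using the block keys documented above.
my @sample_blocks = (
  { key => 'k0', elt => 'c', xoff => 10, xlen => 8, toff => 0, tlen => 1 },
  { key => 'k0', elt => 'c', xoff => 18, xlen => 7, toff => 1, tlen => 0 },  # empty 'c' block
  { key => 'k0', elt => 'w', xoff => 25, xlen => 0, toff => 1, tlen => 0 },  # non-'c' block, kept
);
my $pruned = prune_empty_c_blocks(\@sample_blocks);
print scalar(@$pruned), " blocks kept\n";
```

Only the zero-length 'c' block is dropped in this sketch; zero-length blocks of other element types, such as hints, are left alone.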
 \@blocks = $mbx->sort_blocks(\@blocks);
Low-level utility.
Sorts \@blocks (default=$mbx->{blocks}) using $mbx->{key2i}.
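As a rough illustration of key-based block sorting, the sketch below assumes that key2i maps each sort key to an integer rank and that ties are broken by XML byte offset. Both points are assumptions, not documented behavior, and the function name sort_blocks_by_key and the sample keys are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $mbx->sort_blocks(); not the module's code.
# Assumptions: %$key2i maps each sort key to an integer rank, and ties are
# broken by XML byte offset so blocks sharing a key keep document order.
sub sort_blocks_by_key {
  my ($blocks, $key2i) = @_;
  @$blocks = sort {
    $key2i->{ $a->{key} } <=> $key2i->{ $b->{key} }
      || $a->{xoff} <=> $b->{xoff}
  } @$blocks;
  return $blocks;
}

# Made-up keys and offsets.
my %key2i = (main => 0, note => 1);
my @unsorted = (
  { key => 'main', elt => 'c', xoff => 10 },
  { key => 'note', elt => 'c', xoff => 20 },
  { key => 'main', elt => 'c', xoff => 30 },
);
my $sorted = sort_blocks_by_key(\@unsorted, \%key2i);
print join(' ', map { "$_->{key}\@$_->{xoff}" } @$sorted), "\n";
```

With this sample data, both 'main' blocks come before the 'note' block, even though the note sits between them in the XML source.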
 \@blocks = $mbx->compute_block_text(\@blocks, \$txbuf);
 \@blocks = $mbx->compute_block_text(\@blocks);
 \@blocks = $mbx->compute_block_text();
Low-level utility.
Sets $blk->{otoff}, $blk->{otlen}, $blk->{otext} for each block $blk in @blocks (default=$mbx->{blocks}) by extracting raw-text (.tx) substrings from \$txbuf (default=$mbx->{txbufr}).
\@blocks should already have been sorted before this method is called.
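The relationship between the raw-text fields (toff, tlen) and the output fields (otext, otoff, otlen) can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the name compute_block_text_sketch is hypothetical, the buffer is treated as an undecoded byte string, output offsets are assumed to be running byte totals over the sorted blocks, and any special handling of hint text (wbStr, sbStr) is left out.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $mbx->compute_block_text(); not the module's code.
# Assumptions: each block's output text is the raw-text substring starting at
# toff with length tlen, and output offsets are running byte totals.
sub compute_block_text_sketch {
  my ($blocks, $txbufr) = @_;
  my $otoff = 0;
  foreach my $blk (@$blocks) {
    $blk->{otext} = substr($$txbufr, $blk->{toff}, $blk->{tlen});
    $blk->{otoff} = $otoff;
    $blk->{otlen} = length($blk->{otext});
    $otoff += $blk->{otlen};
  }
  return $blocks;
}

# Made-up raw text; the note precedes the body in the raw text, but the
# already-sorted blocks place the body first in the output.
my $txbuf = "NoteBody";
my @sorted_blocks = (
  { key => 'main', toff => 4, tlen => 4 },
  { key => 'note', toff => 0, tlen => 4 },
);
compute_block_text_sketch(\@sorted_blocks, \$txbuf);
print join('', map { $_->{otext} } @sorted_blocks), "\n";
```

Because output offsets accumulate in block order, the input order matters, which is why the blocks need to be sorted first.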
DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...
Bryan Jurish <moocow@cpan.org>
Copyright (C) 2009-2018 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.