Thread: Need a Perl script for evaluating .pdf files (14 answers)
Opened by ClaudiaRohmeier at 2013-03-06 15:09

murphy
 2013-03-11 19:28
Here is a small script to experiment with or build on:
Code (perl):
#!/usr/bin/perl
use 5.014;  # the s///r modifier used below needs Perl 5.14
use warnings;

use Getopt::Long;
use Pod::Usage;
use Text::CSV;

my $out;
my $verb = 1;
my $help = 0;
GetOptions(
  'output|o=s' => \$out,
  'verbose|v+' => \$verb,
  'help|h|?' => \$help,
) or pod2usage(-exitstatus => 2);

if ($help) {
  pod2usage(-exitstatus => 0, -verbose => $verb);
}

my ($key, $doc) = @ARGV;
unless (defined $key and defined $doc) {
  pod2usage(-exitstatus => 2);
}

unless (defined $out) {
  $out = $doc =~ s/(?:\.[^.]+)?$/.csv/r;
}

$|++ if ($verb > 1);

say "Reading keywords from '$key' ..." if ($verb > 2);
my @keywords = do {
  open my $in, '<', $key or die "Error opening keyword file: $!";
  my %unique;
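  while (<$in>) {  # plain $_: lexical "my $_" was removed in Perl 5.24
    chomp;
    for my $keyword (split /[\s.:!?,;()]+/) {
      next if $keyword eq '';  # split yields an empty first field after a leading separator
      $unique{$keyword} = 1;
    }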
  }
  keys %unique;
};

say scalar(@keywords), " keywords read" if ($verb > 1);

say "Scanning document '$doc', writing output to '$out' ..." if ($verb > 2);
my $ispdf = do {
  open my $in, '<', $doc or die "Error opening document file: $!";
  read $in, my $magic, 4;
  $magic eq '%PDF';
};

my $src;
if ($ispdf) {
  say "Document seems to be a PDF file" if ($verb > 2);
  open $src, '-|', 'pdftotext', $doc, '-' or die "Error opening document stream: $!";
}
else {
  say "Document does not seem to be a PDF file" if ($verb > 2);
  open $src, '<', $doc or die "Error opening document file: $!";
}

open my $tgt, '>', $out or die "Error opening output file: $!";
my $csv = Text::CSV->new({binary => 1, eol => $/});
$csv->print($tgt, [qw(Page Word Keyword Sentence)]);

my $page = 0;
my $word = 0;
my $sentence = '';
my @hits = ();
my $total = 0;
while (<$src>) {  # plain $_: lexical "my $_" was removed in Perl 5.24
  chomp;
  while ($_ ne '') {
    if (s/^\f//) {
      $page += 1;
      $word = 0;
    }
    elsif (s/^([^\s.:!?,;()]+)//) {
      my $candidate = $1;
      for my $keyword (@keywords) {
        if ($candidate eq $keyword) {
          print "$page,$word ... " if ($verb > 2);
          push @hits, [$page, $word, $keyword];
        }
      }
      $sentence .= ' ' if ($sentence ne '');
      $sentence .= $candidate;
      $word += 1;
    }
    elsif (s/^([.:!?,;()])//) {
      $sentence .= $1;
      for my $hit (@hits) {
        push @$hit, $sentence;
        $csv->print($tgt, $hit);
      }
      $total += @hits;
      $sentence = '';
      @hits = ();
    }
    else {
      s/^\s+//;
    }
  }
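}

# Flush hits from a trailing sentence that has no closing punctuation
for my $hit (@hits) {
  push @$hit, $sentence;
  $csv->print($tgt, $hit);
}
$total += @hits;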

say "Done" if ($verb > 2);
say "$total matches found" if ($verb > 1);

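# For a pipe, close fails with $! == 0 if the child merely exited non-zero
close $src
  or die $! ? "Failed to close document stream: $!"
            : "Document filter exited with status " . ($? >> 8);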
close $tgt or die "Failed to close output stream: $!";

__END__

=head1 NAME

keywords - Find keywords in PDF or text files

=head1 SYNOPSIS

keywords [OPTION ...] KEYWORDS DOCUMENT

=head1 OPTIONS

=over 4

=item B<--output=FILE>

=item B<-o FILE>

Write output to the given file. If no such option is given, the output
filename is constructed by replacing the extension of the input document
by C<.csv>.

=item B<--verbose>

=item B<-v>

Increases the verbosity of program output. Up to two instances of this
option currently make sense.

=item B<--help>

=item B<-h>

=item B<-?>

Shows documentation about the program. Combine with B<--verbose> to
view the entire manual page.

=back

=head1 DESCRIPTION

This program reads a list of keywords from a file and scans another
file for occurrences of those keywords.

Both the keyword and document file are split into words separated by
whitespace or any of the sentence separator characters C<.:!?,;()>.

If the document file is not plain text but a PDF file, it is
automatically filtered through the program C<pdftotext> and the output
is scanned instead.

While scanning the document, each occurrence of a keyword is printed
to the output in CSV format. The fields printed are

=over 4

=item the current page number, determined by counting form feeds (the first page is 0);

=item the number of the word counting from the start of the page (also starting at 0);

=item the matched keyword and

=item the sentence in which the keyword occurred.

=back

=head1 LICENSE

Copyright (c) 2013 by Thomas Chust L<mailto:chust@web.de>

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

=cut
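To try out the word and sentence splitting from the DESCRIPTION section without a PDF or a CSV file, something like the following may help. It is only a minimal sketch: the helper name find_keywords and its return format are made up for illustration and are not part of the script above. It also assumes a hash lookup in place of the loop over all keywords, and it flushes keywords found in a trailing sentence that has no closing punctuation.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the script above: scan a string and
# return one [keyword, sentence] pair per keyword occurrence.
# Words are separated by whitespace or one of the characters .:!?,;()
# and each of those characters also ends the current "sentence".
sub find_keywords {
  my ($text, @keywords) = @_;

  # Hash lookup instead of comparing against every keyword in a loop
  my %is_keyword = map { $_ => 1 } @keywords;

  my $sentence = '';
  my @pending;   # keywords seen in the sentence that is still open
  my @found;     # finished [keyword, sentence] pairs

  while ($text ne '') {
    if ($text =~ s/^([^\s.:!?,;()]+)//) {
      # A word: remember it if it is a keyword, then append it
      my $candidate = $1;
      push @pending, $candidate if $is_keyword{$candidate};
      $sentence .= ' ' if $sentence ne '';
      $sentence .= $candidate;
    }
    elsif ($text =~ s/^([.:!?,;()])//) {
      # A separator: close the sentence and emit the pending hits
      $sentence .= $1;
      push @found, map { [$_, $sentence] } @pending;
      $sentence = '';
      @pending  = ();
    }
    else {
      # Anything else can only be whitespace
      $text =~ s/^\s+//;
    }
  }

  # Emit hits from a last sentence without closing punctuation
  push @found, map { [$_, $sentence] } @pending;
  return @found;
}

for my $hit (find_keywords('Perl reads PDF files, and Perl is fun', 'Perl', 'PDF')) {
  print join(' | ', @$hit), "\n";
}
```

Note that the comma counts as a sentence separator here, just as in the DESCRIPTION, so the example text is reported as two separate "sentences".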
When C++ is your hammer, every problem looks like your thumb.
