tesseract  3.05.02
pdfrenderer.cpp
1 // File: pdfrenderer.cpp
3 // Description: PDF rendering interface to inject into TessBaseAPI
4 //
5 // (C) Copyright 2011, Google Inc.
6 // Licensed under the Apache License, Version 2.0 (the "License");
7 // you may not use this file except in compliance with the License.
8 // You may obtain a copy of the License at
9 // http://www.apache.org/licenses/LICENSE-2.0
10 // Unless required by applicable law or agreed to in writing, software
11 // distributed under the License is distributed on an "AS IS" BASIS,
12 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 // See the License for the specific language governing permissions and
14 // limitations under the License.
15 //
17 
18 // Include automatically generated configuration file if running autoconf.
19 #ifdef HAVE_CONFIG_H
20 #include "config_auto.h"
21 #endif
22 
23 #include "allheaders.h"
24 #include "baseapi.h"
25 #include "math.h"
26 #include "renderer.h"
27 #include "strngs.h"
28 #include "tprintf.h"
29 
30 #ifdef _MSC_VER
31 #include "mathfix.h"
32 #endif
33 
34 /*
35 
36 Design notes from Ken Sharp, with light editing.
37 
38 We think one solution is a font with a single glyph (.notdef) and a
39 CIDToGIDMap which maps all the CIDs to 0. That map would then be
40 stored as a stream in the PDF file, and when flate compressed should
41 be pretty small. The font, of course, will be approximately the same
42 size as the one you currently use.
43 
44 I'm working on such a font now, the CIDToGIDMap is trivial, you just
45 create a stream object which contains 128k bytes (2 bytes per possible
46 CID and your CIDs range from 0 to 65535) and where you currently have
47 "/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".
48 
49 Note that if, in future, you were to use a different (ie not 2 byte)
50 CMap for character codes you could trivially extend the CIDToGIDMap.
51 
52 The following is an explanation of how some of the font stuff works.
53 This may be too simple for you, in which case please accept my
54 apologies; it's hard to know how much knowledge someone has. You can
55 skip all this anyway, it's just for information.
56 
57 The font embedded in a PDF file is usually intended just to be
58 rendered, but extensions allow for at least some ability to locate (or
59 copy) text from a document. This isn't something which was an original
60 goal of the PDF format, but it's been retro-fitted, presumably due to
61 popular demand.
62 
63 To do this reliably the PDF file must contain a ToUnicode CMap, a
64 device for mapping character codes to Unicode code points. If one of
65 these is present, then this will be used to convert the character
66 codes into Unicode values. If it's not present then the reader will
67 fall back through a series of heuristics to try and guess the
68 result. This is, as you would expect, prone to failure.
69 
70 This doesn't concern you of course, since you always write a ToUnicode
71 CMap. Because you are writing the text in text rendering mode 3, it
72 would seem that you don't really need to worry about this. But in the
73 PDF spec you cannot have an isolated ToUnicode CMap; it has to be
74 attached to a font, so in order to get even copy/paste to work you
75 need to define a font.
76 
77 This is what leads to problems: tools like pdfwrite assume that they
78 are going to be able to (or even have to) modify the font entries, so
79 they require that the font being embedded be valid, and to be honest
80 the font Tesseract embeds isn't valid (for this purpose).
81 
82 
83 To see why, let's look at how text is specified in a PDF file:
84 
85 (Test) Tj
86 
87 Now that looks like text but actually it isn't. Each of those bytes is
88 a 'character code'. When it comes to rendering the text a complex
89 sequence of events takes place, which converts the character code into
90 'something' which the font understands. It's entirely possible via
91 character mappings to have that text render as 'Sftu'.
92 
93 For simple fonts (PostScript type 1), we use the character code as the
94 index into an Encoding array (256 elements), each element of which is
95 a glyph name, so this gives us a glyph name. We then consult the
96 CharStrings dictionary in the font, that's a complex object which
97 contains pairs of keys and values, you can use the key to retrieve a
98 given value. So we have a glyph name, we then use that as the key to
99 the dictionary and retrieve the associated value. For a type 1 font,
100 the value is a glyph program that describes how to draw the glyph.
101 
102 For CIDFonts, it's a little more complicated. Because CIDFonts can be
103 large, using a glyph name as the key is unreasonable (it would also
104 lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
105 as the key. CIDs are just numbers.
106 
107 But.... We don't use the character code as the CID. What we do is use
108 a CMap to convert the character code into a CID. We then use the CID
109 to key the CharStrings dictionary and proceed as before. So the 'CMap'
110 is the equivalent of the Encoding array, but it's a more compact and
111 flexible representation.
112 
113 Note that you have to use the CMap just to find out how many bytes
114 constitute a character code, and it can be variable. For example you
115 can say if the first byte is 0x00->0x7f then it's just one byte, if it's
116 0x80->0xf0 then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
117 have seen CMaps defining character codes up to 5 bytes wide.
118 
119 Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
120 TrueType CIDFonts. The thing is that TrueType fonts are accessed using
121 a Glyph ID (GID) (and the LOCA table) which may well not be anything
122 like the CID. So for this case PDF includes a CIDToGIDMap. That maps
123 the CIDs to GIDs, and we can then use the GID to get the glyph
124 description from the GLYF table of the font.
125 
126 So for a TrueType CIDFont, character-code->CID->GID->glyf-program.
127 
128 Looking at the PDF file I was supplied with we see that it contains
129 text like :
130 
131 <0x0075> Tj
132 
133 So we start by taking the character code (117) and look it up in the
134 CMap. Well you don't supply a CMap, you just use the Identity-H one
135 which is predefined. So character code 117 maps to CID 117. Then we
136 use the CIDToGIDMap, again you don't supply one, you just use the
137 predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
138 were supplied with only contains 116 glyphs.
139 
140 Now for Latin that's not a huge problem, you can just supply a bigger
141 font. But for more complex languages that *is* going to be more of a
142 problem. Either you need to supply a font which contains glyphs for
143 all the possible CID->GID mappings, or we need to think laterally.
144 
145 Our solution using a TrueType CIDFont is to intervene at the
146 CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
147 font with just one glyph, the .notdef glyph at GID 0. This is what I'm
148 looking into now.
149 
150 It would also be possible to have a 'PostScript' (ie type 1 outlines)
151 CIDFont which contained 1 glyph, and a CMap which mapped all character
152 codes to CID 0. The effect would be the same.
153 
154 It's possible (I haven't checked) that the PostScript CIDFont and
155 associated CMap would be smaller than the TrueType font and associated
156 CIDToGIDMap.
157 
158 --- in a followup ---
159 
160 OK there is a small problem there, if I use GID 0 then Acrobat gets
161 upset about it and complains it cannot extract the font. If I set the
162 CIDToGIDMap so that all the entries are 1 instead, it's happy. Totally
163 mad......
164 
165 */
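// A minimal sketch of the "map every CID to one glyph" idea from the notes
// above, assuming an uncompressed table of big-endian 2-byte entries for
// CIDs 0..65535. BuildConstantCidToGidMap is a hypothetical helper, not part
// of this file; the renderer builds its table inline and flate-compresses it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns a 128 KiB CIDToGIDMap in which every CID
// maps to the same glyph ID. Each entry is two bytes, high byte first.
std::vector<unsigned char> BuildConstantCidToGidMap(unsigned gid) {
  assert(gid <= 0xFFFF);  // a GID must fit in two bytes
  const std::size_t kNumCids = 1u << 16;
  std::vector<unsigned char> map(2 * kNumCids);
  for (std::size_t cid = 0; cid < kNumCids; ++cid) {
    map[2 * cid] = static_cast<unsigned char>(gid >> 8);        // high byte
    map[2 * cid + 1] = static_cast<unsigned char>(gid & 0xFF);  // low byte
  }
  return map;
}
```

// Passing 1 rather than 0 mirrors the Acrobat workaround described in the
// followup note.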
166 
167 namespace tesseract {
168 
169 // Use for PDF object fragments. Must be large enough
170 // to hold a colormap with 256 colors in the verbose
171 // PDF representation.
172 static const int kBasicBufSize = 2048;
173 
174 // If the font is 10 pts, nominal character width is 5 pts
175 static const int kCharWidth = 2;
176 
177 // Used for memory allocation. A codepoint must take no more than this
178 // many bytes, when written in the PDF way. e.g. "<0063>" for the
179 // letter 'c'
180 static const int kMaxBytesPerCodepoint = 20;
181 
182 /**********************************************************************
183  * PDF Renderer interface implementation
184  **********************************************************************/
185 
186 TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir)
187  : TessResultRenderer(outputbase, "pdf") {
188  obj_ = 0; datadir_ = datadir; textonly_ = false; offsets_.push_back(0);
189 }
190 
191 TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir,
192  bool textonly)
193  : TessResultRenderer(outputbase, "pdf") {
194  obj_ = 0;
195  datadir_ = datadir;
196  textonly_ = textonly;
197  offsets_.push_back(0);
198 }
199 
200 void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
201  offsets_.push_back(objectsize + offsets_.back());
202  obj_++;
203 }
204 
205 void TessPDFRenderer::AppendPDFObject(const char *data) {
206  AppendPDFObjectDIY(strlen(data));
207  AppendString((const char *)data);
208 }
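// The offset bookkeeping above can be sketched in isolation. This is an
// illustration only: OffsetTracker is a hypothetical stand-in that uses
// std::vector and std::string instead of the renderer's own members.
// After each append, offsets[i] is the byte offset at which object i starts
// and offsets.back() is the number of bytes written so far.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the renderer's obj_ / offsets_ pair.
struct OffsetTracker {
  std::vector<long> offsets;  // cumulative byte offsets, seeded with 0
  long obj;                   // number of objects recorded so far

  OffsetTracker() : obj(0) { offsets.push_back(0); }

  // Record one object (or the file header) of the given contents.
  void Append(const std::string &data) {
    offsets.push_back(offsets.back() + static_cast<long>(data.size()));
    ++obj;
  }
};
```

// The file header counts as the first append, so the entry for object 1
// equals the header length.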
209 
210 // Helper function to prevent us from accidentally writing
211 // scientific notation to an HOCR or PDF file. Besides, three
212 // decimal places are all you really need.
213 double prec(double x) {
214  double kPrecision = 1000.0;
215  double a = round(x * kPrecision) / kPrecision;
216  if (a == 0)  // Also true for -0.0; return +0 so "-0" is never printed.
217  return 0;
218  return a;
219 }
220 
221 long dist2(int x1, int y1, int x2, int y2) {
222  return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
223 }
224 
225 // Viewers like evince can get really confused during copy-paste when
226 // the baseline wanders around. So I've decided to project every word
227 // onto the (straight) line baseline. All numbers are in the native
228 // PDF coordinate system, which has the origin in the bottom left and
229 // the unit is points, which is 1/72 inch. Tesseract reports baselines
230 // left-to-right no matter what the reading order is. We need the
231 // word baseline in reading order, so we do that conversion here. Returns
232 // the word's baseline origin and length.
233 void GetWordBaseline(int writing_direction, int ppi, int height,
234  int word_x1, int word_y1, int word_x2, int word_y2,
235  int line_x1, int line_y1, int line_x2, int line_y2,
236  double *x0, double *y0, double *length) {
237  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
238  Swap(&word_x1, &word_x2);
239  Swap(&word_y1, &word_y2);
240  }
241  double word_length;
242  double x, y;
243  {
244  int px = word_x1;
245  int py = word_y1;
246  double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
247  if (l2 == 0) {
248  x = line_x1;
249  y = line_y1;
250  } else {
251  double t = ((px - line_x2) * (line_x2 - line_x1) +
252  (py - line_y2) * (line_y2 - line_y1)) / l2;
253  x = line_x2 + t * (line_x2 - line_x1);
254  y = line_y2 + t * (line_y2 - line_y1);
255  }
256  word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1,
257  word_x2, word_y2)));
258  word_length = word_length * 72.0 / ppi;
259  x = x * 72 / ppi;
260  y = height - (y * 72.0 / ppi);
261  }
262  *x0 = x;
263  *y0 = y;
264  *length = word_length;
265 }
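// The projection step in GetWordBaseline can be sketched on its own. This is
// a simplified illustration, not the function above: Point, ProjectOntoLine
// and ToPdfPoint are hypothetical names, and the sketch measures from the
// first line endpoint rather than the second (the projected point is the
// same either way).

```cpp
#include <cassert>
#include <cmath>

struct Point { double x, y; };

// Orthogonal projection of p onto the line through a and b.
// Falls back to a when the line is degenerate (a == b).
Point ProjectOntoLine(Point p, Point a, Point b) {
  double dx = b.x - a.x, dy = b.y - a.y;
  double l2 = dx * dx + dy * dy;
  if (l2 == 0) return a;
  double t = ((p.x - a.x) * dx + (p.y - a.y) * dy) / l2;
  Point q = {a.x + t * dx, a.y + t * dy};
  return q;
}

// Convert image pixels (origin top left) to PDF points (origin bottom left).
Point ToPdfPoint(Point pixel, double ppi, double page_height_pts) {
  assert(ppi > 0);
  Point q = {pixel.x * 72.0 / ppi, page_height_pts - pixel.y * 72.0 / ppi};
  return q;
}
```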
266 
267 // Compute coefficients for an affine matrix describing the rotation
268 // of the text. If the text is right-to-left such as Arabic or Hebrew,
269 // we reflect over the Y-axis. This matrix will set the coordinate
270 // system for placing text in the PDF file.
271 //
272 // RTL
273 // [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
274 // [ y' ] [ c d ][ y ] [ 0 1 ] [-sin cos ][ y ]
275 void AffineMatrix(int writing_direction,
276  int line_x1, int line_y1, int line_x2, int line_y2,
277  double *a, double *b, double *c, double *d) {
278  double theta = atan2(static_cast<double>(line_y1 - line_y2),
279  static_cast<double>(line_x2 - line_x1));
280  *a = cos(theta);
281  *b = sin(theta);
282  *c = -sin(theta);
283  *d = cos(theta);
284  switch(writing_direction) {
285  case WRITING_DIRECTION_RIGHT_TO_LEFT:
286  *a = -*a;
287  *b = -*b;
288  break;
289  case WRITING_DIRECTION_TOP_TO_BOTTOM:
290  // TODO(jbreiden) Consider using the vertical PDF writing mode.
291  break;
292  default:
293  break;
294  }
295 }
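// A small sketch of the coefficient computation described above, assuming
// the baseline is given in image coordinates (y grows downward). Matrix2 and
// TextMatrix are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cmath>

struct Matrix2 { double a, b, c, d; };

// Rotation that aligns text with the baseline (x1,y1)-(x2,y2).
// For right-to-left text the first row is negated, which mirrors the
// text direction along the baseline.
Matrix2 TextMatrix(double x1, double y1, double x2, double y2, bool rtl) {
  double theta = std::atan2(y1 - y2, x2 - x1);  // flip y: image to PDF
  Matrix2 m = {std::cos(theta), std::sin(theta),
               -std::sin(theta), std::cos(theta)};
  if (rtl) {
    m.a = -m.a;
    m.b = -m.b;
  }
  return m;
}
```

// A perfectly horizontal baseline gives the identity matrix for
// left-to-right text, and a = -1 for right-to-left text.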
296 
297 // There are some really awkward PDF viewers in the wild, such as
298 // 'Preview' which ships with the Mac. They do a better job with text
299 // selection and highlighting when given perfectly flat baseline
300 // instead of very slightly tilted. We clip small tilts to appease
301 // these viewers. I chose this threshold large enough to absorb noise,
302 // but small enough that lines probably won't cross each other if the
303 // whole page is tilted at almost exactly the clipping threshold.
304 void ClipBaseline(int ppi, int x1, int y1, int x2, int y2,
305  int *line_x1, int *line_y1,
306  int *line_x2, int *line_y2) {
307  *line_x1 = x1;
308  *line_y1 = y1;
309  *line_x2 = x2;
310  *line_y2 = y2;
311  double rise = abs(y2 - y1) * 72.0 / ppi;
312  double run = abs(x2 - x1) * 72.0 / ppi;
313  if (rise < 2.0 && 2.0 < run)
314  *line_y1 = *line_y2 = (y1 + y2) / 2;
315 }
316 
317 bool CodepointToUtf16be(int code, char utf16[kMaxBytesPerCodepoint]) {
318  if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
319  tprintf("Dropping invalid codepoint %d\n", code);
320  return false;
321  }
322  if (code < 0x10000) {
323  snprintf(utf16, kMaxBytesPerCodepoint, "%04X", code);
324  } else {
325  int a = code - 0x010000;
326  int high_surrogate = (0x03FF & (a >> 10)) + 0xD800;
327  int low_surrogate = (0x03FF & a) + 0xDC00;
328  snprintf(utf16, kMaxBytesPerCodepoint,
329  "%04X%04X", high_surrogate, low_surrogate);
330  }
331  return true;
332 }
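// The surrogate-pair arithmetic above can be tried out in isolation. This is
// a sketch using std::string; ToUtf16BeHex is a hypothetical name, and unlike
// the function above it signals an invalid code point with an empty string.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Returns the code point as upper-case hex UTF-16BE ("0063" for 'c'),
// or an empty string for surrogates and out-of-range values.
std::string ToUtf16BeHex(int code) {
  if (code < 0 || (code > 0xD7FF && code < 0xE000) || code > 0x10FFFF)
    return "";
  char buf[16];
  if (code < 0x10000) {
    std::snprintf(buf, sizeof(buf), "%04X", code);
  } else {
    int a = code - 0x10000;                   // 20 significant bits remain
    int high = 0xD800 + ((a >> 10) & 0x3FF);  // top 10 bits
    int low = 0xDC00 + (a & 0x3FF);           // bottom 10 bits
    std::snprintf(buf, sizeof(buf), "%04X%04X", high, low);
  }
  return buf;
}
```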
333 
334 char* TessPDFRenderer::GetPDFTextObjects(TessBaseAPI* api,
335  double width, double height) {
336  STRING pdf_str("");
337  double ppi = api->GetSourceYResolution();
338 
339  // These initial conditions are all arbitrary and will be overwritten
340  double old_x = 0.0, old_y = 0.0;
341  int old_fontsize = 0;
342  tesseract::WritingDirection old_writing_direction =
343  WRITING_DIRECTION_LEFT_TO_RIGHT;
344  bool new_block = true;
345  int fontsize = 0;
346  double a = 1;
347  double b = 0;
348  double c = 0;
349  double d = 1;
350 
351  // TODO(jbreiden) This marries the text and image together.
352  // Slightly cleaner from an abstraction standpoint if this were to
353  // live inside a separate text object.
354  pdf_str += "q ";
355  pdf_str.add_str_double("", prec(width));
356  pdf_str += " 0 0 ";
357  pdf_str.add_str_double("", prec(height));
358  pdf_str += " 0 0 cm";
359  if (!textonly_) {
360  pdf_str += " /Im1 Do";
361  }
362  pdf_str += " Q\n";
363 
364  int line_x1 = 0;
365  int line_y1 = 0;
366  int line_x2 = 0;
367  int line_y2 = 0;
368 
369  ResultIterator *res_it = api->GetIterator();
370  while (!res_it->Empty(RIL_BLOCK)) {
371  if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
372  pdf_str += "BT\n3 Tr"; // Begin text object, use invisible ink
373  old_fontsize = 0; // Every block will declare its fontsize
374  new_block = true; // Every block will declare its affine matrix
375  }
376 
377  if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
378  int x1, y1, x2, y2;
379  res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
380  ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
381  }
382 
383  if (res_it->Empty(RIL_WORD)) {
384  res_it->Next(RIL_WORD);
385  continue;
386  }
387 
388  // Writing direction changes at a per-word granularity
389  tesseract::WritingDirection writing_direction;
390  {
391  tesseract::Orientation orientation;
392  tesseract::TextlineOrder textline_order;
393  float deskew_angle;
394  res_it->Orientation(&orientation, &writing_direction,
395  &textline_order, &deskew_angle);
396  if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
397  switch (res_it->WordDirection()) {
398  case DIR_LEFT_TO_RIGHT:
399  writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
400  break;
401  case DIR_RIGHT_TO_LEFT:
402  writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
403  break;
404  default:
405  writing_direction = old_writing_direction;
406  }
407  }
408  }
409 
410  // Where is word origin and how long is it?
411  double x, y, word_length;
412  {
413  int word_x1, word_y1, word_x2, word_y2;
414  res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
415  GetWordBaseline(writing_direction, ppi, height,
416  word_x1, word_y1, word_x2, word_y2,
417  line_x1, line_y1, line_x2, line_y2,
418  &x, &y, &word_length);
419  }
420 
421  if (writing_direction != old_writing_direction || new_block) {
422  AffineMatrix(writing_direction,
423  line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
424  pdf_str.add_str_double(" ", prec(a)); // . This affine matrix
425  pdf_str.add_str_double(" ", prec(b)); // . sets the coordinate
426  pdf_str.add_str_double(" ", prec(c)); // . system for all
427  pdf_str.add_str_double(" ", prec(d)); // . text that follows.
428  pdf_str.add_str_double(" ", prec(x)); // .
429  pdf_str.add_str_double(" ", prec(y)); // .
430  pdf_str += (" Tm "); // Place cursor absolutely
431  new_block = false;
432  } else {
433  double dx = x - old_x;
434  double dy = y - old_y;
435  pdf_str.add_str_double(" ", prec(dx * a + dy * b));
436  pdf_str.add_str_double(" ", prec(dx * c + dy * d));
437  pdf_str += (" Td "); // Relative moveto
438  }
439  old_x = x;
440  old_y = y;
441  old_writing_direction = writing_direction;
442 
443  // Adjust font size on a per word granularity. Pay attention to
444  // fontsize, old_fontsize, and pdf_str. We've found that for
445  // Arabic, Tesseract will happily return a fontsize of zero,
446  // so we make up a default number to protect ourselves.
447  {
448  bool bold, italic, underlined, monospace, serif, smallcaps;
449  int font_id;
450  res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace,
451  &serif, &smallcaps, &fontsize, &font_id);
452  const int kDefaultFontsize = 8;
453  if (fontsize <= 0)
454  fontsize = kDefaultFontsize;
455  if (fontsize != old_fontsize) {
456  char textfont[20];
457  snprintf(textfont, sizeof(textfont), "/f-0-0 %d Tf ", fontsize);
458  pdf_str += textfont;
459  old_fontsize = fontsize;
460  }
461  }
462 
463  bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
464  bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
465  STRING pdf_word("");
466  int pdf_word_len = 0;
467  do {
468  const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
469  if (grapheme && grapheme[0] != '\0') {
470  GenericVector<int> unicodes;
471  UNICHAR::UTF8ToUnicode(grapheme, &unicodes);
472  char utf16[kMaxBytesPerCodepoint];
473  for (int i = 0; i < unicodes.length(); i++) {
474  int code = unicodes[i];
475  if (CodepointToUtf16be(code, utf16)) {
476  pdf_word += utf16;
477  pdf_word_len++;
478  }
479  }
480  }
481  delete []grapheme;
482  res_it->Next(RIL_SYMBOL);
483  } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
484  if (word_length > 0 && pdf_word_len > 0 && fontsize > 0) {
485  double h_stretch =
486  kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
487  pdf_str.add_str_double("", h_stretch);
488  pdf_str += " Tz"; // horizontal stretch
489  pdf_str += " [ <";
490  pdf_str += pdf_word; // UTF-16BE representation
491  pdf_str += "> ] TJ"; // show the text
492  }
493  if (last_word_in_line) {
494  pdf_str += " \n";
495  }
496  if (last_word_in_block) {
497  pdf_str += "ET\n"; // end the text object
498  }
499  }
500  char *ret = new char[pdf_str.length() + 1];
501  strcpy(ret, pdf_str.string());
502  delete res_it;
503  return ret;
504 }
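// The horizontal stretch (Tz) value used above follows from the glyphless
// font's fixed advance: each code point is nominally fontsize / kCharWidth
// points wide, so the word is scaled until its code points span the measured
// baseline length. A sketch under that reading, with the hypothetical name
// HorizontalStretchPercent and without the rounding done by prec():

```cpp
#include <cassert>
#include <cmath>

// Percentage by which text must be stretched so that num_codepoints
// fixed-width glyphs cover word_length_pts. char_width_divisor plays the
// role of kCharWidth (nominal glyph width = fontsize / divisor).
double HorizontalStretchPercent(double word_length_pts, int fontsize,
                                int num_codepoints,
                                int char_width_divisor = 2) {
  assert(fontsize > 0 && num_codepoints > 0 && char_width_divisor > 0);
  double nominal_width =
      static_cast<double>(fontsize) * num_codepoints / char_width_divisor;
  return 100.0 * word_length_pts / nominal_width;
}
```

// For example, six code points at 10 pt nominally span 30 pt, so a 60 pt
// word needs a stretch of 200 percent.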
505 
506 bool TessPDFRenderer::BeginDocumentHandler() {
507  char buf[kBasicBufSize];
508  size_t n;
509 
510  n = snprintf(buf, sizeof(buf),
511  "%%PDF-1.5\n"
512  "%%%c%c%c%c\n",
513  0xDE, 0xAD, 0xBE, 0xEB);
514  if (n >= sizeof(buf)) return false;
515  AppendPDFObject(buf);
516 
517  // CATALOG
518  n = snprintf(buf, sizeof(buf),
519  "1 0 obj\n"
520  "<<\n"
521  " /Type /Catalog\n"
522  " /Pages %ld 0 R\n"
523  ">>\n"
524  "endobj\n",
525  2L);
526  if (n >= sizeof(buf)) return false;
527  AppendPDFObject(buf);
528 
529  // We are reserving object #2 for the /Pages
530  // object, which I am going to create and write
531  // at the end of the PDF file.
532  AppendPDFObject("");
533 
534  // TYPE0 FONT
535  n = snprintf(buf, sizeof(buf),
536  "3 0 obj\n"
537  "<<\n"
538  " /BaseFont /GlyphLessFont\n"
539  " /DescendantFonts [ %ld 0 R ]\n"
540  " /Encoding /Identity-H\n"
541  " /Subtype /Type0\n"
542  " /ToUnicode %ld 0 R\n"
543  " /Type /Font\n"
544  ">>\n"
545  "endobj\n",
546  4L, // CIDFontType2 font
547  6L // ToUnicode
548  );
549  if (n >= sizeof(buf)) return false;
550  AppendPDFObject(buf);
551 
552  // CIDFONTTYPE2
553  n = snprintf(buf, sizeof(buf),
554  "4 0 obj\n"
555  "<<\n"
556  " /BaseFont /GlyphLessFont\n"
557  " /CIDToGIDMap %ld 0 R\n"
558  " /CIDSystemInfo\n"
559  " <<\n"
560  " /Ordering (Identity)\n"
561  " /Registry (Adobe)\n"
562  " /Supplement 0\n"
563  " >>\n"
564  " /FontDescriptor %ld 0 R\n"
565  " /Subtype /CIDFontType2\n"
566  " /Type /Font\n"
567  " /DW %d\n"
568  ">>\n"
569  "endobj\n",
570  5L, // CIDToGIDMap
571  7L, // Font descriptor
572  1000 / kCharWidth);
573  if (n >= sizeof(buf)) return false;
574  AppendPDFObject(buf);
575 
576  // CIDTOGIDMAP
577  const int kCIDToGIDMapSize = 2 * (1 << 16);
578  unsigned char *cidtogidmap = new unsigned char[kCIDToGIDMapSize];
579  for (int i = 0; i < kCIDToGIDMapSize; i++) {
580  cidtogidmap[i] = (i % 2) ? 1 : 0;
581  }
582  size_t len;
583  unsigned char *comp =
584  zlibCompress(cidtogidmap, kCIDToGIDMapSize, &len);
585  delete[] cidtogidmap;
586  n = snprintf(buf, sizeof(buf),
587  "5 0 obj\n"
588  "<<\n"
589  " /Length %lu /Filter /FlateDecode\n"
590  ">>\n"
591  "stream\n",
592  (unsigned long)len);
593  if (n >= sizeof(buf)) {
594  lept_free(comp);
595  return false;
596  }
597  AppendString(buf);
598  long objsize = strlen(buf);
599  AppendData(reinterpret_cast<char *>(comp), len);
600  objsize += len;
601  lept_free(comp);
602  const char *endstream_endobj =
603  "endstream\n"
604  "endobj\n";
605  AppendString(endstream_endobj);
606  objsize += strlen(endstream_endobj);
607  AppendPDFObjectDIY(objsize);
608 
609  const char *stream =
610  "/CIDInit /ProcSet findresource begin\n"
611  "12 dict begin\n"
612  "begincmap\n"
613  "/CIDSystemInfo\n"
614  "<<\n"
615  " /Registry (Adobe)\n"
616  " /Ordering (UCS)\n"
617  " /Supplement 0\n"
618  ">> def\n"
619  "/CMapName /Adobe-Identity-UCS def\n"
620  "/CMapType 2 def\n"
621  "1 begincodespacerange\n"
622  "<0000> <FFFF>\n"
623  "endcodespacerange\n"
624  "1 beginbfrange\n"
625  "<0000> <FFFF> <0000>\n"
626  "endbfrange\n"
627  "endcmap\n"
628  "CMapName currentdict /CMap defineresource pop\n"
629  "end\n"
630  "end\n";
631 
632  // TOUNICODE
633  n = snprintf(buf, sizeof(buf),
634  "6 0 obj\n"
635  "<< /Length %lu >>\n"
636  "stream\n"
637  "%s"
638  "endstream\n"
639  "endobj\n", (unsigned long) strlen(stream), stream);
640  if (n >= sizeof(buf)) return false;
641  AppendPDFObject(buf);
642 
643  // FONT DESCRIPTOR
644  n = snprintf(buf, sizeof(buf),
645  "7 0 obj\n"
646  "<<\n"
647  " /Ascent %d\n"
648  " /CapHeight %d\n"
649  " /Descent -1\n" // Spec says must be negative
650  " /Flags 5\n" // FixedPitch + Symbolic
651  " /FontBBox [ 0 0 %d %d ]\n"
652  " /FontFile2 %ld 0 R\n"
653  " /FontName /GlyphLessFont\n"
654  " /ItalicAngle 0\n"
655  " /StemV 80\n"
656  " /Type /FontDescriptor\n"
657  ">>\n"
658  "endobj\n",
659  1000,
660  1000,
661  1000 / kCharWidth,
662  1000,
663  8L // Font data
664  );
665  if (n >= sizeof(buf)) return false;
666  AppendPDFObject(buf);
667 
668  n = snprintf(buf, sizeof(buf), "%s/pdf.ttf", datadir_);
669  if (n >= sizeof(buf)) return false;
670  FILE *fp = fopen(buf, "rb");
671  if (!fp) {
672  tprintf("Can not open file \"%s\"!\n", buf);
673  return false;
674  }
675  fseek(fp, 0, SEEK_END);
676  long int size = ftell(fp);
677  fseek(fp, 0, SEEK_SET);
678  char *buffer = new char[size];
679  if (fread(buffer, 1, size, fp) != size) {
680  fclose(fp);
681  delete[] buffer;
682  return false;
683  }
684  fclose(fp);
685  // FONTFILE2
686  n = snprintf(buf, sizeof(buf),
687  "8 0 obj\n"
688  "<<\n"
689  " /Length %ld\n"
690  " /Length1 %ld\n"
691  ">>\n"
692  "stream\n", size, size);
693  if (n >= sizeof(buf)) {
694  delete[] buffer;
695  return false;
696  }
697  AppendString(buf);
698  objsize = strlen(buf);
699  AppendData(buffer, size);
700  delete[] buffer;
701  objsize += size;
702  AppendString(endstream_endobj);
703  objsize += strlen(endstream_endobj);
704  AppendPDFObjectDIY(objsize);
705  return true;
706 }
707 
708 bool TessPDFRenderer::imageToPDFObj(Pix *pix,
709  char *filename,
710  long int objnum,
711  char **pdf_object,
712  long int *pdf_object_size) {
713  size_t n;
714  char b0[kBasicBufSize];
715  char b1[kBasicBufSize];
716  char b2[kBasicBufSize];
717  if (!pdf_object_size || !pdf_object)
718  return false;
719  *pdf_object = NULL;
720  *pdf_object_size = 0;
721  if (!filename)
722  return false;
723 
724  L_Compressed_Data *cid = NULL;
725  const int kJpegQuality = 85;
726 
727  int format, sad;
728  findFileFormat(filename, &format);
729  if (pixGetSpp(pix) == 4 && format == IFF_PNG) {
730  Pix *p1 = pixAlphaBlendUniform(pix, 0xffffff00);
731  sad = pixGenerateCIData(p1, L_FLATE_ENCODE, 0, 0, &cid);
732  pixDestroy(&p1);
733  } else {
734  sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);
735  }
736 
737  if (sad || !cid) {
738  l_CIDataDestroy(&cid);
739  return false;
740  }
741 
742  const char *group4 = "";
743  const char *filter;
744  switch(cid->type) {
745  case L_FLATE_ENCODE:
746  filter = "/FlateDecode";
747  break;
748  case L_JPEG_ENCODE:
749  filter = "/DCTDecode";
750  break;
751  case L_G4_ENCODE:
752  filter = "/CCITTFaxDecode";
753  group4 = " /K -1\n";
754  break;
755  case L_JP2K_ENCODE:
756  filter = "/JPXDecode";
757  break;
758  default:
759  l_CIDataDestroy(&cid);
760  return false;
761  }
762 
763  // Maybe someday we will accept RGBA but today is not that day.
764  // It requires creating an /SMask for the alpha channel.
765  // http://stackoverflow.com/questions/14220221
766  const char *colorspace;
767  if (cid->ncolors > 0) {
768  n = snprintf(b0, sizeof(b0),
769  " /ColorSpace [ /Indexed /DeviceRGB %d %s ]\n",
770  cid->ncolors - 1, cid->cmapdatahex);
771  if (n >= sizeof(b0)) {
772  l_CIDataDestroy(&cid);
773  return false;
774  }
775  colorspace = b0;
776  } else {
777  switch (cid->spp) {
778  case 1:
779  colorspace = " /ColorSpace /DeviceGray\n";
780  break;
781  case 3:
782  colorspace = " /ColorSpace /DeviceRGB\n";
783  break;
784  default:
785  l_CIDataDestroy(&cid);
786  return false;
787  }
788  }
789 
790  int predictor = (cid->predictor) ? 14 : 1;
791 
792  // IMAGE
793  n = snprintf(b1, sizeof(b1),
794  "%ld 0 obj\n"
795  "<<\n"
796  " /Length %lu\n"
797  " /Subtype /Image\n",
798  objnum, (unsigned long) cid->nbytescomp);
799  if (n >= sizeof(b1)) {
800  l_CIDataDestroy(&cid);
801  return false;
802  }
803 
804  n = snprintf(b2, sizeof(b2),
805  " /Width %d\n"
806  " /Height %d\n"
807  " /BitsPerComponent %d\n"
808  " /Filter %s\n"
809  " /DecodeParms\n"
810  " <<\n"
811  " /Predictor %d\n"
812  " /Colors %d\n"
813  "%s"
814  " /Columns %d\n"
815  " /BitsPerComponent %d\n"
816  " >>\n"
817  ">>\n"
818  "stream\n",
819  cid->w, cid->h, cid->bps, filter, predictor, cid->spp,
820  group4, cid->w, cid->bps);
821  if (n >= sizeof(b2)) {
822  l_CIDataDestroy(&cid);
823  return false;
824  }
825 
826  const char *b3 =
827  "endstream\n"
828  "endobj\n";
829 
830  size_t b1_len = strlen(b1);
831  size_t b2_len = strlen(b2);
832  size_t b3_len = strlen(b3);
833  size_t colorspace_len = strlen(colorspace);
834 
835  *pdf_object_size =
836  b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
837  *pdf_object = new char[*pdf_object_size];
838 
839  char *p = *pdf_object;
840  memcpy(p, b1, b1_len);
841  p += b1_len;
842  memcpy(p, colorspace, colorspace_len);
843  p += colorspace_len;
844  memcpy(p, b2, b2_len);
845  p += b2_len;
846  memcpy(p, cid->datacomp, cid->nbytescomp);
847  p += cid->nbytescomp;
848  memcpy(p, b3, b3_len);
849  l_CIDataDestroy(&cid);
850  return true;
851 }
852 
853 bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
854  size_t n;
855  char buf[kBasicBufSize];
856  char buf2[kBasicBufSize];
857  Pix *pix = api->GetInputImage();
858  char *filename = (char *)api->GetInputName();
859  int ppi = api->GetSourceYResolution();
860  if (!pix || ppi <= 0)
861  return false;
862  double width = pixGetWidth(pix) * 72.0 / ppi;
863  double height = pixGetHeight(pix) * 72.0 / ppi;
864 
865  snprintf(buf2, sizeof(buf2), "/XObject << /Im1 %ld 0 R >>\n", obj_ + 2);
866  const char *xobject = (textonly_) ? "" : buf2;
867 
868  // PAGE
869  n = snprintf(buf, sizeof(buf),
870  "%ld 0 obj\n"
871  "<<\n"
872  " /Type /Page\n"
873  " /Parent %ld 0 R\n"
874  " /MediaBox [0 0 %.2f %.2f]\n"
875  " /Contents %ld 0 R\n"
876  " /Resources\n"
877  " <<\n"
878  " %s"
879  " /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
880  " /Font << /f-0-0 %ld 0 R >>\n"
881  " >>\n"
882  ">>\n"
883  "endobj\n",
884  obj_,
885  2L, // Pages object
886  width, height,
887  obj_ + 1, // Contents object
888  xobject, // Image object
889  3L); // Type0 Font
890  if (n >= sizeof(buf)) return false;
891  pages_.push_back(obj_);
892  AppendPDFObject(buf);
893 
894  // CONTENTS
895  char* pdftext = GetPDFTextObjects(api, width, height);
896  long pdftext_len = strlen(pdftext);
897  unsigned char *pdftext_casted = reinterpret_cast<unsigned char *>(pdftext);
898  size_t len;
899  unsigned char *comp_pdftext =
900  zlibCompress(pdftext_casted, pdftext_len, &len);
901  long comp_pdftext_len = len;
902  n = snprintf(buf, sizeof(buf),
903  "%ld 0 obj\n"
904  "<<\n"
905  " /Length %ld /Filter /FlateDecode\n"
906  ">>\n"
907  "stream\n", obj_, comp_pdftext_len);
908  if (n >= sizeof(buf)) {
909  delete[] pdftext;
910  lept_free(comp_pdftext);
911  return false;
912  }
913  AppendString(buf);
914  long objsize = strlen(buf);
915  AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
916  objsize += comp_pdftext_len;
917  lept_free(comp_pdftext);
918  delete[] pdftext;
919  const char *b2 =
920  "endstream\n"
921  "endobj\n";
922  AppendString(b2);
923  objsize += strlen(b2);
924  AppendPDFObjectDIY(objsize);
925 
926  if (!textonly_) {
927  char *pdf_object = NULL;
928  if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
929  return false;
930  }
931  AppendData(pdf_object, objsize);
932  AppendPDFObjectDIY(objsize);
933  delete[] pdf_object;
934  }
935  return true;
936 }
937 
938 
939 bool TessPDFRenderer::EndDocumentHandler() {
940  size_t n;
941  char buf[kBasicBufSize];
942 
943  // We reserved the /Pages object number early, so that the /Page
944  // objects could refer to their parent. We finally have enough
945  // information to go fill it in. Using lower level calls to manipulate
946  // the offset record in two spots, because we are placing objects
947  // out of order in the file.
948 
949  // PAGES
950  const long int kPagesObjectNumber = 2;
951  offsets_[kPagesObjectNumber] = offsets_.back(); // manipulation #1
952  n = snprintf(buf, sizeof(buf),
953  "%ld 0 obj\n"
954  "<<\n"
955  " /Type /Pages\n"
956  " /Kids [ ", kPagesObjectNumber);
957  if (n >= sizeof(buf)) return false;
958  AppendString(buf);
959  size_t pages_objsize = strlen(buf);
960  for (size_t i = 0; i < pages_.size(); i++) {
961  n = snprintf(buf, sizeof(buf),
962  "%ld 0 R ", pages_[i]);
963  if (n >= sizeof(buf)) return false;
964  AppendString(buf);
965  pages_objsize += strlen(buf);
966  }
967  n = snprintf(buf, sizeof(buf),
968  "]\n"
969  " /Count %d\n"
970  ">>\n"
971  "endobj\n", pages_.size());
972  if (n >= sizeof(buf)) return false;
973  AppendString(buf);
974  pages_objsize += strlen(buf);
975  offsets_.back() += pages_objsize; // manipulation #2
976 
977  // INFO
978  STRING utf16_title = "FEFF"; // byte_order_marker
979  GenericVector<int> unicodes;
980  UNICHAR::UTF8ToUnicode(title(), &unicodes);
981  char utf16[kMaxBytesPerCodepoint];
982  for (int i = 0; i < unicodes.length(); i++) {
983  int code = unicodes[i];
984  if (CodepointToUtf16be(code, utf16)) {
985  utf16_title += utf16;
986  }
987  }
988 
989  char* datestr = l_getFormattedDate();
990  n = snprintf(buf, sizeof(buf),
991  "%ld 0 obj\n"
992  "<<\n"
993  " /Producer (Tesseract %s)\n"
994  " /CreationDate (D:%s)\n"
995  " /Title <%s>\n"
996  ">>\n"
997  "endobj\n",
998  obj_, TESSERACT_VERSION_STR, datestr, utf16_title.c_str());
999  lept_free(datestr);
1000  if (n >= sizeof(buf)) return false;
1001  AppendPDFObject(buf);
1002  n = snprintf(buf, sizeof(buf),
1003  "xref\n"
1004  "0 %ld\n"
1005  "0000000000 65535 f \n", obj_);
1006  if (n >= sizeof(buf)) return false;
1007  AppendString(buf);
1008  for (int i = 1; i < obj_; i++) {
1009  n = snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offsets_[i]);
1010  if (n >= sizeof(buf)) return false;
1011  AppendString(buf);
1012  }
1013  n = snprintf(buf, sizeof(buf),
1014  "trailer\n"
1015  "<<\n"
1016  " /Size %ld\n"
1017  " /Root %ld 0 R\n"
1018  " /Info %ld 0 R\n"
1019  ">>\n"
1020  "startxref\n"
1021  "%ld\n"
1022  "%%%%EOF\n",
1023  obj_,
1024  1L, // catalog
1025  obj_ - 1, // info
1026  offsets_.back());
1027  if (n >= sizeof(buf)) return false;
1028  AppendString(buf);
1029  return true;
1030 }
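// Each in-use entry of the cross-reference table written above is a fixed
// 20-byte record: a 10-digit byte offset, a 5-digit generation number, the
// letter 'n', and a two-character end of line. A minimal sketch of that
// formatting; XrefEntry is a hypothetical helper, not part of this file:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Formats one in-use xref entry for an object starting at the given offset.
std::string XrefEntry(long offset) {
  assert(offset >= 0);
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offset);
  return buf;
}
```

// The trailing space before the newline is what pads the record to exactly
// 20 bytes.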
1031 } // namespace tesseract