# Characters

This notebook shows the characters that occur in the text extracted from the Lakhnawi PDF.

Reference: [lakhnawi](https://among.github.io/fusus/fusus/lakhnawi.html).

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from fusus.lakhnawi import Lakhnawi

# Check character use

We start by extracting textual data from all pages.

But first we comment out the rules that deal with the private use characters
e840, e864, e888, and e8df, which are all kasras.

We do this in the variable `REPLACE_DEF` in
[lakhnawi.py](https://github.com/among/fusus/blob/0c25a217b00938b46ac6b65b17411e76819af9c9/fusus/lakhnawi.py#L432-L435):

```
# e840                => 0650           : KASRA
# e864                => 0650           : KASRA
# e888                => 0650           : KASRA
# e8df                => 0650           : KASRA
```
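
To make the rule notation concrete, here is a minimal sketch of how lines in this style could be read. It is not the parser used by `fusus`: the function name `parse_rules` is hypothetical, and it assumes each rule is a run of whitespace-separated hexadecimal code points, then `=>`, then the replacement code points, then `:` and a free-text name. Lines starting with `#` are skipped, which is the effect of commenting out a rule.

```python
def parse_rules(text):
    """Parse lines like `e840 => 0650 : KASRA` into (source, target, name) tuples.

    Hypothetical helper; assumes hex code points separated by whitespace.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out rule
        mapping, _, name = line.partition(":")
        source, _, target = mapping.partition("=>")
        src = "".join(chr(int(h, 16)) for h in source.split())
        tgt = "".join(chr(int(h, 16)) for h in target.split())
        rules.append((src, tgt, name.strip()))
    return rules


sample = """
# e840                => 0650           : KASRA
e864                => 0650           : KASRA
"""
rules = parse_rules(sample)
```

In this sample only the e864 rule is active; the commented-out e840 rule is ignored, so e840 would remain in the extracted text.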

In [3]:
Lw = Lakhnawi()
Lw.setStyle()
testPages = None

In [4]:
Lw.getPages(testPages)

	 438            

We ask for the private use characters that have not been treated.

In [5]:
Lw.showUsedChars(testPages, onlyPuas=True, onlyPresentational=False, long=False, byOcc=True)

| character | occurrences | example page |
| --- | --- | --- |
| e864  ?? | 14751 on 432 pages | e.g. page 336 with 83 occurrences |
| e888  ?? | 1450 on 364 pages | e.g. page 247 with 16 occurrences |
| e8df  ?? | 799 on 299 pages | e.g. page 52 with 11 occurrences |
| e840  ?? | 230 on 143 pages | e.g. page 172 with 8 occurrences |


We see that four private use characters have not been resolved, and for each of them
we are pointed to the page with the largest number of occurrences.
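
A report of this kind can be approximated with the standard library. The sketch below is not the implementation of `showUsedChars`: the function `pua_report` and the `pages` dictionary (page number to extracted text) are assumptions made for illustration. It treats every character with Unicode general category `Co` as a private use character and records the total count, the number of pages, and the page with the most occurrences.

```python
from collections import Counter
import unicodedata


def pua_report(pages):
    """Summarise private use characters in `pages` (page number -> text).

    Hypothetical helper; returns tuples of
    (hex code, total occurrences, number of pages, top page, count on top page),
    ordered by total occurrences.
    """
    total = Counter()
    pages_with = Counter()
    best = {}
    for page_no, text in pages.items():
        on_page = Counter(c for c in text if unicodedata.category(c) == "Co")
        for c, n in on_page.items():
            total[c] += n
            pages_with[c] += 1
            if n > best.get(c, (0, 0))[1]:
                best[c] = (page_no, n)
    return [
        (f"{ord(c):04x}", n, pages_with[c], *best[c])
        for c, n in total.most_common()
    ]


pages = {1: "a\ue864b\ue864", 2: "\ue864\ue888", 3: "plain"}
report = pua_report(pages)
```

An empty report then means that no private use characters are left in the extracted text.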

Now we restore the rules (we delete the comment signs in the code) and run the text extraction again.

```
e840                => 0650           : KASRA
e864                => 0650           : KASRA
e888                => 0650           : KASRA
e8df                => 0650           : KASRA
```

In [4]:
Lw.getPages(testPages, refreshConfig=True)

	 438            

We check again:

In [5]:
Lw.showUsedChars(testPages, onlyPuas=True, onlyPresentational=False, long=False, byOcc=True)

There is no output: all private use characters have now been treated.

Now we want a complete overview of all characters in the corpus.

In [6]:
Lw.showUsedChars(testPages, onlyPuas=False, onlyPresentational=False, long=False, byOcc=False)

| character | occurrences | example page |
| --- | --- | --- |
| 0020  SPACE | 61198 on 438 pages | e.g. page 435 with 444 occurrences |
| 0021  !  EXCLAMATION MARK | 8 on 8 pages | e.g. page 41 with 1 occurrences |
| 0028  (  LEFT PARENTHESIS | 21 on 9 pages | e.g. page 428 with 5 occurrences |
| 0029  )  RIGHT PARENTHESIS | 23 on 11 pages | e.g. page 428 with 5 occurrences |
| 002c  ,  COMMA | 9 on 1 page | e.g. page 430 with 9 occurrences |
| 002d  -  HYPHEN-MINUS | 2 on 1 page | e.g. page 430 with 2 occurrences |
| 002e  .  FULL STOP | 2707 on 414 pages | e.g. page 432 with 22 occurrences |
| 002f  /  SOLIDUS | 7 on 2 pages | e.g. page 426 with 4 occurrences |
| 003a  :  COLON | 1508 on 361 pages | e.g. page 435 with 21 occurrences |
| 003d  =  EQUALS SIGN | 1 on 1 page | e.g. page 433 with 1 occurrences |


# Rule application

Here is an overview of which rules have been applied and how many times.

In [7]:
Lw.showReplacements()

| rule | applications | example page | source |  | target |
| --- | --- | --- | --- | --- | --- |
| rule 1 | 72199 x applied on 417 pages | e.g. page 435 with 697 applications | e821  ?? | ⇒ |  |
| rule 2 | 22833 x applied on 433 pages | e.g. page 52 with 112 applications | e825  ?? | ⇒ | 064e  َ  ARABIC FATHA |
| rule 14 | 14751 x applied on 432 pages | e.g. page 336 with 83 applications | e864  ?? | ⇒ | 0650  ِ  ARABIC KASRA |
| rule 23 | 8121 x applied on 427 pages | e.g. page 398 with 74 applications | e828  ?? | ⇒ | 0652  ْ  ARABIC SUKUN |
| rule 9 | 7419 x applied on 430 pages | e.g. page 357 with 45 applications | e826  ?? | ⇒ | 064f  ُ  ARABIC DAMMA |
| rule 6 | 6707 x applied on 426 pages | e.g. page 197 with 43 applications | e8e8  ?? | ⇒ | 064e  َ  ARABIC FATHA |
| rule 10 | 4099 x applied on 421 pages | e.g. page 355 with 31 applications | e8e9  ?? | ⇒ | 064f  ُ  ARABIC DAMMA |
| rule 29 | 2419 x applied on 413 pages | e.g. page 30 with 34 applications | e830  ?? | ⇒ | 064e  َ  ARABIC FATHA  0651  ّ  ARABIC SHADDA |
| rule 67 | 2385 x applied on 399 pages | e.g. page 295 with 20 applications | e845  ?? | ⇒ | 0655  ٕ  ARABIC HAMZA BELOW  0650  ِ  ARABIC KASRA |
| rule 50 | 2376 x applied on 386 pages | e.g. page 38 with 24 applications | 0627  ا  ARABIC LETTER ALEF  e815  ?? | ⇒ | 0623  أ  ARABIC LETTER ALEF WITH HAMZA ABOVE  064e  َ  ARABIC FATHA |
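
The bookkeeping behind such an overview can be sketched as follows. This is an illustration only, not the code behind `showReplacements`: the function `apply_rules` is hypothetical, rules are applied one after another in list order (the real order and matching conditions may differ), and the empty target of rule 1 is modelled as a deletion, which is an assumption.

```python
from collections import Counter


def apply_rules(text, rules):
    """Apply (source, target) rules in order and count applications per rule.

    Hypothetical helper; rule numbers are 1-based positions in `rules`.
    """
    applied = Counter()
    for rule_no, (source, target) in enumerate(rules, start=1):
        n = text.count(source)  # non-overlapping matches, as in str.replace
        if n:
            applied[rule_no] += n
            text = text.replace(source, target)
    return text, applied


# Three rules modelled on the table: a deletion, a single-character
# replacement, and a two-character source sequence.
rules = [
    ("\ue821", ""),
    ("\ue825", "\u064e"),
    ("\u0627\ue815", "\u0623\u064e"),
]
text = "\u0627\ue815\ue821\u0628\ue825\ue825"
result, applied = apply_rules(text, rules)
```

Keeping one counter per page as well would yield the "on N pages" and "e.g. page P" columns.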


# Original characters

Here are the characters that are literally stored in the PDF, before any transformation.
All the private-use characters show up, and also a lot of pre-composed characters/special forms.

Note that there are no explicit spaces.

In [8]:
Lw.showUsedChars(testPages, orig=True, onlyPuas=False, onlyPresentational=False, long=False, byOcc=False)

| character | occurrences | example page |
| --- | --- | --- |
| 0021  !  EXCLAMATION MARK | 8 on 8 pages | e.g. page 41 with 1 occurrences |
| 0028  (  LEFT PARENTHESIS | 1213 on 371 pages | e.g. page 217 with 11 occurrences |
| 0029  )  RIGHT PARENTHESIS | 1211 on 371 pages | e.g. page 217 with 11 occurrences |
| 002c  ,  COMMA | 9 on 1 page | e.g. page 430 with 9 occurrences |
| 002d  -  HYPHEN-MINUS | 2 on 1 page | e.g. page 430 with 2 occurrences |
| 002e  .  FULL STOP | 2707 on 414 pages | e.g. page 432 with 22 occurrences |
| 002f  /  SOLIDUS | 7 on 2 pages | e.g. page 426 with 4 occurrences |
| 0030  0  DIGIT ZERO | 128 on 91 pages | e.g. page 381 with 5 occurrences |
| 0031  1  DIGIT ONE | 243 on 129 pages | e.g. page 70 with 7 occurrences |
| 0032  2  DIGIT TWO | 242 on 134 pages | e.g. page 139 with 6 occurrences |


# Final characters

Here is the subset of final-form characters that cause a space to be inserted after them,
with statistics of how often they have been encountered.

In [9]:
Lw.showFinals()

| applications | example page | character |
| --- | --- | --- |
| 5006 x applied on 427 pages | e.g. page 431 with 40 applications | feea  ﻪ  ARABIC LETTER HEH FINAL FORM |
| 2852 x applied on 426 pages | e.g. page 432 with 28 applications | fef2  ﻲ  ARABIC LETTER YEH FINAL FORM |
| 2818 x applied on 419 pages | e.g. page 434 with 51 applications | fee6  ﻦ  ARABIC LETTER NOON FINAL FORM |
| 2696 x applied on 405 pages | e.g. page 433 with 30 applications | fee2  ﻢ  ARABIC LETTER MEEM FINAL FORM |
| 2304 x applied on 412 pages | e.g. page 4 with 62 applications | fe94  ﺔ  ARABIC LETTER TEH MARBUTA FINAL FORM |
| 1574 x applied on 395 pages | e.g. page 343 with 15 applications | fc90  ﲐ  ARABIC LIGATURE ALEF MAKSURA WITH SUPERSCRIPT ALEF FINAL FORM |
| 1428 x applied on 389 pages | e.g. page 112 with 14 applications | fede  ﻞ  ARABIC LETTER LAM FINAL FORM |
| 946 x applied on 340 pages | e.g. page 90 with 21 applications | feda  ﻚ  ARABIC LETTER KAF FINAL FORM |
| 788 x applied on 303 pages | e.g. page 162 with 14 applications | fed6  ﻖ  ARABIC LETTER QAF FINAL FORM |
| 751 x applied on 295 pages | e.g. page 435 with 17 applications | fe96  ﺖ  ARABIC LETTER TEH FINAL FORM |
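
Since the PDF stores no explicit spaces, a final form is a useful signal for a word boundary. The following sketch shows one way such a rule could work; it is not the logic used by `fusus`. The function `space_after_finals` and the three-character `FINALS` set are hypothetical, and letting diacritics (category `Mn`) stay attached to the final letter before the space is inserted is an assumption of this sketch.

```python
import unicodedata

# Small subset of the table above: HEH, YEH and NOON final forms.
FINALS = {"\ufeea", "\ufef2", "\ufee6"}


def space_after_finals(text, finals=FINALS):
    """Insert a space after each final-form character (hypothetical helper).

    Combining marks that follow the final letter are kept with it;
    no space is added at the end of the text or before existing whitespace.
    """
    out = []
    pending = False  # True once a final form has been seen and no space added yet
    for c in text:
        if pending and unicodedata.category(c) != "Mn":
            if not c.isspace():
                out.append(" ")
            pending = False
        out.append(c)
        if c in finals:
            pending = True
    return "".join(out)


# MEEM initial, NOON final, FATHA, BEH initial, HEH final
spaced = space_after_finals("\ufee3\ufee6\u064e\ufe91\ufeea")
```

Here the space lands after the fatha that follows the final noon, and nothing is appended after the closing final heh.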


# Double characters

Here are the characters that have a double Unicode assignment in one of the fonts.
Without action, they would cause two characters to be extracted instead of one.

They are listed with statistics of how often they have been encountered.

In [10]:
Lw.showDoubles()

| applications | example page | mapping |
| --- | --- | --- |
| 27190 x applied on 438 pages | e.g. page 393 with 107 applications | fe8d  ﺍ  ARABIC LETTER ALEF ISOLATED FORM  ⇒ 0627  ا  ARABIC LETTER ALEF |
| 22536 x applied on 435 pages | e.g. page 24 with 117 applications | fe76  ﹶ  ARABIC FATHA ISOLATED FORM  ⇒ 064e  َ  ARABIC FATHA |
| 9939 x applied on 437 pages | e.g. page 71 with 67 applications | feed  ﻭ  ARABIC LETTER WAW ISOLATED FORM  ⇒ 0648  و  ARABIC LETTER WAW |
| 7347 x applied on 424 pages | e.g. page 250 with 42 applications | fe7a  ﹺ  ARABIC KASRA ISOLATED FORM  ⇒ 0650  ِ  ARABIC KASRA |
| 5281 x applied on 424 pages | e.g. page 311 with 37 applications | fe7e  ﹾ  ARABIC SUKUN ISOLATED FORM  ⇒ 0652  ْ  ARABIC SUKUN |
| 4935 x applied on 436 pages | e.g. page 383 with 43 applications | fead  ﺭ  ARABIC LETTER REH ISOLATED FORM  ⇒ 0631  ر  ARABIC LETTER REH |
| 4652 x applied on 422 pages | e.g. page 398 with 31 applications | fe78  ﹸ  ARABIC DAMMA ISOLATED FORM  ⇒ 064f  ُ  ARABIC DAMMA |
| 3505 x applied on 432 pages | e.g. page 217 with 34 applications | fee5  ﻥ  ARABIC LETTER NOON ISOLATED FORM  ⇒ 0646  ن  ARABIC LETTER NOON |
| 2792 x applied on 427 pages | e.g. page 402 with 38 applications | 06f1  ۱  EXTENDED ARABIC-INDIC DIGIT ONE  ⇒ 0661  ١  ARABIC-INDIC DIGIT ONE |
| 2563 x applied on 424 pages | e.g. page 128 with 24 applications | fea9  ﺩ  ARABIC LETTER DAL ISOLATED FORM  ⇒ 062f  د  ARABIC LETTER DAL |
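
One way to picture the correction is the sketch below. It rests on an assumption that the source does not spell out: that a double shows up in the raw extraction as the first code point immediately followed by the second, and that only the second should be kept. The function `collapse_doubles` and the three-entry `DOUBLES` table are hypothetical and are not taken from `fusus`.

```python
from collections import Counter

# Subset of the table above: first code point -> code point that is kept.
DOUBLES = {
    "\ufe8d": "\u0627",  # ALEF ISOLATED FORM -> ALEF
    "\ufe76": "\u064e",  # FATHA ISOLATED FORM -> FATHA
    "\u06f1": "\u0661",  # EXTENDED ARABIC-INDIC DIGIT ONE -> ARABIC-INDIC DIGIT ONE
}


def collapse_doubles(text, doubles=DOUBLES):
    """Drop the first half of each doubled pair and count how often that happens.

    Hypothetical helper; a character from `doubles` that is not followed
    by its partner is left untouched.
    """
    out = []
    applied = Counter()
    i = 0
    while i < len(text):
        c = text[i]
        if c in doubles and text[i + 1 : i + 2] == doubles[c]:
            applied[c] += 1
            i += 1  # skip the first half; the partner is appended next round
            continue
        out.append(c)
        i += 1
    return "".join(out), applied


collapsed, counts = collapse_doubles("\ufe8d\u0627\u0644\ufe76\u064e")
```

The counter plays the role of the "x applied" column; tallying it per page would give the page statistics.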
