# Cosine Similarity - Approach B

- Loads embeddings (pre-computed from the texts)
- Computes the **mean embedding** of each embeddings array
- Computes a **difference value** from the embedding values (sum of absolute element-wise differences)
- Does NOT use cosine similarity in its current state
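
The title names cosine similarity, although the cells below do not compute it. As a minimal sketch of how it could be computed with numpy, assuming two 1-D vectors such as a single embedding and a mean embedding: the helper `cosine_similarity` and the toy vectors are hypothetical and not part of this notebook.

```python
import numpy

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    a = numpy.asarray(a, dtype=float)
    b = numpy.asarray(b, dtype=float)
    norm_product = numpy.linalg.norm(a) * numpy.linalg.norm(b)
    if norm_product == 0.0:
        # The angle to a zero vector is undefined
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(numpy.dot(a, b) / norm_product)

# Toy vectors standing in for a single embedding and a mean embedding
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction
```

Unlike the difference value used below, this measure ignores vector length and only compares direction, so the two rankings need not agree.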

In [1]:
# Configuration

# https://hobbitdata.informatik.uni-leipzig.de/EML4U/2021-02-10-Wikipedia-Texts/
source_texts_directory = "/home/eml4u/EML4U/data/corpus/2021-02-10-wikipedia-texts/"
# https://hobbitdata.informatik.uni-leipzig.de/EML4U/2021-04-07-Wikipedia-Embeddings/
embeddings_directory = "/home/eml4u/EML4U/data/wikipedia-embeddings/"

# points of time
id_a = "20100408"
id_b = "20201101"
# category ids
id_american = "american-films"
id_british = "british-films"
id_indian = "indian-films"
# file ids
id_american_a = id_a + "-" + id_american
id_american_b = id_b + "-" + id_american
id_british_a = id_a + "-" + id_british
id_british_b = id_b + "-" + id_british
id_indian_a = id_a + "-" + id_indian
id_indian_b = id_b + "-" + id_indian

In [2]:
# Imports

import numpy
print("numpy: " + numpy.version.version)

# Class instance to access data (wp texts, pre-computed embeddings)
import data_access
data_accessor = data_access.DataAccess(source_texts_directory, embeddings_directory)

numpy: 1.19.2


In [3]:
# Load embeddings
embeddings_british_a = data_accessor.load_embeddings(id_british_a)
embeddings_british_b = data_accessor.load_embeddings(id_british_b)
print()

# Compute means
def get_mean(embeddings, note = "", printinfo = True):
    mean = numpy.mean(embeddings, axis=0)
    if printinfo:
        print(str(type(mean)) + " " + str(mean.shape) + " " + note)
    return mean

mean_british_a = get_mean(embeddings_british_a, "BritishA")
mean_british_b = get_mean(embeddings_british_b, "BritishB")

/home/eml4u/EML4U/data/wikipedia-embeddings/20100408-british-films.txt
(2147, 768) <class 'numpy.ndarray'>
/home/eml4u/EML4U/data/wikipedia-embeddings/20201101-british-films.txt
(2147, 768) <class 'numpy.ndarray'>

<class 'numpy.ndarray'> (768,) BritishA
<class 'numpy.ndarray'> (768,) BritishB
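
The shapes above show that `numpy.mean(..., axis=0)` collapses the text axis and keeps one value per embedding dimension. A small sketch with a toy array (3 texts, 4 dimensions; the values are made up for illustration) shows the same effect:

```python
import numpy

# Toy stand-in for an embeddings array: 3 texts, 4 dimensions
toy_embeddings = numpy.array([
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
])

# axis=0 averages over the rows (texts), one result per column (dimension)
toy_mean = numpy.mean(toy_embeddings, axis=0)
print(toy_mean.shape)
print(toy_mean)
```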


In [4]:
# Differences of arrays as one value (sum of absolute element-wise differences)
def differenceValue(a, b):
    x = 0
    for i in range(len(a)):
        x += abs(a[i] - b[i])
    return x
# Test
if False:
    print(differenceValue(numpy.array([1,2,3,4]), numpy.array([1.1,2.2,3.3,4.4])))

# Print source texts
def print_source_text(directory, category_id, index):
    print()
    print("Category: " + category_id)
    print("Index: " + str(index))
    file = data_accessor.get_embeddings_dict_filename(category_id, index)
    print("File: ")
    print(data_accessor.read_source_text(directory, file))
    print()
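
The loop in `differenceValue` sums the absolute element-wise differences, which is the L1 (Manhattan) distance. A possible vectorized equivalent is sketched here; the name `difference_value_vectorized` is hypothetical and the test vectors are the ones from the disabled test above.

```python
import numpy

def difference_value_vectorized(a, b):
    """Sum of absolute element-wise differences (L1 distance), without a Python loop."""
    return float(numpy.abs(numpy.asarray(a) - numpy.asarray(b)).sum())

a = numpy.array([1, 2, 3, 4])
b = numpy.array([1.1, 2.2, 3.3, 4.4])

# Same computation as the explicit loop, for comparison
loop_result = sum(abs(a[i] - b[i]) for i in range(len(a)))
print(difference_value_vectorized(a, b), loop_result)
```

Both values should agree up to floating-point rounding; for 768-dimensional embeddings the vectorized form avoids 768 Python-level iterations per call.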

In [5]:
# Compute difference values between embeddings of single texts and mean-embeddings
# Tuple:
# [0] index (to look up source texts)
# [1] difference to mean t1 (== A)
# [2] difference to mean t2 (== B)
# Note: len(mean_british_a) is the embedding dimension (768), not the number of
# texts (2147), so only the texts with index 0..767 are compared here.
differences = []
for i in range(len(mean_british_a)):
    differences.append((i, differenceValue(mean_british_a, embeddings_british_a[i]), differenceValue(mean_british_b, embeddings_british_b[i])))

# Sort by largest difference
differences_a = sorted(differences, key=lambda tup: tup[1], reverse=True)
differences_b = sorted(differences, key=lambda tup: tup[2], reverse=True)

print("Largest difference values to mean of A")
print(differences_a[0])
print(differences_a[1])
print(differences_a[2])
print("...")
print(differences_a[len(differences_a)-2])
print(differences_a[len(differences_a)-1])
print()

print("Largest difference values to mean of B")
print(differences_b[0])
print(differences_b[1])
print(differences_b[2])
print("...")
print(differences_b[len(differences_b)-2])
print(differences_b[len(differences_b)-1])

Largest difference values to mean of A
(721, 127.63656949018497, 63.07215986661895)
(333, 126.23330120916125, 126.63447396310393)
(680, 122.11133584967376, 129.0560627009414)
...
(391, 40.37050334666524, 45.36463767773769)
(393, 39.05389511333612, 56.23254257812948)

Largest difference values to mean of B
(179, 114.6978323360855, 131.07256570494872)
(680, 122.11133584967376, 129.0560627009414)
(381, 88.82879530042067, 127.66585960996093)
...
(334, 51.76971304423001, 39.533962588139545)
(309, 56.43851055612732, 39.04979437720267)
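
The difference of every text to the mean can also be computed for all rows at once through broadcasting. The following is only a sketch on random toy data (10 texts, 4 dimensions), not the notebook's embeddings; it iterates over the number of rows rather than over the embedding dimension.

```python
import numpy

# Random toy data standing in for an embeddings array (10 texts, 4 dimensions)
rng = numpy.random.default_rng(0)
toy_embeddings = rng.normal(size=(10, 4))
toy_mean = toy_embeddings.mean(axis=0)

# Broadcasting subtracts the mean from every row; axis=1 sums per text
distances_to_mean = numpy.abs(toy_embeddings - toy_mean).sum(axis=1)

# Indices of the texts, largest difference first
order = numpy.argsort(distances_to_mean)[::-1]
for index in order[:3]:
    print(index, distances_to_mean[index])
```

`order[0]` then plays the role of `differences_a[0][0]`, the index used to look up the source text.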


In [6]:
# Explore underlying texts (Largest difference values to mean of A)
if True:
    print_source_text(id_british_a, id_british, differences_a[0][0])
    print_source_text(id_british_b, id_british, differences_a[0][0])
    #print_source_text(id_british_a, id_british, differences_a[1][0])
    #print_source_text(id_british_b, id_british, differences_a[1][0])


Category: british-films
Index: 721
File: 
/home/eml4u/EML4U/data/corpus/2021-02-10-wikipedia-texts/20100408-british-films/Bloody_Kids.txt
Bloody Kids is a 1979 film directed by Stephen Frears.
External links
- 
Category:1979 films Category:British films Category:Films directed by
Stephen Frears



Category: british-films
Index: 721
File: 
/home/eml4u/EML4U/data/corpus/2021-02-10-wikipedia-texts/20201101-british-films/Bloody_Kids.txt
Bloody Kids is a British television film written by Stephen Poliakoff
and directed by Stephen Frears, made by Black Lion Films for ATV, and
first shown on ITV on 22 March 1980.
Cast
- Derrick O'Connor as Detective Ritchie (Richard Beckinsale originally
 cast before his sudden death)
- Gary Holton as Ken
- Richard Thomas as Leo Turner
- Peter Clark as Mike Simmonds
- Gwyneth Strong as Jan, Ken's Girlfriend
- Caroline Embling as Susan, Leo's Sister
- Jack Douglas as Senior Police Officer
- Billy Colvill as Williams
- P.H. Moriarty as Police 1
- Richard Hope 

In [7]:
# Explore underlying texts (Largest difference values to mean of B)
if True:
    print_source_text(id_british_a, id_british, differences_b[0][0])
    print_source_text(id_british_b, id_british, differences_b[0][0])
    #print_source_text(id_british_a, id_british, differences_b[1][0])
    #print_source_text(id_british_b, id_british, differences_b[1][0])


Category: british-films
Index: 179
File: 
/home/eml4u/EML4U/data/corpus/2021-02-10-wikipedia-texts/20100408-british-films/The_World_Is_Rich.txt
The World Is Rich is a 1947 documentary film directed by Paul Rotha. It
was nominated for an Academy Award for Best Documentary Feature.
References
External links
- 
fr:The World Is Rich
Category:1947 films Category:British films Category:English-language
films Category:British documentary films Category:Black-and-white films
Category:Films directed by Paul Rotha



Category: british-films
Index: 179
File: 
/home/eml4u/EML4U/data/corpus/2021-02-10-wikipedia-texts/20201101-british-films/The_World_Is_Rich.txt
The World Is Rich is a 1947 British documentary film directed by Paul
Rotha. It was nominated for an Academy Award for Best Documentary
Feature..
References
External links
- 
Category:1947 films Category:1947 documentary films Category:British
films Category:English-language films Category:British documentary films
Category:Black-and-white 
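
Comparing the two points of time directly, as in cell In [8], amounts to a row-wise difference between two arrays of the same shape. A sketch with made-up toy arrays (3 texts, 2 dimensions) that produces the same kind of `(index, value)` tuples:

```python
import numpy

# Toy stand-ins for the embeddings of the same 3 texts at two points of time
toy_a = numpy.array([[0.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
toy_b = numpy.array([[0.0, 1.5], [2.0, 2.0], [1.0, 9.0]])

# One difference value per text: sum of absolute differences along each row
direct = numpy.abs(toy_a - toy_b).sum(axis=1)

# (index, value) tuples, largest difference first
ranked = sorted(enumerate(direct.tolist()), key=lambda tup: tup[1], reverse=True)
print(ranked)  # [(2, 8.0), (0, 0.5), (1, 0.0)]
```

A value of 0.0 means the embedding of that text did not change between the two points of time.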

In [8]:
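# Explore embeddings of two points of time directly
# Note: len(mean_british_a) is the embedding dimension (768), so only the
# texts with index 0..767 are compared here.

differences_direct = []
for i in range(len(mean_british_a)):
    differences_direct.append((i, differenceValue(embeddings_british_a[i], embeddings_british_b[i])))
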
differences_direct_sorted = sorted(differences_direct, key=lambda tup: tup[1], reverse=True)

print("Largest difference values")
print(differences_direct_sorted[0])
print(differences_direct_sorted[1])
print(differences_direct_sorted[2])
print("...")
print(differences_direct_sorted[len(differences_direct_sorted)-2])
print(differences_direct_sorted[len(differences_direct_sorted)-1])
print()

print_source_text(id_british_a, id_british, differences_direct_sorted[0][0])
print_source_text(id_british_b, id_british, differences_direct_sorted[0][0])

Largest difference values
(721, 152.96771019252628)
(610, 141.938466045307)
(409, 141.68865489200107)
...
(422, 5.0086785865423735)
(435, 4.515663030353608)


Category: british-films
Index: 721
File: 
/home/eml4u/EML4U/data/corpus/2021-02-10-wikipedia-texts/20100408-british-films/Bloody_Kids.txt
Bloody Kids is a 1979 film directed by Stephen Frears.
External links
- 
Category:1979 films Category:British films Category:Films directed by
Stephen Frears



Category: british-films
Index: 721
File: 
/home/eml4u/EML4U/data/corpus/2021-02-10-wikipedia-texts/20201101-british-films/Bloody_Kids.txt
Bloody Kids is a British television film written by Stephen Poliakoff
and directed by Stephen Frears, made by Black Lion Films for ATV, and
first shown on ITV on 22 March 1980.
Cast
- Derrick O'Connor as Detective Ritchie (Richard Beckinsale originally
 cast before his sudden death)
- Gary Holton as Ken
- Richard Thomas as Leo Turner
- Peter Clark as Mike Simmonds
- Gwyneth Strong as Jan, Ken's Girlfr

In [9]:
print_source_text(id_british_a, id_british, differences_direct_sorted[1][0])
print_source_text(id_british_b, id_british, differences_direct_sorted[1][0])


Category: british-films
Index: 610
File: 
/home/eml4u/EML4U/data/corpus/2021-02-10-wikipedia-texts/20100408-british-films/Law_and_Disorder_1958_film.txt
Law and Disorder is a 1958 British comedy film directed by Charles
Crichton and starring Michael Redgrave, Robert Morley, Joan Hickson,
Lionel Jeffries and John Le Mesurier.
External links
- 
Category:1958 films Category:1950s comedy films Category:British films
Category:English-language films Category:Films directed by Charles
Crichton



Category: british-films
Index: 610
File: 
/home/eml4u/EML4U/data/corpus/2021-02-10-wikipedia-texts/20201101-british-films/Law_and_Disorder_1958_film.txt
Law and Disorder}} Law and Disorder is a 1958 British comedy film
directed by Charles Crichton and starring Michael Redgrave, Robert
Morley, Joan Hickson, Lionel Jeffries. It was based on the 1954 novel
Smugglers' Circuit by Denys Roberts. The film was started by director
Henry Cornelius who died whilst making the film. He was replaced by
Charles Cr

In [10]:
print_source_text(id_british_a, id_british, differences_direct_sorted[len(differences_direct_sorted)-1][0])
print_source_text(id_british_b, id_british, differences_direct_sorted[len(differences_direct_sorted)-1][0])


Category: british-films
Index: 435
File: 
/home/eml4u/EML4U/data/corpus/2021-02-10-wikipedia-texts/20100408-british-films/Great_Moments_in_Aviation.txt
Great Moments in Aviation is a 1994 romantic drama film, set on a 1950s
passenger liner. The film follows Gabriel Angel (Rakie Ayola), a young
Caribbean aviator who falls in love with the forger Duncan Stewart
(Jonathan Pryce) on her journey to England. Stewart is pursued by his
nemesis Rex Goodyear (John Hurt), and the group are supported by Dr
Angela Bead (Vanessa Redgrave) and Miss Gwendolyn Quim (Dorothy Tutin),
retired missionaries who become lovers during the voyage.
The film was written by Jeanette Winterson, directed by Beeban Kidron
and produced by Phillippa Gregory, the same creative team that
collaborated on Winterson's Oranges Are Not the Only Fruit in 1990.
Winterson intended the screenplay to be reminiscent of a fairytale, and
was unhappy at being asked to write a new ending for its American
release.
The film was shown at