# About

This notebook explores the conversion dataset.

The `conversions.tsv` dataset has one row per search conversion.  

The dataset tells you which photo was downloaded for a search, the country of origin, and an anonymous identifier that distinguishes unique users.

[Source](https://github.com/unsplash/datasets/blob/master/DOCS.md)


We will use this dataset to understand the types of queries that users on the platform search for.

# Exploring the data

In [1]:
import pandas as pd

In [2]:
pd.set_option('display.max_rows', 100)


In [3]:
path = "../data/raw/conversions.tsv000"


In [4]:
df = pd.read_csv(path, sep="\t")

In [5]:
len(df)

12166088

Sample view of the data

In [6]:
df.head()

Unnamed: 0,converted_at,conversion_type,keyword,photo_id,anonymous_user_id,conversion_country
0,2020-07-29 00:08:04.221,download,clouds,ABmygVJcYgY,dd01ebdd-7691-4518-ab19-b2105782ae8b,VE
1,2020-07-29 00:25:23.426,download,shark,fB2jl6Rb3l4,c48ba6e0-c6a7-4a92-b569-fe57808a8a2c,QA
2,2020-07-29 00:26:13.122,download,dogs,k1hbfag2na0,62c4f043-579c-438f-8815-eb8ba3c54d34,KR
3,2020-07-29 00:37:03.308,download,astronaut,-SyUjRlHauQ,7ad6dc18-a02e-4ba2-b93c-fd7ea2e551d8,JP
4,2020-07-29 00:54:28.942,download,red roses,A0iTJUhK4es,f03a5708-32e4-4fae-8210-3c5d2632cbfb,NZ


Get top queries

In [7]:
# Count rows per keyword and sort from most to least frequent
df_res = (
    df.groupby(["keyword"], as_index=False)
    .size()
    .sort_values("size", ascending=False)
    .rename(columns={"size": "num_searches"})
)

In [8]:
print(f"Number of unique queries: {len(df_res)}")

Number of unique queries: 569996 
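
With roughly 12 million rows, the same per-keyword count can also be computed without loading the whole file into memory. The following is a minimal sketch using only the standard library. The function name `count_keywords_streaming` and the in-memory sample are illustrative assumptions; the sample rows are copied from the `df.head()` output above.

```python
import csv
import io
from collections import Counter


def count_keywords_streaming(lines, top_n=10):
    """Count occurrences of the `keyword` column in a tab-separated stream.

    `lines` is any iterable of text lines (e.g. an open file) whose first
    line is the header row.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    # Skip rows with a missing or empty keyword, then tally the rest.
    counts = Counter(row["keyword"] for row in reader if row.get("keyword"))
    return counts.most_common(top_n)


# Small in-memory sample (assumption: same columns as conversions.tsv).
sample = io.StringIO(
    "converted_at\tconversion_type\tkeyword\tphoto_id\tanonymous_user_id\tconversion_country\n"
    "2020-07-29 00:08:04.221\tdownload\tclouds\tABmygVJcYgY\tdd01ebdd-7691-4518-ab19-b2105782ae8b\tVE\n"
    "2020-07-29 00:25:23.426\tdownload\tshark\tfB2jl6Rb3l4\tc48ba6e0-c6a7-4a92-b569-fe57808a8a2c\tQA\n"
    "2020-07-29 00:26:13.122\tdownload\tdogs\tk1hbfag2na0\t62c4f043-579c-438f-8815-eb8ba3c54d34\tKR\n"
)
top_queries = count_keywords_streaming(sample)
print(top_queries)
```

On the real file, an open file handle would replace the sample, for example `open(path, newline="", encoding="utf-8")`; the encoding is an assumption about the file.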


In [9]:
df_res.head(30)

Unnamed: 0,keyword,num_searches
334943,nature,381173
445718,sky,239848
193034,flowers,202391
333735,natural,196189
189492,flower,175126
431887,sea,165744
325200,mountain,161816
198609,forest,153677
350461,ocean,145435
45100,beach,136862


## What can we say about the typical queries?

- Most queries seem to have fewer than 3 keywords.
- The top queries are dominated by nature themes (nature, sky, flowers, sea, mountain, forest, ocean, beach).
- No normalization is applied to the queries: singular and plural forms are counted separately (animal vs. animals, mountain vs. mountains).

Queries like those above, with "broad" intent, are not very useful for comparing results.
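
To illustrate the normalization point, here is a minimal sketch of how singular and plural variants could be merged before counting. The plural rule is a deliberately crude heuristic (an assumption, not a real stemmer), and the helper names `normalize_query` and `merge_counts` are hypothetical. The sample counts come from the top-queries table above.

```python
from collections import Counter


def normalize_query(query: str) -> str:
    """Lowercase, collapse whitespace, and strip a trailing plural 's'.

    The plural rule is a rough heuristic: it only touches tokens longer
    than three characters and leaves endings such as 'ss', 'us', 'is' alone.
    """
    normalized = []
    for token in query.lower().split():
        if (
            len(token) > 3
            and token.endswith("s")
            and not token.endswith(("ss", "us", "is"))
        ):
            token = token[:-1]
        normalized.append(token)
    return " ".join(normalized)


def merge_counts(counts):
    """Sum search counts of queries that normalize to the same string."""
    merged = Counter()
    for query, n in counts.items():
        merged[normalize_query(query)] += n
    return merged


# Counts taken from the top-queries table above.
raw_counts = {"flowers": 202391, "flower": 175126, "sea": 165744}
merged = merge_counts(raw_counts)
print(merged.most_common())
```

In the notebook itself, the same function could be applied with `df["keyword"].map(normalize_query)` before grouping, assuming every keyword is a string. A proper stemmer or lemmatizer would handle irregular plurals that this rule misses.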

## Exploring Longer Queries

In [10]:
# Count space-separated tokens per query (repeated spaces inflate the count)
df_res["num_keywords"] = df_res["keyword"].apply(lambda x: len(x.split(" ")))
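
Once a token count is available, the earlier impression that most searches are short can be checked by weighting each query length by its search volume. This is a minimal sketch that assumes a plain mapping from query to count. The helper name `keyword_length_share` is hypothetical, and the three sample counts are taken from this notebook's output tables.

```python
from collections import Counter


def keyword_length_share(counts):
    """Return {number of keywords: share of all searches} for a query->count mapping."""
    totals = Counter()
    for query, n in counts.items():
        # Same tokenization as the notebook cell: split on single spaces.
        totals[len(query.split(" "))] += n
    grand_total = sum(totals.values())
    return {length: total / grand_total for length, total in sorted(totals.items())}


# Counts copied from this notebook's output tables.
sample_counts = {
    "nature": 381173,
    "mountain star landscape night sky": 779,
    "there is no planet b": 242,
}
shares = keyword_length_share(sample_counts)
print(shares)
```

With the DataFrame, a comparable figure could be obtained with `df_res.groupby("num_keywords")["num_searches"].sum() / df_res["num_searches"].sum()`.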

In [11]:
df_long_queries = df_res[df_res["num_keywords"] > 1]

In [12]:
df_long_queries[df_long_queries.num_keywords > 4].head(50)

Unnamed: 0,keyword,num_searches,num_keywords
327457,mountain star landscape night sky,779,5
287590,light at the end of the tunnel,308,7
499894,there is no planet b,242,5
276561,"lago di braies, braies, italy",118,5
534678,water droplets on a leaf,106,5
224699,great sand dunes national park,94,5
258115,image of a man in a desert,82,7
274846,"konkan beach resort, ratnagiri, india",73,5
335652,nature backgrounds water ripple reflection,67,5
459722,south georgia and the south sandwich islands,54,7


## Interesting Queries

Detailed Intent
- water droplets on a leaf
- image of a man in a desert
- person on top of mountain

Location
- ripley's aquarium of canada, toronto, canada
- the butterfly atrium at hershey gardens

Non-English Queries
- salar de uyuni uyuni bolivia
- 沙漠青蛙 沙漠青蛙 desert frog
- por do sol no mar (Portuguese for "sunset at sea")
- conhece te a ti mesmo (Portuguese for "know thyself", a maxim of Greek origin)

Metaphors / Slogans
- light at the end of the tunnel
- there is no planet b

Multiple Candidates
- seven wonders of the world

Long Query / Single Intent
- nova scotia duck tolling retriever (dog breed)
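
The categories above were assigned by hand. As a rough starting point for surfacing candidates automatically, the sketch below tags queries with a few surface-level heuristics. The function name `tag_query`, the tag names, and the rules themselves are all assumptions made for illustration; they will miss many cases, for example non-English queries written in plain ASCII such as `por do sol no mar`.

```python
from typing import List


def tag_query(query: str) -> List[str]:
    """Attach coarse, heuristic tags to a search query."""
    tags = []
    if not query.isascii():
        # Non-ASCII characters hint at a non-English query.
        tags.append("non_ascii")
    if "," in query:
        # Comma-separated parts often look like "place, city, country".
        tags.append("possible_location")
    if len(query.split()) > 4:
        # Same threshold as the notebook's long-query filter (> 4 keywords).
        tags.append("long")
    return tags


for q in [
    "沙漠青蛙 沙漠青蛙 desert frog",
    "lago di braies, braies, italy",
    "por do sol no mar",
]:
    print(q, tag_query(q))
```

Metaphors, slogans, and multi-candidate queries have no obvious surface marker, so they would still need manual review.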


Infrequently searched queries

In [13]:
df_long_queries[df_long_queries.num_keywords > 4].tail(50)

Unnamed: 0,keyword,num_searches,num_keywords
313105,mid night star picture for youtube thumbnail,1,7
119583,cool gamer pics for free,1,5
313060,mid century gothic style rose painting,1,6
313079,mid century modern interior design,1,5
313077,mid century modern home interior,1,5
313076,mid century modern home decor,1,5
313160,middle aged women beauty,1,5
313148,middle age is an age of many colors.,1,8
313185,middle east night in the desert,1,6
313694,milky way at the sea,1,5
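
One row in the table above, `middle aged women beauty`, shows four visible words but a `num_keywords` of 5. A likely explanation (an assumption, since the raw string is not shown) is a doubled or trailing space, which `split(" ")` counts as an extra empty token. The sketch below contrasts that with `split()` called without arguments; the helper name `count_keywords` and the padded sample string are hypothetical.

```python
def count_keywords(query: str) -> int:
    """Count whitespace-separated tokens, ignoring repeated or edge spaces."""
    return len(query.split())


# Hypothetical raw string with a doubled space between two words.
padded = "middle aged  women beauty"

naive_count = len(padded.split(" "))   # the empty string between the two spaces is counted
robust_count = count_keywords(padded)  # runs of whitespace collapse into one separator

print(naive_count, robust_count)  # prints: 5 4
```

Switching the notebook to `split()` would change which rows pass the `num_keywords > 4` filter, so the tables shown here reflect the `split(" ")` behavior.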
