# sdg-text-retriever
A program that retrieves texts related to the Sustainable Development Goals (SDGs) from websites and PDF documents.

You can use this code to extract several types of text related to the SDGs. First, you can get the descriptions of Goteo projects, which I retrieve through Goteo's API. To try it, call the getProjectsFromGoteo function in the main class with the page number as a parameter. Each page contains 50 projects, and for each project I persist the name, short description, complete description and owner ID in the database. I used Spring's RestTemplate to implement the request logic. Keep one problem in mind when making requests to Goteo's services: the API limits the number of requests you can make in a short period of time. Once you exceed that limit, every request returns a 429 error (TOO MANY REQUESTS) and you have to wait before trying again.
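One way to cope with the 429 responses is to retry with an increasing pause between attempts. The following is only a minimal sketch of that idea, not the code used in this repository: the class RetryOn429, its methods and the delay values are hypothetical, and the request is modelled as a supplier that returns an HTTP status code so the example stays independent of RestTemplate. Goteo's actual rate limit is not documented here, so the delays are assumptions you would need to tune.

```java
import java.util.function.IntSupplier;

class RetryOn429 {
    static final int TOO_MANY_REQUESTS = 429;

    /** Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20); // cap the shift to avoid overflow
        return Math.min(delay, maxMillis);
    }

    /**
     * Runs the request until it returns a status other than 429 or until
     * maxAttempts calls have been made. Returns the last status seen.
     */
    static int callWithRetry(IntSupplier request, int maxAttempts, long baseMillis, long maxMillis) {
        int status = request.getAsInt();
        for (int attempt = 0; status == TOO_MANY_REQUESTS && attempt < maxAttempts - 1; attempt++) {
            try {
                Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag and stop retrying
                break;
            }
            status = request.getAsInt();
        }
        return status;
    }
}
```

In practice the supplier would wrap the real HTTP call and return its status code, for example `RetryOn429.callWithRetry(() -> fetchPage(3), 5, 1000, 60000)`, where fetchPage is a hypothetical method that performs the request for page 3.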

You can also get texts from SDG websites (such as https://www.un.org/sustainabledevelopment/es/hunger/). To do that, call the function scrapWebONUX, where X is the SDG number, passing the SDG website URL as a parameter. I persist the extracted text in my database. I used the jsoup library to do the web scraping in Java.
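To give an idea of what scraping a page with jsoup can look like, here is a minimal sketch. It is not the body of scrapWebONUX: the class SdgPageScraper, its method names and the choice of collecting every `<p>` element are assumptions made for illustration, and the real pages may need a more specific selector. It requires the jsoup dependency on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SdgPageScraper {

    /** Collects the non-empty text of every <p> element in an already parsed page. */
    static List<String> extractParagraphs(Document doc) {
        List<String> paragraphs = new ArrayList<>();
        for (Element p : doc.select("p")) {
            String text = p.text().trim(); // text() strips tags and normalizes whitespace
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }

    /** Downloads the page at the given URL and extracts its paragraphs. */
    static List<String> scrape(String url) throws IOException {
        Document doc = Jsoup.connect(url)
                .userAgent("sdg-text-retriever-example") // hypothetical user agent string
                .timeout(10_000)                         // milliseconds
                .get();
        return extractParagraphs(doc);
    }
}
```

Splitting the download from the extraction keeps the parsing step easy to try on a saved HTML file with `Jsoup.parse`, without hitting the website each time.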

Finally, if you want to retrieve texts from PDF documents, you can use the functions in the PdfParser class. If you need the entire text of a PDF document, use the getTextFromPDF function, passing the PDF path and the number of the SDG the PDF relates to. The other functions handle PDFs with different formats, where you want to avoid including elements such as tables, references or page numbers. I used the PDFBox library to do the PDF parsing in Java.
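As an illustration of how unwanted elements might be filtered out, the sketch below cleans text that has already been extracted from a PDF (for instance with PDFBox's PDFTextStripper). It is not the logic of PdfParser: the class PdfTextCleaner and both rules (a line made only of digits is a page number, and a "References" heading marks the end of the useful text) are assumed heuristics that will not fit every document.

```java
import java.util.ArrayList;
import java.util.List;

class PdfTextCleaner {

    /**
     * Removes lines that contain only a page number and drops everything
     * from a "References" heading onward. Both rules are heuristics.
     */
    static String clean(String rawText) {
        List<String> kept = new ArrayList<>();
        for (String line : rawText.split("\\R")) { // \R matches any line break
            String trimmed = line.trim();
            if (trimmed.equalsIgnoreCase("References")) {
                break; // assume the reference list runs to the end of the document
            }
            if (trimmed.matches("\\d{1,4}")) {
                continue; // a line holding only digits is treated as a page number
            }
            kept.add(line);
        }
        return String.join("\n", kept);
    }
}
```

Tables are harder to detect from plain text alone, so this sketch leaves them untouched.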