{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "CWPK \\#36: Bulk Modification Techniques\n", "============================\n", "\n", "vlookup is Your Friend\n", "--------------------------\n", "\n", "    =VLOOKUP(A1,map,2,0)\n", "\n", "In this example, `A1` is the item to be looked up in the 'standard' block. If we copy this formula down all rows of the 'standard' block, all items tested for matches will be in column A. The `map` reference in the formula refers to the 'map' named block. The `2` (in reference to the 1 to N above) tells the formula to return the information in column 2 of 'map' if a match occurs in column 1 (which is always a condition of *vlookup*). The `0` is a flag indicating that only an exact match will return a value. If no match occurs, the formula returns `#N/A`; otherwise the value is the content of the matched column cell (column `2` in this case) from 'map'. \n",
"\n",
"If, after doing a complete *vlookup*, I find the results unsatisfactory, I can undo. If I find the results satisfactory, I highlight the entire *vlookup* column, copy it, and then paste it back into the same place with text and results only. This converts the formulas to actual transferred values, and then I can proceed to the next steps, such as moving the column into the block, adding some prefixes, fixing some string differences, and so on. After incorporating the accepted results, it is important to make sure our 'standard' block reflects the additional column information. \n",
"\n",
"Particularly when dealing with annotations, where some columns may contain quite long strings, I do two things, occasioned by the fact that opening a CSV file causes column widths to adjust to the longest entry. First, I do not allow any of the cells to word wrap. This prevents rows from becoming variable heights, which I find difficult to use. Second, I highlight the entire spreadsheet (via the upper-left open header cell) and then set all columns to the same width. This solves the pain of scrolling left or right where some columns are too wide. \n",
"\n",
"It takes a few iterations to get the hang of the *vlookup* function, but, once you do, you will be using it for many of the bulk activities listed in the intro. *vlookup* is a powerful way to check unions (do single-column lookups both ways), intersections, differences, duplicates, and the transfer of new values to incorporate into records.\n",
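For those who prefer to do this lookup step in Python rather than a spreadsheet, here is a minimal stdlib sketch of the same exact-match transfer; the item and label values below are hypothetical stand-ins for a 'standard' block and a two-column 'map' block:

```python
# Hypothetical stand-ins for the 'standard' block (items to look up)
# and the 'map' block (column 1 = key, column 2 = value to transfer).
standard = ["Mammal", "Bird", "Fish"]
mapping = {"Bird": "Avian", "Fish": "Pisces"}

# Exact-match lookup, like =VLOOKUP(A1,map,2,0) copied down all rows;
# unmatched items get a sentinel analogous to the spreadsheet's #N/A.
results = [mapping.get(item, "#N/A") for item in standard]
print(results)  # ['#N/A', 'Avian', 'Pisces']
```

As in the spreadsheet, the `#N/A` entries are the ones to inspect before accepting the transferred values.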
"\n",
"As with other bulk activities, be attentive to backups and to saving results as you proceed through multi-step manipulations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other General Spreadsheet Tips\n",
"Here are some other general tips for using spreadsheets, organized by topic.\n",
"\n",
"#### Sorts\n",
"Named blocks are a best practice, especially for sorts, which are a frequent activity during bulk manipulations. However, sorts done wrong have the potential to totally screw up your information. Remember, our extracts from KBpedia are, at minimum, a [semantic triple](https://en.wikipedia.org/wiki/Semantic_triple), and in the case of annotation extractions, multiple values per subject. This extracted information is written out as records, one after another, row by row. The correspondence of items to one another, in its most basic form the *s-p-o*, is a basic statement or assertion. If we do not keep these parts of *subject - verb - object* together, our statements become gibberish. Let's illustrate this by highlighting one record -- one statement -- in an example KBpedia extraction table:\n",
"\n",
"#### Duplicates\n",
"To flag duplicates, first sort the block on the column to be tested (Col B in this example), and then place an `=EXACT(B1,B2)` formula in the cell adjacent to the block (the two cells are to the immediate right and then one above that for the two arguments). If the contents of `B1` and `B2` are exactly the same, the formula evaluates to `TRUE`, if not `FALSE`. Copy that formula down all rows adjacent to the block.\n",
"\n",
"Every row marked with `TRUE` is a duplicate with respect to Col B. If you want to remove these duplicates, copy the entire formula column, paste it back as text and values only, and then sort that column along with your source block. You can then delete *en masse* all rows with duplicates (`TRUE`).\n",
"\n",
"You can test for duplicates across columns with the same technique. Using the `=CONCATENATE()` function, you may temporarily combine values from multiple columns. Create this synthetic concatenation in its own column, copy it down all block rows, and then test for duplicates with the `=EXACT()` function as above.\n",
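The same two duplicate tests can be scripted in Python. Here is a small sketch with hypothetical rows, flagging duplicates first on a single column and then on a synthetic concatenation of all columns:

```python
# Hypothetical extraction rows: (id, prefLabel)
rows = [
    ("kko:Mammal", "Mammal"),
    ("kko:Mammal", "Mammalia"),  # same id, different label
    ("kko:Bird", "Bird"),
]

# Single-column test, like =EXACT(B1,B2) down a sorted block:
# True plays the role of the TRUE flag in the formula column.
seen = set()
dupes = []
for row in rows:
    dupes.append(row[0] in seen)
    seen.add(row[0])

# Cross-column test, like =EXACT() over a =CONCATENATE() helper column.
seen_concat = set()
concat_dupes = []
for row in rows:
    key = "|".join(row)          # the synthetic concatenation
    concat_dupes.append(key in seen_concat)
    seen_concat.add(key)

print(dupes)         # [False, True, False]
print(concat_dupes)  # [False, False, False]
```

Note how the second row is a duplicate on the id column alone but not on the concatenated key, mirroring the difference between the two spreadsheet tests.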
"\n",
"#### Search and Replace\n",
"The search function in spreadsheets goes well beyond normal text and includes attributes (like bolding), structural characters like tabs or line feeds, and regular expressions ([regex](https://en.wikipedia.org/wiki/Regular_expression)). Regex is a particularly powerful capability that few know well, but an exploration of regex is beyond the scope of this **CWPK** series. I have found simple applications like recognizing capitalization or making conditional replacements to be very helpful, but a basic understanding, let alone mastery, of regex requires a substantial learning commitment.\n",
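Though a regex tutorial is beyond our scope, one small Python illustration may suggest the flavor of those simple applications; the pattern and sample text here are hypothetical. The replacement fires only on a capitalized match at a word boundary, something plain search-and-replace cannot express:

```python
import re

text = "Color charts list color values; Colorful swatches vary."

# \b anchors the match at a word boundary, and the capital 'C' makes the
# replacement conditional on capitalization; lowercase 'color' is untouched.
fixed = re.sub(r"\bColor", "Colour", text)
print(fixed)  # Colour charts list color values; Colourful swatches vary.
```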
"\n",
"I use the search box below the actual spreadsheet for repeated search items. While I use the search dialog for complicated purposes, I put recurring search queries here. To screen against false matches, I also use the match-case switch and try to find larger substrings that embed the fragment I am seeking while excluding the adjacent text that causes the false matches.\n",
"\n",
"Another useful technique is to search only within a selection, enabled by a radio button on the search dialog. Highlighting a single column, for example, or some other selection boundary like a block, enables local replacements without affecting other areas of the sheet. \n",
"\n",
"#### String Manipulations\n",
"One intimidating factor of spreadsheets is the number of functions they offer. However, hidden in this library are many string manipulation capabilities, generally found under the 'Text' category of functions. I have already mentioned `=CONCATENATE()` and `=EXACT()`. Other string functions I have found useful are `=TRIM()` (removes extra spaces), `=CLEAN()` (removes unprintable characters), `=FIND()` (finds substrings, useful for flagging entries with shared characteristics), and `=RIGHT()` (tests the last characters of a string). These kinds of functions can be helpful in cleaning up entries as well as finding stuff within large, bulk files.\n",
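These spreadsheet functions have close Python equivalents among the built-in string methods, which is handy when a manipulation outgrows the spreadsheet. A rough mapping, using a hypothetical cell value:

```python
cell = "  Mammals\t"

trimmed = cell.strip()                 # like =TRIM(): strips outer whitespace
cleaned = "".join(ch for ch in cell if ch.isprintable())  # rough =CLEAN() analogue
position = trimmed.find("mal")         # like =FIND(): returns -1 when absent
is_plural = trimmed.endswith("s")      # like testing the last character with =RIGHT()

print(trimmed, position, is_plural)  # Mammals 3 True
```

One difference worth noting: `=TRIM()` also collapses interior runs of spaces, which `strip()` does not.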
"\n",
"There are quite a few string functions for changing case and converting formats, but I tend to use these less than the many main menu options found under Format → Text. \n",
"\n",
"These functions can often be combined in surprising ways. Here are two examples of string manipulations that are quite useful (you may need to adjust the cell references):\n",
"\n",
"For switching person first and last names (where the target is in A17):\n",
"\n",
"\n", "    =MID(A17,FIND(\" \",A17)+1,1024)&\", \"&LEFT(A17,FIND(\" \",A17)-1)\n", "\n", "For singularizing most plurals (-ies to -y not covered, for example):\n", "\n", "    =IF(OR(RIGHT(A1,1)=\"s\",RIGHT(A1,2)=\"es\"),IF(RIGHT(A1,2)=\"es\",LEFT(A1,(LEN(A1)-2)),LEFT(A1,(LEN(A1)-1))),A1)\n", "\n", "This does not capture all plural variants, but others may be added given the pattern.\n", "\n", "Often a bit of online searching will turn up other gems, depending on what your immediate string manipulation needs may be.\n", "\n", "#### Other\n", "One very useful capability, though close to buried in LibreOffice, is the Data → Text to Columns option. It splits a column into two or more cells based on a given character or attribute, which is helpful for long strings or other jumbled content. Invoke the option on your own spreadsheet to see its dialog. There are many settings for how to recognize the splitting character, each occurrence of which causes a new cell to be populated to the right. Thus, it is best to have the target column with the long strings at the far right of your block: when the option makes splits, it populates cells to the right, but only where the splitting condition is met. If existing columns of information sit to the right, they will become jagged and out of sync." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Representations in Python\n", "\n", "We are already importing the *csv* module into *cowpoke*. However, there is a supplement to that standard module that provides a bit more functionality, called [CleverCSV](https://github.com/alan-turing-institute/clevercsv); I have not tested it. There is also a utility to combine [CSV files with glob](http://goodbody.io/combining-csv-files-with-glob/), which relates more to *pandas* and is also a utility I have not used. \n", "\n", "Please note there are additional data representation tips involving Python in [**CWPK #21**](https://www.mkbergman.com/2353/cwpk-21-some-accumulated-tips/)."
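To make this concrete, here is a minimal round trip with the stdlib *csv* module; the rows are hypothetical, and an in-memory buffer stands in for a real extraction file:

```python
import csv
import io

# Hypothetical extraction rows: id, prefLabel
rows = [["kko:Mammal", "Mammal"], ["kko:Bird", "Bird"]]

# Write then re-read; with a real file, replace io.StringIO() with
# open('extraction.csv', newline='') in the appropriate mode.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_MINIMAL).writerows(rows)

buffer.seek(0)
read_back = [row for row in csv.reader(buffer)]
print(read_back)  # [['kko:Mammal', 'Mammal'], ['kko:Bird', 'Bird']]
```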
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tips for Some Other Tools\n", "As we alluded to in [**CWPK #25**](https://www.mkbergman.com/2358/cwpk-25-querying-kbpedia-with-sparql/), [Wikipedia](https://en.wikipedia.org/wiki/Main_Page), [DBpedia](https://en.wikipedia.org/wiki/DBpedia), and [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) are already mapped to KBpedia and provide rich repositories of instance data retrievable via [SPARQL](https://en.wikipedia.org/wiki/SPARQL). (A later installment will address this topic.) The result sets from these queries may be downloaded as flat files that can be manipulated with all of these CSV techniques. Indeed, retrievals from these sources have been a key means for populating much of the annotation information already in KBpedia.\n", "\n", "You can follow this same method to begin creating your own typologies, to add instances, or to expand the breadth or depth of a given topic area. The basic process is to direct a SPARQL query to the source, download the results, and then manipulate the CSV file for incorporation into one of your knowledge graph's extraction files for the next build iteration. \n", "\n", "Sample [Wikidata queries](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples), numbering into the hundreds, are great study points for SPARQL and sometimes templates for your own queries. I also indicated in **CWPK #25** how the SPARQL VALUES statement may be used to list identifiers for bulk retrievals from these sources. \n", "\n", "You can also use the [Wikipedia and Wikidata Tools](https://meta.wikimedia.org/wiki/Wikipedia_and_Wikidata_Tools) plug-in for [Google spreadsheets](https://gsuite.google.com/marketplace/app/wikipedia_and_wikidata_tools/595109124715?pann=cwsdp&hl=en) to help populate tables that can be exported as CSV for incorporation. You should also check out [OpenRefine](https://openrefine.org/) for data wrangling tasks. 
OpenRefine is very popular with some practitioners, and I have used it on occasion when some of the other tools listed could not automate my task.\n", "\n", "Though listed last, text editors are often the best tool for changes to bulk files. In these cases, we are editing the flat file directly, not through the column-and-row presentation of the spreadsheet. As long as we are cognizant and do not overwrite comma delimiters or quoted long strings, the text and control characters such as tabs or carriage returns can be manipulated with the different functions these applications provide." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A Conclusion to this Part\n", "The completion of this installment means we have made the turn on our roundtrip quest. We have completed our first extraction module and have explained a bit about how we can modify and manipulate the bulk files that result from our extraction routines.\n", "\n", "In our next major part of the **CWPK** series we will use what we have learned to lay out a more complete organization of the project, as well as to complete our roundtripping with the addition of build routines. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Additional Documentation\n", "\n", "As noted, there are many sources from multiple venues that discuss how to use spreadsheets effectively, often with specific reference to CSV files. There are also many online sources that provide guidance on getting data from external endpoints using SPARQL; mapping those results is one of the major reasons for making bulk modifications to our extraction files. 
Here are a few additional sources directly relevant to these topics:\n", "\n", "- Wikidata's own [SPARQL tutorial](https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial)\n", "- Bob DuCharme on getting [full text from Wikipedia with SPARQL](http://www.snee.com/bobdc.blog/2018/10/sparql-full-text-wikipedia-sea.html)\n", "- [Querying Wikidata with Python and SPARQL](https://towardsdatascience.com/where-do-mayors-come-from-querying-wikidata-with-python-and-sparql-91f3c0af22e2) from [Towards Data Science](https://towardsdatascience.com/)\n", "- Data Science at the Command Line's [scrubbing data](https://www.datascienceatthecommandline.com/chapter-5-scrubbing-data.html).\n", "\n", "\n", "
*.ipynb file. It may take a bit of time for the interactive option to load.