---
title: "File encoding detection in R"
author: "Stéphane Laurent"
date: "2017-10-24"
output:
  md_document:
    variant: markdown
    preserve_yaml: true
  html_document:
    keep_md: no
tags: R
highlighter: 'pandoc-solarized'
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE)
```

There is a Java port of `universalchardet`, Mozilla's encoding detection library. It is called `juniversalchardet`. I'm going to show how to use it with the `rJava` package.

First, download the `juniversalchardet` `jar` file. Then proceed as follows:

```{r}
library(rJava)
.jinit() # start the JVM
.jaddClassPath("path/to/juniversalchardet-1.0.3.jar")
# create a detector; NULL means no charset listener
detector <- new(J("org/mozilla/universalchardet/UniversalDetector"), NULL)
f <- "juniversalchardet.Rmd" # file whose encoding you want to know
flength <- as.integer(file.size(f))
# feed the whole file to the detector as a byte array
.jcall(detector, "V", "handleData", readBin(f, what="raw", n=flength), 0L, flength)
.jcall(detector, "V", "dataEnd") # signal the end of the data
.jcall(detector, "S", "getDetectedCharset") # name of the detected encoding
.jcall(detector, "V", "reset") # make the detector reusable
```

# Update 2020

Nowadays one can achieve the same result with the R package `uchardet` (which does not use Java).
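
Here is a minimal sketch of the `uchardet` route. It assumes the package is installed from CRAN and that it exports `detect_file_enc()` and `detect_raw_enc()`. The temporary file and the sample text are hypothetical, chosen only to make the example self-contained.

```r
# Assumption: install.packages("uchardet") has been run beforehand
library(uchardet)

# Write a small UTF-8 file to test with (hypothetical sample text)
f <- tempfile(fileext = ".txt")
txt <- enc2utf8("D\u00e9tection de l'encodage : \u00e9t\u00e9, h\u00f4tel, na\u00efve, fran\u00e7ais.")
writeBin(charToRaw(txt), f)

# Detect the encoding directly from the file path
enc_file <- detect_file_enc(f)
enc_file

# Or detect it from raw bytes already read into memory
bytes <- readBin(f, what = "raw", n = file.size(f))
enc_raw <- detect_raw_enc(bytes)
enc_raw

unlink(f) # remove the temporary file
```

As with any statistical detector, the guess is less reliable on very short inputs, so it is worth checking the result on a file of realistic size.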