Blog

Extracting content from .pdf files

One of the most common questions I get as a data science consultant involves extracting content from .pdf files. In the best-case scenario, the content can be extracted to consistently formatted text files and parsed from there into a usable form. In the worst case, the file will need to be run through an optical character recognition (OCR) program to extract the text.

Overview of available tools

For years pdftotext from

Read more about Extracting content from .pdf files

Escaping from character encoding hell in R on Windows

Note: the title of this post was inspired by this question on Stack Overflow.

This section gives the basic facts and recommendations for importing files with arbitrary encoding on Windows. The issues described here by and large do not apply on Mac or Linux; they are specific to running R on Windows.

If you are on a deadline and just need to get the job done, this section should be all you need.

Read more about Escaping from character encoding hell in R on Windows

Create Choropleth Maps in R

Choropleth maps are a means of visualizing spatial data by shading or patterning areas of a map in proportion to the values of a variable. This kind of map can provide insight into how a variable is distributed across a geographic area or the level of variability within a region.

Read more about Create Choropleth Maps in R