Title: | Wrangle Campaign Finance Data |
---|---|
Description: | Explore and normalize American campaign finance data. Created by the Investigative Reporting Workshop to facilitate work on The Accountability Project, an effort to collect public data into a central, standard database that is more easily searched: <https://publicaccountability.org/>. |
Authors: | Kiernan Nicholls [aut, cre, cph], Investigative Reporting Workshop [cph], Yanqi Xu [aut], Schuyler Erle [cph] |
Maintainer: | Kiernan Nicholls <[email protected]> |
License: | CC BY 4.0 |
Version: | 1.0.11 |
Built: | 2025-03-10 02:33:19 UTC |
Source: | https://github.com/irworkshop/campfin |
%out% is an inverted version of the infix %in% operator.
x %out% table
x |
vector: the values to be matched. Long vectors are supported. |
table |
vector or |
%out% is currently defined as
"%out%" <- function(x, table) match(x, table, nomatch = 0) == 0
logical; if x is not present in table
c("A", "B", "3") %out% LETTERS
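The definition quoted above can be tried in a plain R session without loading the package. This is a small sketch that copies that one-line definition; the variable name out_flags is only illustrative.

```r
# Definition copied from the documentation above.
"%out%" <- function(x, table) match(x, table, nomatch = 0) == 0

# "3" is not a capital letter, so only the third element is flagged.
out_flags <- c("A", "B", "3") %out% LETTERS
print(out_flags)
```

The result is the element-wise negation of c("A", "B", "3") %in% LETTERS.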
Create or use a named vector (c("full" = "abb")) and pass it to stringr::str_replace_all(). The full argument is surrounded with \\b to capture only isolated intended full versions. Note that the built-in usps_street, usps_city, and usps_state dataframes have the columns reversed from what this function needs (to work by default with the counterpart expand_abbrev()).
abbrev_full(x, full = NULL, rep = NULL, end = FALSE)
x |
A vector containing full words. |
full |
One of three objects: (1) A dataframe with full strings in the
first column and corresponding abbreviations in the second
column; (2) a named vector, with full strings as names for their
respective abbreviations (e.g., |
rep |
If |
end |
logical; if |
The vector x with full words replaced with their abbreviations.
Other geographic normalization functions: abbrev_state(), check_city(), expand_abbrev(), expand_state(), fetch_city(), normal_address(), normal_city(), normal_state(), normal_zip(), str_normal()
abbrev_full("MOUNT VERNON", full = c("MOUNT" = "MT"))
abbrev_full("123 MOUNTAIN ROAD", full = usps_street)
abbrev_full("123 MOUNTAIN ROAD", full = usps_street, end = TRUE)
abbrev_full("Vermont", full = state.name, rep = state.abb)
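The word-boundary behaviour described above can be illustrated with base R alone. The helper abbrev_sketch() below is hypothetical and uses gsub() instead of stringr::str_replace_all(); it is a sketch of the idea, not the package's implementation.

```r
# Hypothetical base R helper; campfin itself relies on stringr::str_replace_all().
abbrev_sketch <- function(x, full) {
  for (word in names(full)) {
    # \\b anchors the match to a word boundary on both sides.
    x <- gsub(paste0("\\b", word, "\\b"), full[[word]], x, perl = TRUE)
  }
  x
}

abbrev_out <- abbrev_sketch(
  c("MOUNT VERNON", "123 MOUNTAIN ROAD"),
  full = c("MOUNT" = "MT")
)
print(abbrev_out)
```

Because of the boundaries, "MOUNT" inside "MOUNTAIN" is left alone while the isolated word is abbreviated.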
This function is used to first normalize a full state name and then call abbrev_full() using valid_name and valid_state as the full and rep arguments.
abbrev_state(full)
full |
A full US state name character vector (e.g., "Vermont"). |
The 2-letter USPS abbreviation of state names (e.g., "VT").
Other geographic normalization functions: abbrev_full(), check_city(), expand_abbrev(), expand_state(), fetch_city(), normal_address(), normal_city(), normal_state(), normal_zip(), str_normal()
abbrev_state(full = state.name)
abbrev_state(full = c("new mexico", "france"))
Use prop.table() to add a proportion column to a dplyr::count() tibble.
add_prop(.data, n, sum = FALSE)
.data |
A data frame with a count column. |
n |
The column name with a count, usually |
sum |
Should |
mean(x %in% y)
A data frame with the new column p.
add_prop(dplyr::count(ggplot2::diamonds, cut))
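The underlying idea, a count column divided by its total, can be sketched with base R. The data frame counts and its values are made up for illustration; only the column names n and p follow the documentation.

```r
# Illustrative counts; the values are invented for this sketch.
counts <- data.frame(cut = c("Fair", "Good", "Ideal"), n = c(10, 30, 60))

# prop.table() divides each count by the sum of all counts.
counts$p <- as.numeric(prop.table(counts$n))
print(counts)
```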
Tests whether all the files in a given directory have a modification date
equal to the system date. Useful when repeatedly running code with a lengthy
download stage. Many state databases are updated daily, so new data can be
helpful but not always necessary. Use this function as the condition of an if statement.
all_files_new(path, glob = NULL, ...)
path |
The path to a directory to check. |
glob |
A pattern to search for files (e.g., "*.csv"). |
... |
Additional arguments passed to |
logical; Whether all() files in the directory have a modification date equal to today.
tmp <- tempdir()
file.create(tempfile(pattern = as.character(1:5)))
all_files_new(tmp)
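A rough base R equivalent of the check, together with the suggested use as an if condition, might look like the following. The helper all_new_sketch() and its pattern argument (a regular expression rather than a glob) are assumptions made for this sketch.

```r
# Hypothetical helper mirroring the documented behaviour with base R only.
all_new_sketch <- function(path, pattern = NULL) {
  files <- list.files(path, pattern = pattern, full.names = TRUE)
  # Compare calendar dates in the local time zone.
  modified <- format(file.mtime(files), "%Y-%m-%d")
  all(modified == format(Sys.time(), "%Y-%m-%d"))
}

dir_new <- file.path(tempdir(), "all_new_sketch")
dir.create(dir_new, showWarnings = FALSE)
file.create(file.path(dir_new, c("a.csv", "b.csv")))

fresh <- all_new_sketch(dir_new, pattern = "\\.csv$")
if (!fresh) {
  message("Files are stale; the download step would run here.")
}
```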
Check whether a place is a valid place or a misspelling by matching against the Google Geocoding search result. Uses httr::GET() to send a request to the Google Maps API for geocoding information. The query concatenates all the geographical information that is passed in into one long string. The function then pulls the formatted_address endpoint of the API results, extracts the long name field from the API locality result, and compares it against the input to see whether the input and output match. Note that you will need to pass your Google Maps Place API key to the key argument.
check_city(city = NULL, state = NULL, zip = NULL, key = NULL, guess = FALSE)
city |
A string of city name to be submitted to the Geocode API. |
state |
Optional. The state associated with the |
zip |
Optional. Supply a string of ZIP code to increase precision. |
key |
A character string to be passed into |
guess |
logical; Should the function return a single row tibble containing the original data sent and the multiple components returned by the Geocode API. |
A logical value by default. If the city returned by the API comes back the same as the city input, the function will evaluate to TRUE; in all other circumstances (including API errors) FALSE is returned.
If the guess argument is set to TRUE, a tibble with 1 row and six columns is returned:
original_city: The city value sent to the API.
original_state: The state value sent to the API.
original_zip: The zip value sent to the API.
check_city_flag: logical; whether the guessed city matches.
guess_city: The legal city guessed by the API.
guess_place: The generic locality guessed by the API.
https://developers.google.com/maps/documentation/geocoding/overview?csw=1
Other geographic normalization functions: abbrev_full(), abbrev_state(), expand_abbrev(), expand_state(), fetch_city(), normal_address(), normal_city(), normal_state(), normal_zip(), str_normal()
Parse dates with format MM/DD/YYYY. This function simply wraps around readr::col_date() with the format argument set to "%m/%d/%Y". Many US campaign finance datasets use this format.
col_date_mdy()
col_date_usa()
A POSIXct
vector.
readr::read_csv(file = "x\n11/09/2016", col_types = readr::cols(x = col_date_mdy()))
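The same format string works with base R's as.Date(), which is a quick way to check what "%m/%d/%Y" means without readr. The variable name election_day is only illustrative.

```r
# Month/day/year text parsed with the format string quoted above.
election_day <- as.Date("11/09/2016", format = "%m/%d/%Y")

# Printed in ISO 8601 order (year-month-day).
print(format(election_day, "%Y-%m-%d"))
```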
Apply a counting summary function like dplyr::n_distinct() or count_na() to every column of a data frame and return the results along with a percentage of that value.
col_stats(data, fun, print = TRUE)
glimpse_fun(data, fun, print = TRUE)
data |
A data frame to glimpse. |
fun |
A function to map to each column. |
print |
logical; Should all columns be printed as rows? |
A tibble with a row for every column with the count and proportion.
col_stats(dplyr::storms, dplyr::n_distinct)
col_stats(dplyr::storms, campfin::count_na)
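As a rough illustration of the per-column summary, the sketch below applies a counting function to each column and divides by the number of rows. The helper col_stats_sketch() and the output column names col, n, and p are assumptions; the package's actual output layout may differ.

```r
# Hypothetical helper: one row per column, with a count and a proportion.
col_stats_sketch <- function(data, fun) {
  n <- vapply(data, fun, numeric(1))
  data.frame(col = names(data), n = unname(n), p = unname(n) / nrow(data))
}

# Count distinct values in each column of a built-in data set.
stats <- col_stats_sketch(iris, function(x) length(unique(x)))
print(stats)
```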
Find the length of the set difference between the x and y vectors.
count_diff(x, y, ignore.case = FALSE)
x |
A vector to check. |
y |
A vector to compare against. |
ignore.case |
logical; if |
sum(x %out% y)
The number of unique values of x not in y.
Other counting wrappers: count_in(), count_na(), count_out(), na_in(), na_out(), na_rep(), prop_distinct(), prop_in(), prop_na(), prop_out(), what_in(), what_out()
# only unique values are checked
count_diff(c("VT", "NH", "ZZ", "ZZ", "ME"), state.abb)
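The difference between counting unique mismatches and counting every mismatch can be seen with base R. This sketch assumes the set-difference reading given in the description; the variable names are illustrative.

```r
x <- c("VT", "NH", "ZZ", "ZZ", "ME")

# Unique values of x that are absent from state.abb ("ZZ" counted once).
n_diff <- length(setdiff(x, state.abb))

# Every element of x that is absent from state.abb ("ZZ" counted twice).
n_out <- sum(!(x %in% state.abb))

print(c(n_diff = n_diff, n_out = n_out))
```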
Count the total values of x that are %in% the vector y.
count_in(x, y, na.rm = TRUE, ignore.case = FALSE)
x |
A vector to check. |
y |
A vector to compare against. |
na.rm |
logical; Should |
ignore.case |
logical; if |
sum(x %in% y)
The sum of x present in y.
Other counting wrappers: count_diff(), count_na(), count_out(), na_in(), na_out(), na_rep(), prop_distinct(), prop_in(), prop_na(), prop_out(), what_in(), what_out()
count_in(c("VT", "NH", "ZZ", "ME"), state.abb)
Count the total values of x that are NA.
count_na(x)
x |
A vector to check. |
sum(is.na(x))
The sum of x that are NA.
Other counting wrappers: count_diff(), count_in(), count_out(), na_in(), na_out(), na_rep(), prop_distinct(), prop_in(), prop_na(), prop_out(), what_in(), what_out()
count_na(c("VT", "NH", NA, "ME"))
Count the total values of x that are %out% of the vector y.
count_out(x, y, na.rm = TRUE, ignore.case = FALSE)
x |
A vector to check. |
y |
A vector to compare against. |
na.rm |
logical; Should |
ignore.case |
logical; if |
sum(x %out% y)
The sum of x absent in y.
Other counting wrappers: count_diff(), count_in(), count_na(), na_in(), na_out(), na_rep(), prop_distinct(), prop_in(), prop_na(), prop_out(), what_in(), what_out()
count_out(c("VT", "NH", "ZZ", "ME"), state.abb)
The Dark2 brewer color palette
dark2
A named character vector of hex color codes (length 8).
Create or use a named vector (c("abb" = "rep")) and pass it to stringr::str_replace_all(). The abb argument is surrounded with \\b to capture only isolated abbreviations. To be used inside normal_address() and normal_city() with usps_street and usps_city, respectively.
expand_abbrev(x, abb = NULL, rep = NULL)
x |
A vector containing abbreviations. |
abb |
One of three objects: (1) A dataframe with abbreviations in the
first column and corresponding replacement strings in the second
column; (2) a named vector, with abbreviations as names for their
respective replacements (e.g., |
rep |
If |
The vector x with abbreviations replaced with their full versions.
Other geographic normalization functions: abbrev_full(), abbrev_state(), check_city(), expand_state(), fetch_city(), normal_address(), normal_city(), normal_state(), normal_zip(), str_normal()
expand_abbrev(x = "MT VERNON", abb = c("MT" = "MOUNT"))
expand_abbrev(x = "VT", abb = state.abb, rep = state.name)
expand_abbrev(x = "Low FE Level", abb = tibble::tibble(x = "FE", y = "Iron"))
This function is used to first normalize an abb and then call expand_abbrev() using valid_state and valid_name as the abb and rep arguments.
expand_state(abb)
abb |
An abbreviated US state name character vector (e.g., "VT"). |
The full state name for each 2-letter USPS abbreviation (e.g., "Vermont").
Other geographic normalization functions: abbrev_full(), abbrev_state(), check_city(), expand_abbrev(), fetch_city(), normal_address(), normal_city(), normal_state(), normal_zip(), str_normal()
expand_state(abb = state.abb)
expand_state(abb = c("nm", "fr"))
This function simply wraps around ggplot2::geom_col() to take a data frame and a categorical variable and return a custom barplot ggplot object. The bars are arranged in descending order and, by default, limited to the 8 most frequent values.
explore_plot(data, var, nbar = 8, palette = "Dark2", na.rm = TRUE)
data |
The data frame to explore. |
var |
A variable to plot. |
nbar |
The number of bars to plot. Always shows most common values. |
palette |
The color palette passed to ggplot2::scale_fill_brewer(). |
na.rm |
logical: Should |
A ggplot barplot object. Can then be combined with other ggplot layers with + to customize.
explore_plot(iris, Species)
Cities not contained in valid_city that are nonetheless accepted localities (neighborhoods or census designated places). This vector consists of normalized self-reported cities in the public data processed by The Accountability Project that were validated by the Google Maps Geocoding API (whose check_city() results evaluate to TRUE).
The most recently updated version of extra_city can be found in this Google Sheet
extra_city
A sorted vector of unique locality names (length 127).
Uses httr::GET() to send a request to the Google Maps API for geocoding information. The query concatenates all the geographical information that is passed in into a single string. The function then pulls the formatted_address endpoint of the API results and extracts the first field of the result. Note that you will need to pass your Google Maps Place API key with the key argument.
fetch_city(address = NULL, key = NULL)
address |
A vector of street addresses. Sent to the API as one string. |
key |
A character containing your alphanumeric Google Maps API key. |
A character vector of formatted address endpoints from Google. This will include all the fields from street address, city, state/province, zipcode/postal code to country/regions. NA_character_ is returned for all errored API calls.
https://developers.google.com/maps/documentation/geocoding/overview?csw=1
Other geographic normalization functions: abbrev_full(), abbrev_state(), check_city(), expand_abbrev(), expand_state(), normal_address(), normal_city(), normal_state(), normal_zip(), str_normal()
The period of time since a system file was modified.
file_age(...)
... |
Arguments passed to |
A Period class object.
file_age(system.file("README.md", package = "campfin"))
Call the file command line tool with option -i.
file_encoding(path)
path |
A local file path or glob to check. |
A tibble of file encoding.
This function uses dplyr::mutate() to create a new dupe_flag logical variable with TRUE values for any record that appears more than once.
flag_dupes(data, ..., .check = TRUE, .both = TRUE)
data |
A data frame to flag. |
... |
Arguments passed to |
.check |
Whether the resulting column should be summed and removed if empty. |
.both |
Whether to flag both duplicates or just subsequent. |
A data frame with a new dupe_flag logical variable.
flag_dupes(iris, dplyr::everything())
flag_dupes(iris, dplyr::everything(), .both = FALSE)
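The two flagging modes can be approximated with base R's duplicated(). This is a sketch of the idea using a made-up data frame named payments; it is not the package's code.

```r
# Invented records; rows 2 and 3 are identical.
payments <- data.frame(id = c(1, 2, 2, 3), amount = c(5, 9, 9, 5))

# Flag every copy of a repeated row (the .both = TRUE behaviour).
both <- duplicated(payments) | duplicated(payments, fromLast = TRUE)

# Flag only the later copies (the .both = FALSE behaviour).
later <- duplicated(payments)

print(data.frame(payments, both, later))
```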
This function uses dplyr::mutate() to create a new na_flag logical variable with TRUE values for any record missing any value in the selected columns.
flag_na(data, ...)
data |
A data frame to flag. |
... |
Arguments passed to |
A data frame with a new na_flag logical variable.
flag_na(dplyr::starwars, hair_color)
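A base R approximation of the same flag uses complete.cases() on the selected columns. The data frame donors is invented for this sketch; only the column name na_flag comes from the documentation.

```r
# Invented records with missing values in different columns.
donors <- data.frame(
  name = c("A", "B", "C"),
  city = c("Stowe", NA, "Barre"),
  zip = c(NA, "05401", "05641")
)

# Only name and city are checked, so the missing zip in row 1 is ignored.
donors$na_flag <- !complete.cases(donors[, c("name", "city")])
print(donors)
```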
Run a full gc() a number of times.
flush_memory(n = 1)
n |
The number of times to run |
Taken from code used in vroom::vroom() with automatic reading.
guess_delim(file, delims = c(",", "\t", "|", ";"), string = FALSE)
file |
Either a path to a file or character string (with at least one newline character). |
delims |
The vector of single characters to guess from. Defaults to: comma, tab, pipe, or semicolon. |
string |
Should the file be treated as a string regardless of newline. |
The single character guessed as a delimiter.
https://github.com/tidyverse/vroom/blob/85143f7a417376eaf0e2037ca9575f637e4346c2/R/vroom.R#L288
guess_delim(system.file("extdata", "vt_contribs.csv", package = "campfin"))
guess_delim("ID;FirstName;MI;LastName;JobTitle", string = TRUE)
guess_delim("\na|b|c\n1|2|3\n")
A custom vector containing common invalid city names.
invalid_city
A vector of length 54.
Invert the names and elements of a vector, useful when using named vectors as the abbreviation arguments of both expand_abbrev() and abbrev_full() (or their parent normalization functions like normal_address()).
invert_named(x)
x |
A named vector. |
A named vector with names in place of elements and vice versa.
invert_named(x = c("name" = "element"))
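The inversion itself is a one-liner in base R. The helper name invert_sketch() is hypothetical and shown only to make the swap concrete.

```r
# Hypothetical helper: the old names become values, the old values become names.
invert_sketch <- function(x) setNames(names(x), unname(x))

inverted <- invert_sketch(c("name" = "element"))
print(inverted)
```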
To return a value of TRUE, (1) the first letter of abb must match the first letter of full, (2) all letters of abb must exist in full, and (3) those letters of abb must be in the same order as they appear in full.
is_abbrev(abb, full)
abb |
A suspected abbreviation |
full |
A long form string to test against |
logical; whether abb is a potential abbreviation of full
is_abbrev(abb = "BRX", full = "BRONX")
is_abbrev(abb = state.abb, full = state.name)
is_abbrev(abb = "NOLA", full = "New Orleans")
is_abbrev(abb = "FE", full = "Iron")
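The three rules can be sketched as an ordered-subsequence check. The helper is_abbrev_sketch() is hypothetical, handles a single pair of strings, and assumes case-insensitive matching; the package's own implementation may differ on these points.

```r
# Hypothetical scalar helper implementing the three rules stated above.
is_abbrev_sketch <- function(abb, full) {
  a <- strsplit(toupper(abb), "")[[1]]
  f <- strsplit(toupper(full), "")[[1]]
  # Rule 1: the first letters must match.
  if (length(a) == 0 || length(f) == 0 || a[1] != f[1]) return(FALSE)
  pos <- 0
  for (ch in a) {
    # Rules 2 and 3: each letter must occur in full after the previous match.
    hits <- which(f == ch)
    hits <- hits[hits > pos]
    if (length(hits) == 0) return(FALSE)
    pos <- hits[1]
  }
  TRUE
}

print(is_abbrev_sketch("BRX", "BRONX"))
print(is_abbrev_sketch("FE", "Iron"))
```

Rule 2 follows from rule 3 in this formulation, since a letter that is found in order necessarily exists in full.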
Uses dplyr::n_distinct() to check if there are only two unique values.
is_binary(x, na.rm = TRUE)
x |
A vector. |
na.rm |
logical; Should NA be ignored, |
TRUE if only 2 unique values.
if (is_binary(x <- c("Yes", "No"))) x == "Yes"
Check if even
is_even(x)
x |
A numeric vector. |
logical; Whether the integer is even or odd.
is_even(1:10)
is_even(10L)
This function works best when converting letters to numbers, as each letter only has a single possible number. For each number, there are 3 or 4 possible letters, resulting in a number of possible conversions. This function was intended to convert phonetic telephone numbers to their valid numeric equivalent; when used in this manner, each letter in a string can be lazily replaced without changing the rest of the string.
keypad_convert(x, ext = FALSE)
x |
A vector of characters or letters. |
ext |
logical; Should extension text be converted to numbers. Defaults to
|
When replacing letters, this function relies on the feature of stringr::str_replace_all() to work with named vectors (c("A" = "2")).
If a character vector is supplied, a vector of each element's numeric counterpart is returned. If a numeric vector (or a completely coercible character vector) is supplied, then a list is returned, each element of which contains a vector of letters for each number.
keypad_convert("1-800-CASH-NOW ext123")
keypad_convert(c("abc", "123"))
keypad_convert(letters)
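The letter-to-digit direction can be sketched with base R's chartr(), used here in place of the named-vector replacement the package relies on. The helper letters_to_digits() is hypothetical and follows the standard telephone keypad layout.

```r
# Hypothetical helper: map each letter to its standard keypad digit.
letters_to_digits <- function(x) {
  chartr(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "22233344455566677778889999",
    toupper(x)
  )
}

# Digits and punctuation pass through unchanged.
dialed <- letters_to_digits("1-800-CASH-NOW")
print(dialed)
```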
From a character vector, which values are most common?
most_common(x, n = 6)
x |
A vector. |
n |
Number of values to return. |
Sorted vector of n most common values.
most_common(iris$Species, n = 1)
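One way to get the same kind of result in base R is to sort a frequency table. The helper most_common_sketch() is hypothetical; how the package breaks ties is not stated in the documentation, so the example avoids them.

```r
# Hypothetical helper: names of the n most frequent values, most frequent first.
most_common_sketch <- function(x, n = 6) {
  counts <- sort(table(x), decreasing = TRUE)
  head(names(counts), n)
}

top <- most_common_sketch(c("VT", "NH", "VT", "ME", "VT", "NH"), n = 2)
print(top)
```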
Set NA for the values of x that are %in% the vector y.
na_in(x, y, ignore.case = FALSE)
x |
A vector to check. |
y |
A vector to compare against. |
ignore.case |
logical; if |
The vector x missing any values in y.
Other counting wrappers: count_diff(), count_in(), count_na(), count_out(), na_out(), na_rep(), prop_distinct(), prop_in(), prop_na(), prop_out(), what_in(), what_out()
na_in(c("VT", "NH", "ZZ", "ME"), state.abb)
na_in(1:10, seq(1, 10, 2))
Set NA for the values of x that are %out% of the vector y.
na_out(x, y, ignore.case = FALSE)
x |
A vector to check. |
y |
A vector to compare against. |
ignore.case |
logical; if |
The vector x missing any values not in y.
Other counting wrappers: count_diff(), count_in(), count_na(), count_out(), na_in(), na_rep(), prop_distinct(), prop_in(), prop_na(), prop_out(), what_in(), what_out()
na_out(c("VT", "NH", "ZZ", "ME"), state.abb)
na_out(1:10, seq(1, 10, 2))
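Both directions of this replacement can be sketched with base R's replace(). The variable names are illustrative, and the ignore.case option is not covered here.

```r
x <- c("VT", "NH", "ZZ", "ME")

# Blank out the values that ARE valid state abbreviations.
in_removed <- replace(x, x %in% state.abb, NA)

# Blank out the values that are NOT valid state abbreviations.
out_removed <- replace(x, !(x %in% state.abb), NA)

print(in_removed)
print(out_removed)
```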
Set NA for the values of x that contain a single repeating character and no other characters.
na_rep(x, n = 0)
x |
A vector to check. |
n |
The minimum number of times a character must repeat. If 0, the default,
then any string of one character will be replaced with |
Uses the regular expression "^(.)\\1+$".
The vector x with NA replacing repeating character values.
Other counting wrappers: count_diff(), count_in(), count_na(), count_out(), na_in(), na_out(), prop_distinct(), prop_in(), prop_na(), prop_out(), what_in(), what_out()
na_rep(c("VT", "NH", "ZZ", "ME"))
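The regular expression quoted above can be tested directly with grepl(). This sketch covers only the default pattern; the n argument is not modelled, and the variable name codes is illustrative.

```r
codes <- c("VT", "NH", "ZZ", "ME", "XXXXXX")

# (.) captures one character and \\1+ requires that same character to repeat
# until the end of the string.
codes[grepl("^(.)\\1+$", codes, perl = TRUE)] <- NA
print(codes)
```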
Show non-ASCII lines of file
non_ascii(path, highlight = FALSE)
path |
The path to a text file to check. |
highlight |
A function used to add ANSI escapes to highlight bytes. |
Tibble of line locations.
non_ascii(system.file("README.md", package = "campfin"))
Return a consistent version of a US street address using stringr::str_*()
functions. Letters are capitalized, punctuation is removed or replaced, and
excess whitespace is trimmed and squished. Optionally, street suffix
abbreviations ("AVE") can be replaced with their long form ("AVENUE").
Invalid addresses from a vector can be removed (possibly using
invalid_city) as well as single (repeating) character strings ("XXXXXX").
normal_address( address, abbs = NULL, na = c("", "NA"), punct = "", na_rep = FALSE, abb_end = TRUE )
address |
A vector of street addresses (ideally without city, state, or postal code). |
abbs |
A named vector or two-column data frame (like usps_street)
passed to |
na |
A character vector of values to make |
punct |
A character value with which to replace all punctuation. |
na_rep |
logical; If |
abb_end |
logical; Should only the last word of the string be abbreviated
with the |
A vector of normalized street addresses.
Other geographic normalization functions: abbrev_full(), abbrev_state(), check_city(), expand_abbrev(), expand_state(), fetch_city(), normal_city(), normal_state(), normal_zip(), str_normal()
normal_address("P.O. #123, C/O John Smith", abbs = usps_street)
normal_address("12east 2nd street, #209", abbs = usps_street, abb_end = FALSE)
Return a consistent version of city names using stringr::str_*() functions. Letters are capitalized, hyphens and underscores are replaced with whitespace, other punctuation is removed, numbers are removed, and excess whitespace is trimmed and squished. Optionally, geographic abbreviations ("MT") can be replaced with their long form ("MOUNT"). Invalid city names from a vector can be removed (possibly using invalid_city) as well as single (repeating) character strings ("XXXXXX").
normal_city(city, abbs = NULL, states = NULL, na = c("", "NA"), na_rep = FALSE)
city |
A vector of city names. |
abbs |
A named vector or data frame of abbreviations passed to
expand_abbrev; see expand_abbrev for format of |
states |
A vector of state abbreviations ("VT") to remove from the end (and only end) of city names ("STOWE VT"). |
na |
A vector of values to make |
na_rep |
logical; If |
A vector of normalized city names.
Other geographic normalization functions: abbrev_full(), abbrev_state(), check_city(), expand_abbrev(), expand_state(), fetch_city(), normal_address(), normal_state(), normal_zip(), str_normal()
normal_city( city = c("Stowe, VT", "UNKNOWN CITY", "Burlington", "ST JOHNSBURY", "XXX"), abbs = c("ST" = "SAINT"), states = "VT", na = invalid_city, na_rep = TRUE )
Take US phone numbers in any number of formats and try to convert them to a standard format.
normal_phone( number, format = "(%a) %e-%l", na_bad = FALSE, convert = FALSE, rm_ext = FALSE )
number |
A vector of phone numbers in any format. |
format |
The desired output format, with |
na_bad |
logical; Should invalid numbers be replaced with |
convert |
logical; Should |
rm_ext |
logical; Should extensions be removed from the end of a number. |
A normalized telephone number.
normal_phone(number = c("916-225-5887"))
Return a consistent version of state abbreviations using stringr::str_*() functions. Letters are capitalized, all non-letter characters are removed, excess whitespace is trimmed and squished, and then abbrev_full() is called with usps_state.
normal_state( state, abbreviate = TRUE, na = c("", "NA"), na_rep = FALSE, valid = NULL )
state |
A vector of US state names or abbreviations. |
abbreviate |
If TRUE (default), replace state names with the 2-letter
abbreviation using the built-in |
na |
A vector of values to make |
na_rep |
logical; If |
valid |
A vector of valid abbreviations to compare to and remove those not shared. |
A vector of normalized 2-letter state abbreviations.
Other geographic normalization functions: abbrev_full(), abbrev_state(), check_city(), expand_abbrev(), expand_state(), fetch_city(), normal_address(), normal_city(), normal_zip(), str_normal()
normal_state( state = c("VT", "N/A", "Vermont", "XX", "ZA"), abbreviate = TRUE, na = c("", "NA"), na_rep = TRUE, valid = NULL )
Return a consistent version of US ZIP codes using stringr::str_*() functions.
Non-number characters are removed, strings are padded with zeroes on the
left, and ZIP+4 suffixes are removed. Invalid ZIP codes from a vector can be
removed as well as single (repeating) character strings.
normal_zip(zip, na = c("", "NA"), na_rep = FALSE, pad = FALSE)
zip |
A vector of US ZIP codes. |
na |
A vector of values to pass to |
na_rep |
logical; If |
pad |
logical; Should ZIP codes less than five digits be padded with a leading zero? Leading zeros (as are found in New England ZIP codes) are often dropped by programs like Microsoft Excel when parsed as numeric values. |
A character vector of normalized 5-digit ZIP codes.
Other geographic normalization functions: abbrev_full(), abbrev_state(), check_city(), expand_abbrev(), expand_state(), fetch_city(), normal_address(), normal_city(), normal_state(), str_normal()
normal_zip( zip = c("05672-5563", "N/A", "05401", "5819", "00000"), na = c("", "NA"), na_rep = TRUE, pad = TRUE )
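The steps described for ZIP codes can be approximated in base R as follows. The helper normal_zip_sketch() is hypothetical, and the order of its steps is an assumption made so the sketch is easy to follow; the package's own output for these inputs may differ.

```r
# Hypothetical base R approximation of the documented cleaning steps.
normal_zip_sketch <- function(zip) {
  z <- sub("-\\d{4}$", "", zip, perl = TRUE)  # drop a ZIP+4 suffix
  z <- gsub("\\D", "", z, perl = TRUE)        # drop non-digit characters
  z[z == ""] <- NA                            # nothing left: treat as missing
  short <- !is.na(z) & nchar(z) < 5
  # Restore leading zeroes that were dropped upstream.
  z[short] <- paste0(strrep("0", 5 - nchar(z[short])), z[short])
  z[grepl("^(.)\\1+$", z, perl = TRUE)] <- NA # single repeating character
  z
}

zips <- normal_zip_sketch(c("05672-5563", "N/A", "05401", "5819", "00000"))
print(zips)
```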
This is an inverse of path.expand(): it replaces the home directory or project directory with a tilde.
path.abbrev(path, dir = fs::path_wd())
path |
Character vector containing one or more full paths. |
dir |
The directory to replace with |
Abbreviated file paths.
print(fs::path_wd("test"))
path.abbrev(fs::path_wd("test"))
Create a tibble with rows for each stage of normalization and columns for the various statistics most useful in assessing the progress of each stage.
progress_table(..., compare)
... |
Any number of vectors to check. |
compare |
A vector to compare each of |
A table with a row for each vector in ....
progress_table(state.name, toupper(state.name), compare = valid_name)
Find the proportion of values of x that are distinct.
prop_distinct(x)
x |
A vector to check. |
length(unique(x))/length(x)
The ratio of distinct values of x to total values of x.
Other counting wrappers: count_diff(), count_in(), count_na(), count_out(), na_in(), na_out(), na_rep(), prop_in(), prop_na(), prop_out(), what_in(), what_out()
prop_distinct(c("VT", "VT", NA, "ME"))
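The formula given above can be evaluated by hand for the example input. Note that unique() keeps NA as a value of its own, so it counts toward the distinct total in this sketch.

```r
x <- c("VT", "VT", NA, "ME")

# Three distinct values ("VT", NA, "ME") out of four elements.
p_distinct <- length(unique(x)) / length(x)
print(p_distinct)
```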
Find the proportion of values of x that are %in% the vector y.
prop_in(x, y, na.rm = TRUE, ignore.case = FALSE)
x |
A vector to check. |
y |
A vector to compare against. |
na.rm |
logical; Should |
ignore.case |
logical; if |
mean(x %in% y)
The proportion of x present in y.
Other counting wrappers: count_diff(), count_in(), count_na(), count_out(), na_in(), na_out(), na_rep(), prop_distinct(), prop_na(), prop_out(), what_in(), what_out()
prop_in(c("VT", "NH", "ZZ", "ME"), state.abb)
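Since a logical vector averages to the share of TRUE values, the formula above reduces to a one-liner in base R. The complement shown alongside it corresponds to the %out% case; the na.rm and ignore.case options are not modelled here.

```r
x <- c("VT", "NH", "ZZ", "ME")

# Three of the four values are valid state abbreviations.
p_in <- mean(x %in% state.abb)

# The remaining share is the proportion absent from state.abb.
p_out <- mean(!(x %in% state.abb))

print(c(p_in = p_in, p_out = p_out))
```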
Find the proportion of values of x that are NA.
prop_na(x)
x |
A vector to check. |
mean(is.na(x))
The proportion of values of x that are NA.
Other counting wrappers: count_diff(), count_in(), count_na(), count_out(), na_in(), na_out(), na_rep(), prop_distinct(), prop_in(), prop_out(), what_in(), what_out()
prop_na(c("VT", "NH", NA, "ME"))
Find the proportion of values of x that are %out% of the vector y.
prop_out(x, y, na.rm = TRUE, ignore.case = FALSE)
x |
A vector to check. |
y |
A vector to compare against. |
na.rm |
logical; Should |
ignore.case |
logical; if |
mean(x %out% y)
The proportion of x absent in y.
Other counting wrappers: count_diff(), count_in(), count_na(), count_out(), na_in(), na_out(), na_rep(), prop_distinct(), prop_in(), prop_na(), what_in(), what_out()
prop_out(c("VT", "NH", "ZZ", "ME"), state.abb)
Read the first line of a delimited file as vector.
read_names(file, delim = guess_delim(file))
file |
Path to text file. |
delim |
Character separating column names. |
Character vector of column names.
read_names("date,lgl\n11/09/2016,TRUE")
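For literal text input, the same result can be sketched in base R by splitting the first line on the delimiter. The helper read_names_sketch() is hypothetical and does not read from file paths or guess the delimiter.

```r
# Hypothetical helper: column names from the first line of delimited text.
read_names_sketch <- function(text, delim = ",") {
  first_line <- strsplit(text, "\n", fixed = TRUE)[[1]][1]
  strsplit(first_line, delim, fixed = TRUE)[[1]]
}

col_names <- read_names_sketch("date,lgl\n11/09/2016,TRUE")
print(col_names)
```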
When performing a dplyr::left_join(), the suffix argument allows the user to replace the default .x and .y that are appended to column names shared between the two data frames. This function allows a user to convert those suffixes to prefixes.
rename_prefix(df, suffix = c(".x", ".y"), punct = TRUE)
df |
A joined data frame. |
suffix |
If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2. Will be converted to prefixes. |
punct |
logical; Should punctuation at the start of the suffix be
detected and placed at the end of the new prefix? |
A data frame with new column names.
a <- data.frame(x = letters[1:3], y = 1:3)
b <- data.frame(x = letters[1:3], y = 4:6)
df <- dplyr::left_join(a, b, by = "x", suffix = c(".a", ".b"))
rename_prefix(df, suffix = c(".a", ".b"), punct = TRUE)
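The suffix-to-prefix conversion can be sketched in base R as follows. The name `rename_prefix_sketch` is hypothetical, and the treatment of `punct` (moving leading punctuation to the end of the new prefix) is an assumption based on the argument description, not the package source.

```r
# Hypothetical sketch: turn a join suffix such as ".a" into a prefix "a.".
rename_prefix_sketch <- function(df, suffix = c(".x", ".y"), punct = TRUE) {
  nm <- names(df)
  for (s in suffix) {
    has_suffix <- endsWith(nm, s)
    if (!any(has_suffix)) next
    # Strip the suffix from the end of each matching name.
    stem <- substr(nm[has_suffix], 1, nchar(nm[has_suffix]) - nchar(s))
    # Assumed punct behaviour: ".a" becomes "a." when punct is TRUE.
    prefix <- if (punct) sub("^([[:punct:]]+)(.*)$", "\\2\\1", s) else s
    nm[has_suffix] <- paste0(prefix, stem)
  }
  names(df) <- nm
  df
}

# A data frame shaped like the result of a join with suffix = c(".a", ".b").
joined <- data.frame(x = letters[1:3], y.a = 1:3, y.b = 4:6)
names(rename_prefix_sketch(joined, suffix = c(".a", ".b")))
#> [1] "x"   "a.y" "b.y"
```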
Wrap a word in word boundary (\\b) characters. Useful when combined with
stringr::str_which() and stringr::str_detect() to match only entire words
and not that word inside another word (e.g., "sting" and "testing").
rx_break(pattern)
pattern |
A regex pattern (a word) to wrap in \\b. |
A glue vector of pattern wrapped in \\b.
rx_break("test")
rx_break(state.abb[1:5])
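The "sting" versus "testing" case can be demonstrated with base R alone. The name `rx_break_sketch` is hypothetical, and it returns a plain character vector rather than the glue vector the package documents.

```r
# Hypothetical sketch: wrap each pattern in word-boundary anchors.
rx_break_sketch <- function(pattern) {
  paste0("\\b", pattern, "\\b")
}

rx <- rx_break_sketch("sting")

# The whole word matches; the same letters inside "testing" do not.
grepl(rx, c("a bee sting", "testing"), perl = TRUE)
#> [1]  TRUE FALSE
```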
The regex string to match US phone numbers in a variety of common formats.
rx_phone
A character string (length 1).
The regex string to extract state string preceding ZIP code.
rx_state
A character string (length 1).
The regex string to match valid URLs.
rx_url
A character string (length 1).
The regex string to extract ZIP code from the end of address.
rx_zip
A character string (length 1).
Truncate the labels of a plot's discrete x-axis so that the text does not overflow and collide with other bars.
scale_x_truncate(n = 15, ...)
scale_x_wrap(width = 15, ...)
n |
The maximum width of string. Passed to |
... |
Additional arguments passed to |
width |
Positive integer giving target line width in characters. A width
less than or equal to 1 will put each word on its own line. Passed to
|
This function wraps around stringdist::stringdist().
str_dist(a, b, method = "osa", ...)
a |
|
b |
|
method |
Method for distance calculation. The default is "osa". |
... |
Other arguments passed to |
The distance between string a and string b.
str_dist(a = "BRULINGTN", b = "BURLINGTON")
The generic normalization that underpins functions like normal_city() and
normal_address(). This function simply chains together three
stringr::str_*() functions:
Convert to uppercase.
Replace punctuation with whitespace.
Trim and squish excess whitespace.
str_normal(x, case = TRUE, punct = "", quote = TRUE, squish = TRUE)
x |
A character string to normalize. |
case |
logical; whether |
punct |
character; A character string to replace most punctuation with. |
quote |
logical; whether |
squish |
logical; whether |
A normalized vector of the same length.
Other geographic normalization functions:
abbrev_full(), abbrev_state(), check_city(), expand_abbrev(), expand_state(), fetch_city(), normal_address(), normal_city(), normal_state(), normal_zip()
str_normal(" TestING 123 example_test.String ")
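The three steps listed above can be approximated in base R. This is only a sketch under assumptions: `str_normal_sketch` is a hypothetical name, it uses `toupper()`, `gsub()` and `trimws()` in place of the stringr functions, and it does not reproduce the `case`, `quote` or `squish` switches. Its output may differ from the package's.

```r
# Hypothetical base R approximation of the three documented steps.
str_normal_sketch <- function(x, punct = "") {
  x <- toupper(x)                     # 1. convert to uppercase
  x <- gsub("[[:punct:]]", punct, x)  # 2. replace punctuation with punct
  x <- gsub("\\s+", " ", trimws(x))   # 3. trim ends, squish inner runs
  x
}

# Replacing punctuation with a space keeps the words apart.
str_normal_sketch(" TestING 123   example_test.String ", punct = " ")
#> [1] "TESTING 123 EXAMPLE TEST STRING"
```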
This function tests whether a single file has a modification date equal to
the system date. Useful when repeatedly running code with a lengthy download
stage. Many state databases are updated daily, so new data can be helpful but
not always necessary. Set this function in an if statement.
this_file_new(path)
path |
The path to a file to check. |
logical; Whether the file has a modification date equal to today.
tmp <- tempfile()
this_file_new(tmp)
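One way to picture the check, and its use inside an if statement, is the base R sketch below. The name `this_file_new_sketch` is hypothetical. Comparing local-time date strings and treating a missing file as not new are assumptions, not the package's documented behavior.

```r
# Hypothetical sketch: TRUE only when the file was modified today (local time).
this_file_new_sketch <- function(path) {
  modified <- format(file.mtime(path), "%Y-%m-%d")
  today <- format(Sys.time(), "%Y-%m-%d")
  isTRUE(modified == today)  # a missing file gives NA, which becomes FALSE
}

tmp <- tempfile()
file.create(tmp)

# Skip the lengthy download stage when today's copy already exists.
if (!this_file_new_sketch(tmp)) {
  message("File is missing or stale; a download step would go here.")
}
```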
Call httr::HEAD() and return the number of bytes in the file to be
downloaded.
url_file_size(url)
url |
The URL of the file to query. |
The size of a file to be downloaded.
Combine the basename() of a file URL with a directory path.
url2path(url, dir)
url |
The URL of a file to download. |
dir |
The directory where the file will be downloaded. |
Useful in the destfile argument to download.file() to save a file with
the same name as the URL's file name.
The desired file path to a URL file.
url2path("https://floridalobbyist.gov/reports/llob.txt", tempdir())
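The description suggests something close to the following base R sketch. The name `url2path_sketch` is hypothetical, and the use of `file.path()` to join the pieces is an assumption.

```r
# Hypothetical sketch: keep the URL's file name, place it under dir.
url2path_sketch <- function(url, dir) {
  file.path(dir, basename(url))
}

dest <- url2path_sketch("https://floridalobbyist.gov/reports/llob.txt", tempdir())
basename(dest)
#> [1] "llob.txt"
```

The result could then be passed as the destfile argument of download.file().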
Take the arguments supplied and put them into the appropriate places in a new template diary. Write the new template diary in the supplied directory.
use_diary(
  st,
  type,
  author,
  path = "state/{st}/{type}/docs/{st}_{type}_diary.Rmd",
  auto = FALSE
)
st |
The USPS state abbreviation. State data only, no federal agencies. |
type |
The type of data, one of "contribs", "expends", "lobby", "contracts", "salary", or "voters". |
author |
The author name of the new diary. |
path |
The file path, relative to your working directory, where the
diary file will be created. If you use |
auto |
Must be set to |
The file path of the new diary, invisibly.
use_diary("VT", "contribs", "Kiernan Nicholls", NA, auto = FALSE)
use_diary("DC", "expends", "Kiernan Nicholls", tempfile(), auto = FALSE)
A curated and edited subset of usps_street containing the
USPS abbreviations found in city names. Useful as the geo_abbs argument
of normal_city().
usps_city
A tibble with 154 rows of 2 variables:
Primary Street Suffix
Commonly Used Street Suffix or Abbreviation
...
USPS Appendix C1, Street Abbreviations
A tibble containing the USPS two-letter state abbreviations.
usps_state
A tibble with 62 rows of 2 variables:
Full state name
Two-letter USPS abbreviation
...
USPS Appendix B, Two–Letter State Abbreviations
A tibble containing common street suffixes or suffix
abbreviations and their full equivalent. Useful as the add_abbs argument
of normal_address().
usps_street
A tibble with 325 rows of 3 variables:
Primary Street Suffix.
Commonly Used Street Suffix or Abbreviation.
...
USPS Appendix C1 Street Abbreviations.
The abb column of the usps_state tibble.
valid_abb
A vector of two-letter abbreviations (length 62).
The city column of the zipcodes tibble.
valid_city
A sorted vector of unique city names (length 19,083).
The state column of the usps_state tibble.
valid_name
A vector of state names (length 62).
Contains 12 more names than datasets::state.name.
The abb column of the usps_state tibble.
valid_state
A vector of two-letter abbreviations (length 62).
The zip column of the zipcodes tibble.
valid_zip
A sorted vector of 5-digit ZIP codes (length 44,334).
Return the values of x that are %in% the vector y.
what_in(x, y, ignore.case = FALSE)
x |
A vector to check. |
y |
A vector to compare against. |
ignore.case |
logical; if |
x[which(x %in% y)]
The elements of x that are %in% y.
Other counting wrappers:
count_diff(), count_in(), count_na(), count_out(), na_in(), na_out(), na_rep(), prop_distinct(), prop_in(), prop_na(), prop_out(), what_out()
what_in(c("VT", "DC", NA), state.abb)
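The documented `x[which(x %in% y)]` can be sketched in base R as below. The name `what_in_sketch` is hypothetical, and the `ignore.case` handling (comparing uppercase copies while returning the original values) is an assumption.

```r
# Hypothetical sketch of what_in(), following x[which(x %in% y)].
what_in_sketch <- function(x, y, ignore.case = FALSE) {
  if (ignore.case) {
    keep <- toupper(x) %in% toupper(y)  # assumed case-insensitive comparison
  } else {
    keep <- x %in% y
  }
  x[which(keep)]
}

# "DC" is not part of state.abb and NA never matches, so only "VT" remains.
what_in_sketch(c("VT", "DC", NA), state.abb)
#> [1] "VT"
```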
Return the values of x that are %out% of the vector y.
what_out(x, y, na.rm = TRUE, ignore.case = FALSE)
x |
A vector to check. |
y |
A vector to compare against. |
na.rm |
logical; Should |
ignore.case |
logical; if |
x[which(x %out% y)]
The elements of x that are %out% y.
Other counting wrappers:
count_diff(), count_in(), count_na(), count_out(), na_in(), na_out(), na_rep(), prop_distinct(), prop_in(), prop_na(), prop_out(), what_in()
what_out(c("VT", "DC", NA), state.abb)
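A base R sketch of `x[which(x %out% y)]` shows the effect of `na.rm`. The name `what_out_sketch` is hypothetical, and dropping NA values before the comparison is an assumed reading of the `na.rm` argument.

```r
# %out% as defined earlier in this reference: TRUE when x is absent from table.
`%out%` <- function(x, table) match(x, table, nomatch = 0) == 0

# Hypothetical sketch of what_out(). The na.rm and ignore.case handling is assumed.
what_out_sketch <- function(x, y, na.rm = TRUE, ignore.case = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  if (ignore.case) {
    out <- toupper(x) %out% toupper(y)
  } else {
    out <- x %out% y
  }
  x[which(out)]
}

what_out_sketch(c("VT", "DC", NA), state.abb)
#> [1] "DC"

# With na.rm = FALSE, NA counts as absent from y and is returned as well.
what_out_sketch(c("VT", "DC", NA), state.abb, na.rm = FALSE)
#> [1] "DC" NA
```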
This tibble is the third version of a popular zipcodes database.
The original CivicSpace US ZIP Code Database was created by Schuyler Erle
using ZIP code gazetteers from the US Census Bureau from 1999 and 2000,
augmented with additional ZIP code information from the Census Bureau’s
TIGER/Line 2003 data set. The second version was published as the
zipcode::zipcode dataframe object. This version has dropped the latitude
and longitude, reorganized columns, and normalized the city values with
normal_city().
zipcodes
A tibble with 44,336 rows of 3 variables:
Normalized city name.
Two letter state abbreviation.
Five-digit ZIP Code.
...
Daniel Coven's federalgovernmentzipcodes.us web site and the CivicSpace US ZIP Code Database written by Schuyler Erle, 5 August 2004. Original CSV files available from https://web.archive.org/web/20221005220101/http://federalgovernmentzipcodes.us/free-zipcode-database-Primary.csv