library(tidyverse)Cleaning
Goals of the notebook:
- Download the data
- Import the data
- Check datatypes
- Create a total_value variable
- Create a control_type variable
- Filter the date range (since Jan. 1 2010)
- Export the cleaned data
Setup
This code chunk sets up the notebook by adding tidyverse.
Downloading the Data
We are downloading the data in this code chunk and commenting the lines so we don’t redownload the data.
# download.file(
# "https://github.com/utdata/rwd-r-leso/blob/main/data-processed/leso.csv?raw=true",
# "data-raw/leso.csv",
# mode = "wb"
# )Importing the Data
This code chunk is reading the data into R getting a look at the data.
# assigning the tibble
leso <- read_csv("data-raw/leso.csv")Rows: 99052 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): state, agency_name, nsn, item_name, ui, demil_code, station_type
dbl (4): sheet, quantity, acquisition_value, demil_ic
dttm (1): ship_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#printing the tibble
leso |> glimpse()Rows: 99,052
Columns: 12
$ sheet <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn <chr> "1005-01-587-7175", "1240-01-411-1265", "2320-01-371…
$ item_name <chr> "MOUNT,RIFLE", "SIGHT,REFLEX", "TRUCK,UTILITY", "BAL…
$ quantity <dbl> 10, 9, 1, 10, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ui <chr> "Each", "Each", "Each", "Kit", "Each", "Each", "Each…
$ acquisition_value <dbl> 1626.00, 371.00, 62627.00, 17264.71, 62627.00, 65800…
$ demil_code <chr> "D", "D", "C", "D", "C", "C", "Q", "D", "C", "D", "D…
$ demil_ic <dbl> 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ship_date <dttm> 2016-09-19, 2016-09-14, 2016-09-29, 2018-01-30, 201…
$ station_type <chr> "State", "State", "State", "State", "State", "State"…
Removing Unnecessary Columns
This code chunk is using the select function to remove the sheet column. The ! negates it.
leso_tight <- leso |>
select(!sheet)
leso_tight |> glimpse()Rows: 99,052
Columns: 11
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn <chr> "1005-01-587-7175", "1240-01-411-1265", "2320-01-371…
$ item_name <chr> "MOUNT,RIFLE", "SIGHT,REFLEX", "TRUCK,UTILITY", "BAL…
$ quantity <dbl> 10, 9, 1, 10, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ui <chr> "Each", "Each", "Each", "Kit", "Each", "Each", "Each…
$ acquisition_value <dbl> 1626.00, 371.00, 62627.00, 17264.71, 62627.00, 65800…
$ demil_code <chr> "D", "D", "C", "D", "C", "C", "Q", "D", "C", "D", "D…
$ demil_ic <dbl> 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ship_date <dttm> 2016-09-19, 2016-09-14, 2016-09-29, 2018-01-30, 201…
$ station_type <chr> "State", "State", "State", "State", "State", "State"…
Creating a total_value column
In this code chunk we are adding the values in two columns and creating a new column for that sum.
leso_total <- leso_tight |>
mutate(
total_value = quantity * acquisition_value
)
leso_total |> glimpse()Rows: 99,052
Columns: 12
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn <chr> "1005-01-587-7175", "1240-01-411-1265", "2320-01-371…
$ item_name <chr> "MOUNT,RIFLE", "SIGHT,REFLEX", "TRUCK,UTILITY", "BAL…
$ quantity <dbl> 10, 9, 1, 10, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ui <chr> "Each", "Each", "Each", "Kit", "Each", "Each", "Each…
$ acquisition_value <dbl> 1626.00, 371.00, 62627.00, 17264.71, 62627.00, 65800…
$ demil_code <chr> "D", "D", "C", "D", "C", "C", "Q", "D", "C", "D", "D…
$ demil_ic <dbl> 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ship_date <dttm> 2016-09-19, 2016-09-14, 2016-09-29, 2018-01-30, 201…
$ station_type <chr> "State", "State", "State", "State", "State", "State"…
$ total_value <dbl> 16260.0, 3339.0, 62627.0, 172647.1, 62627.0, 658000.…
Checking the result
leso_total |>
slice_sample(n = 10) |>
select(agency_name, quantity, acquisition_value, total_value)# A tibble: 10 × 4
agency_name quantity acquisition_value total_value
<chr> <dbl> <dbl> <dbl>
1 KERN COUNTY SHERIFF OFFICE 1 34.7 34.7
2 MADISON POLICE DEPT 14 1572 22008
3 CLAYTON COUNTY SHERIFF OFFICE 1 138 138
4 BOSCAWEN POLICE DEPT 1 499 499
5 WARNER ROBINS POLICE DEPT 1 941. 941.
6 STARK CTY PARK DISTRICT ENFOR DIV 1 749 749
7 GREENVILLE POLICE DEPT 1 138 138
8 MORRISTOWN POLICE DEPT 1 0.27 0.27
9 SHREVEPORT POLICE DEPARTMENT 1 499 499
10 ELBERT COUNTY SHERIFF OFFICE 1 138 138
Marking controlled and non-controlled records
In this code chunk we are marking controlled and non-controlled records in the data
leso_control <- leso_total |>
mutate(
control_type = case_when(
str_detect(item_name, "AIRPLANE") ~ TRUE,
(demil_code == "A" | (demil_code == "Q" & demil_ic == 6)) ~ FALSE,
TRUE ~ TRUE
)
)
leso_control |> glimpse()Rows: 99,052
Columns: 13
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn <chr> "1005-01-587-7175", "1240-01-411-1265", "2320-01-371…
$ item_name <chr> "MOUNT,RIFLE", "SIGHT,REFLEX", "TRUCK,UTILITY", "BAL…
$ quantity <dbl> 10, 9, 1, 10, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ui <chr> "Each", "Each", "Each", "Kit", "Each", "Each", "Each…
$ acquisition_value <dbl> 1626.00, 371.00, 62627.00, 17264.71, 62627.00, 65800…
$ demil_code <chr> "D", "D", "C", "D", "C", "C", "Q", "D", "C", "D", "D…
$ demil_ic <dbl> 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ship_date <dttm> 2016-09-19, 2016-09-14, 2016-09-29, 2018-01-30, 201…
$ station_type <chr> "State", "State", "State", "State", "State", "State"…
$ total_value <dbl> 16260.0, 3339.0, 62627.0, 172647.1, 62627.0, 658000.…
$ control_type <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
Filtering our data
In this code chunk we are filtering the data to only see records since Jan. 1 2010
leso_dated <- leso_control |>
filter(
ship_date >= "2010-01-01"
)
leso_dated |> summary() state agency_name nsn item_name
Length:67852 Length:67852 Length:67852 Length:67852
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
quantity ui acquisition_value demil_code
Min. : 0.00 Length:67852 Min. : 0 Length:67852
1st Qu.: 1.00 Class :character 1st Qu.: 200 Class :character
Median : 1.00 Mode :character Median : 499 Mode :character
Mean : 5.66 Mean : 22002
3rd Qu.: 1.00 3rd Qu.: 3578
Max. :44000.00 Max. :22000000
demil_ic ship_date station_type
Min. :0.000 Min. :2010-01-05 00:00:00.00 Length:67852
1st Qu.:1.000 1st Qu.:2012-03-07 00:00:00.00 Class :character
Median :1.000 Median :2014-08-27 00:00:00.00 Mode :character
Mean :1.436 Mean :2015-08-15 23:52:06.43
3rd Qu.:1.000 3rd Qu.:2018-05-04 00:00:00.00
Max. :7.000 Max. :2023-09-29 07:00:14.00
NA's :5880
total_value control_type
Min. : 0 Mode :logical
1st Qu.: 371 FALSE:7918
Median : 749 TRUE :59934
Mean : 24066
3rd Qu.: 6392
Max. :22000000
Exporting the cleaned data
In this code chunk we are exporting the data we just cleaned for our new analysis notebook
leso_dated |> write_rds("data-processed/01-leso-all.rds")