Cleaning

Goals of the notebook:

  • Download the data
  • Import the data
  • Check datatypes
  • Create a total_value variable
  • Create a control_type variable
  • Filter the date range (since Jan. 1 2010)
  • Export the cleaned data

Setup

This code chunk sets up the notebook by adding tidyverse.

library(tidyverse)

Downloading the Data

We are downloading the data in this code chunk and commenting the lines so we don’t redownload the data.

# download.file(
 # "https://github.com/utdata/rwd-r-leso/blob/main/data-processed/leso.csv?raw=true",
 # "data-raw/leso.csv",
 # mode = "wb"
# )

Importing the Data

This code chunk is reading the data into R getting a look at the data.

# assigning the tibble

leso <- read_csv("data-raw/leso.csv")
Rows: 99052 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): state, agency_name, nsn, item_name, ui, demil_code, station_type
dbl  (4): sheet, quantity, acquisition_value, demil_ic
dttm (1): ship_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#printing the tibble

leso |> glimpse()
Rows: 99,052
Columns: 12
$ sheet             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "1005-01-587-7175", "1240-01-411-1265", "2320-01-371…
$ item_name         <chr> "MOUNT,RIFLE", "SIGHT,REFLEX", "TRUCK,UTILITY", "BAL…
$ quantity          <dbl> 10, 9, 1, 10, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ui                <chr> "Each", "Each", "Each", "Kit", "Each", "Each", "Each…
$ acquisition_value <dbl> 1626.00, 371.00, 62627.00, 17264.71, 62627.00, 65800…
$ demil_code        <chr> "D", "D", "C", "D", "C", "C", "Q", "D", "C", "D", "D…
$ demil_ic          <dbl> 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ship_date         <dttm> 2016-09-19, 2016-09-14, 2016-09-29, 2018-01-30, 201…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…

Removing Unnecessary Columns

This code chunk is using the select function to remove the sheet column. The ! negates it.

leso_tight <- leso |>
  select(!sheet)

leso_tight |> glimpse()
Rows: 99,052
Columns: 11
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "1005-01-587-7175", "1240-01-411-1265", "2320-01-371…
$ item_name         <chr> "MOUNT,RIFLE", "SIGHT,REFLEX", "TRUCK,UTILITY", "BAL…
$ quantity          <dbl> 10, 9, 1, 10, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ui                <chr> "Each", "Each", "Each", "Kit", "Each", "Each", "Each…
$ acquisition_value <dbl> 1626.00, 371.00, 62627.00, 17264.71, 62627.00, 65800…
$ demil_code        <chr> "D", "D", "C", "D", "C", "C", "Q", "D", "C", "D", "D…
$ demil_ic          <dbl> 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ship_date         <dttm> 2016-09-19, 2016-09-14, 2016-09-29, 2018-01-30, 201…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…

Creating a total_value column

In this code chunk we are adding the values in two columns and creating a new column for that sum.

leso_total <- leso_tight |> 

mutate(
  total_value = quantity * acquisition_value
  )

leso_total |> glimpse()
Rows: 99,052
Columns: 12
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "1005-01-587-7175", "1240-01-411-1265", "2320-01-371…
$ item_name         <chr> "MOUNT,RIFLE", "SIGHT,REFLEX", "TRUCK,UTILITY", "BAL…
$ quantity          <dbl> 10, 9, 1, 10, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ui                <chr> "Each", "Each", "Each", "Kit", "Each", "Each", "Each…
$ acquisition_value <dbl> 1626.00, 371.00, 62627.00, 17264.71, 62627.00, 65800…
$ demil_code        <chr> "D", "D", "C", "D", "C", "C", "Q", "D", "C", "D", "D…
$ demil_ic          <dbl> 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ship_date         <dttm> 2016-09-19, 2016-09-14, 2016-09-29, 2018-01-30, 201…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…
$ total_value       <dbl> 16260.0, 3339.0, 62627.0, 172647.1, 62627.0, 658000.…

Checking the result

leso_total |> 
  slice_sample(n = 10) |> 
  select(agency_name, quantity, acquisition_value, total_value)
# A tibble: 10 × 4
   agency_name                       quantity acquisition_value total_value
   <chr>                                <dbl>             <dbl>       <dbl>
 1 KERN COUNTY SHERIFF OFFICE               1             34.7        34.7 
 2 MADISON POLICE DEPT                     14           1572       22008   
 3 CLAYTON COUNTY SHERIFF OFFICE            1            138         138   
 4 BOSCAWEN POLICE DEPT                     1            499         499   
 5 WARNER ROBINS POLICE DEPT                1            941.        941.  
 6 STARK CTY PARK DISTRICT ENFOR DIV        1            749         749   
 7 GREENVILLE POLICE DEPT                   1            138         138   
 8 MORRISTOWN POLICE DEPT                   1              0.27        0.27
 9 SHREVEPORT POLICE DEPARTMENT             1            499         499   
10 ELBERT COUNTY SHERIFF OFFICE             1            138         138   

Marking controlled and non-controlled records

In this code chunk we are marking controlled and non-controlled records in the data

leso_control <- leso_total |> 
  mutate(
    control_type = case_when(
      str_detect(item_name, "AIRPLANE") ~ TRUE,
      (demil_code == "A" | (demil_code == "Q" & demil_ic == 6)) ~ FALSE,
      TRUE ~ TRUE
    )
  )

leso_control |> glimpse()
Rows: 99,052
Columns: 13
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "1005-01-587-7175", "1240-01-411-1265", "2320-01-371…
$ item_name         <chr> "MOUNT,RIFLE", "SIGHT,REFLEX", "TRUCK,UTILITY", "BAL…
$ quantity          <dbl> 10, 9, 1, 10, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ui                <chr> "Each", "Each", "Each", "Kit", "Each", "Each", "Each…
$ acquisition_value <dbl> 1626.00, 371.00, 62627.00, 17264.71, 62627.00, 65800…
$ demil_code        <chr> "D", "D", "C", "D", "C", "C", "Q", "D", "C", "D", "D…
$ demil_ic          <dbl> 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ship_date         <dttm> 2016-09-19, 2016-09-14, 2016-09-29, 2018-01-30, 201…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…
$ total_value       <dbl> 16260.0, 3339.0, 62627.0, 172647.1, 62627.0, 658000.…
$ control_type      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…

Filtering our data

In this code chunk we are filtering the data to only see records since Jan. 1 2010

leso_dated <- leso_control |> 
  filter(
    ship_date >= "2010-01-01"
  )

leso_dated |> summary()
    state           agency_name            nsn             item_name        
 Length:67852       Length:67852       Length:67852       Length:67852      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
    quantity             ui            acquisition_value   demil_code       
 Min.   :    0.00   Length:67852       Min.   :       0   Length:67852      
 1st Qu.:    1.00   Class :character   1st Qu.:     200   Class :character  
 Median :    1.00   Mode  :character   Median :     499   Mode  :character  
 Mean   :    5.66                      Mean   :   22002                     
 3rd Qu.:    1.00                      3rd Qu.:    3578                     
 Max.   :44000.00                      Max.   :22000000                     
                                                                            
    demil_ic       ship_date                      station_type      
 Min.   :0.000   Min.   :2010-01-05 00:00:00.00   Length:67852      
 1st Qu.:1.000   1st Qu.:2012-03-07 00:00:00.00   Class :character  
 Median :1.000   Median :2014-08-27 00:00:00.00   Mode  :character  
 Mean   :1.436   Mean   :2015-08-15 23:52:06.43                     
 3rd Qu.:1.000   3rd Qu.:2018-05-04 00:00:00.00                     
 Max.   :7.000   Max.   :2023-09-29 07:00:14.00                     
 NA's   :5880                                                       
  total_value       control_type   
 Min.   :       0   Mode :logical  
 1st Qu.:     371   FALSE:7918     
 Median :     749   TRUE :59934    
 Mean   :   24066                  
 3rd Qu.:    6392                  
 Max.   :22000000                  
                                   

Exporting the cleaned data

In this code chunk we are exporting the data we just cleaned for our new analysis notebook

leso_dated |> write_rds("data-processed/01-leso-all.rds")