After Uploading the Dataset in R I Have More Rows With False
Examining the dataset
Posted On 21 Oct 2021
Throughout this course, you'll be analyzing a dataset of traffic stops in Rhode Island that was collected by the Stanford Open Policing Project.
Before beginning your analysis, it's important that you familiarize yourself with the dataset. In this exercise, you'll read the dataset into pandas, examine the first few rows, and then count the number of missing values.
Instructions
- Import pandas using the alias pd.
- Read the file police.csv into a DataFrame named ri.
- Examine the first 5 rows of the DataFrame (known as the "head").
- Count the number of missing values in each column: use .isnull() to check which DataFrame elements are missing, then take the .sum() to count the number of True values in each column.
# Import the pandas library as pd
import pandas as pd

# Read 'police.csv' into a DataFrame named ri
ri = pd.read_csv('police.csv')

# Examine the head of the DataFrame
print(ri.head())

# Count the number of missing values in each column
print(ri.isnull().sum())

<script.py> output:
  state   stop_date stop_time  county_name driver_gender driver_race  \
0    RI  2005-01-04     12:55          NaN             M       White
1    RI  2005-01-23     23:15          NaN             M       White
2    RI  2005-02-17     04:15          NaN             M       White
3    RI  2005-02-20     17:15          NaN             M       White
4    RI  2005-02-24     01:20          NaN             F       White

                    violation_raw  violation search_conducted search_type  \
0  Equipment/Inspection Violation  Equipment            False         NaN
1                        Speeding   Speeding            False         NaN
2                        Speeding   Speeding            False         NaN
3                Call for Service      Other            False         NaN
4                        Speeding   Speeding            False         NaN

    stop_outcome is_arrested stop_duration drugs_related_stop district
0       Citation       False      0-15 Min              False  Zone X4
1       Citation       False      0-15 Min              False  Zone K3
2       Citation       False      0-15 Min              False  Zone X4
3  Arrest Driver        True     16-30 Min              False  Zone X1
4       Citation       False      0-15 Min              False  Zone X3

state                     0
stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64
It looks like most of the columns have at least some missing values. We'll figure out how to handle these values in the next exercise!
Dropping columns
Often, a DataFrame will contain columns that are not useful to your analysis. Such columns should be dropped from the DataFrame, to make it easier for you to focus on the remaining columns.
In this exercise, you'll drop the county_name column because it only contains missing values, and you'll drop the state column because all of the traffic stops took place in one state (Rhode Island). Thus, these columns can be dropped because they contain no useful information. The number of missing values in each column has been printed to the console for you.
Instructions
- Examine the DataFrame's .shape to find out the number of rows and columns.
- Drop both the county_name and state columns by passing the column names to the .drop() method as a list of strings.
- Examine the .shape again to verify that there are now two fewer columns.
# Examine the shape of the DataFrame
print(ri.shape)

# Drop the 'county_name' and 'state' columns
ri.drop(['county_name', 'state'], axis='columns', inplace=True)

# Examine the shape of the DataFrame (again)
print(ri.shape)
<script.py> output:
(91741, 15)
(91741, 13)
We'll continue to remove unnecessary data from the DataFrame in the next exercise.
Dropping rows
When you know that a specific column will be critical to your analysis, and only a small fraction of rows are missing a value in that column, it often makes sense to remove those rows from the dataset.
During this course, the driver_gender column will be critical to many of your analyses. Because only a small fraction of rows are missing driver_gender, we'll drop those rows from the dataset.
Instructions
- Count the number of missing values in each column.
- Drop all rows that are missing driver_gender by passing the column name to the subset parameter of .dropna().
- Count the number of missing values in each column again, to verify that none of the remaining rows are missing driver_gender.
- Examine the DataFrame's .shape to see how many rows and columns remain.
# Count the number of missing values in each column
print(ri.isnull().sum())

# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=['driver_gender'], inplace=True)

# Count the number of missing values in each column (again)
print(ri.isnull().sum())

# Examine the shape of the DataFrame
print(ri.shape)
<script.py> output:
stop_date                 0
stop_time                 0
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64
stop_date                 0
stop_time                 0
driver_gender             0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           83229
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
dtype: int64
(86536, 13)
We dropped around 5,000 rows, which is a small fraction of the dataset, and now only one column remains with any missing values.
Fixing a data type
We saw in the previous exercise that the is_arrested column currently has the object data type. In this exercise, we'll change the data type to bool, which is the most suitable type for a column containing True and False values.
Fixing the data type will enable us to use mathematical operations on the is_arrested column that would not be possible otherwise.
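As a quick aside (a toy Series, not the course data), Boolean columns support arithmetic directly: the mean gives the proportion of True values and the sum gives their count.

import pandas as pd

# Toy example: arithmetic on a Boolean Series
s = pd.Series([True, False, False, True])
print(s.mean())  # 0.5 -> proportion of True values
print(s.sum())   # 2   -> count of True values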
Instructions
- Examine the head of the is_arrested column to verify that it contains True and False values and to check the column's data type.
- Use the .astype() method to convert is_arrested to a bool column.
- Check the new data type of is_arrested to confirm that it is now a bool column.
# Examine the head of the 'is_arrested' column
print(ri.is_arrested.head())

# Change the data type of 'is_arrested' to 'bool'
ri['is_arrested'] = ri.is_arrested.astype('bool')

# Check the data type of 'is_arrested'
print(ri.is_arrested.dtypes)

0    False
1    False
2    False
3     True
4    False
Name: is_arrested, dtype: object
bool
It's best to fix these data type problems early on, before you begin your analysis.
Combining object columns
Currently, the date and time of each traffic stop are stored in separate object columns: stop_date and stop_time.
In this exercise, you'll combine these two columns into a single column, and then convert it to datetime format. This will enable convenient date-based attributes that we'll use later in the course.
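As a quick illustration (toy data, not the course dataset) of what the conversion buys you, a datetime Series exposes components like the hour and year through the .dt accessor:

import pandas as pd

# Toy example: datetime conversion unlocks date-based attributes
s = pd.to_datetime(pd.Series(['2005-01-04 12:55', '2005-01-23 23:15']))
print(s.dt.hour)  # 12 and 23
print(s.dt.year)  # 2005 and 2005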
Instructions
- Use a string method to concatenate stop_date and stop_time (separated by a space), and store the result in combined.
- Convert combined to datetime format, and store the result in a new column named stop_datetime.
- Examine the DataFrame .dtypes to confirm that stop_datetime is a datetime column.
# Concatenate 'stop_date' and 'stop_time' (separated by a space)
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')

# Convert 'combined' to datetime format
ri['stop_datetime'] = pd.to_datetime(combined)

# Examine the data types of the DataFrame
print(ri.dtypes)
<script.py> output:
stop_date                     object
stop_time                     object
driver_gender                 object
driver_race                   object
violation_raw                 object
violation                     object
search_conducted                bool
search_type                   object
stop_outcome                  object
is_arrested                     bool
stop_duration                 object
drugs_related_stop              bool
district                      object
stop_datetime         datetime64[ns]
dtype: object
Now we're ready to set the stop_datetime column as the index.
Setting the index
The last step that you'll take in this chapter is to set the stop_datetime column as the DataFrame's index. By replacing the default index with a DatetimeIndex, you'll make it easier to analyze the dataset by date and time, which will come in handy later in the course!
Instructions
- Set stop_datetime as the DataFrame index.
- Examine the index to verify that it is a DatetimeIndex.
- Examine the DataFrame columns to confirm that stop_datetime is no longer one of the columns.
<script.py> output:
DatetimeIndex(['2005-01-04 12:55:00', '2005-01-23 23:15:00',
               '2005-02-17 04:15:00', '2005-02-20 17:15:00',
               '2005-02-24 01:20:00', '2005-03-14 10:00:00',
               '2005-03-29 21:55:00', '2005-04-04 21:25:00',
               '2005-07-14 11:20:00', '2005-07-14 19:55:00',
               ...
               '2015-12-31 13:23:00', '2015-12-31 18:59:00',
               '2015-12-31 19:13:00', '2015-12-31 20:20:00',
               '2015-12-31 20:50:00', '2015-12-31 21:21:00',
               '2015-12-31 21:59:00', '2015-12-31 22:04:00',
               '2015-12-31 22:09:00', '2015-12-31 22:47:00'],
              dtype='datetime64[ns]', name='stop_datetime', length=86536, freq=None)
Index(['stop_date', 'stop_time', 'driver_gender', 'driver_race',
       'violation_raw', 'violation', 'search_conducted', 'search_type',
       'stop_outcome', 'is_arrested', 'stop_duration', 'drugs_related_stop',
       'district'],
      dtype='object')
Now that you have cleaned the dataset, you can begin analyzing it in the next chapter.
Examining traffic violations
Before comparing the violations being committed by each gender, you should examine the violations committed by all drivers to get a baseline understanding of the data.
In this exercise, you'll count the unique values in the violation column, and then separately express those counts as proportions.
Instructions
- Count the unique values in the violation column of the ri DataFrame, to see what violations are being committed by all drivers.
- Express the violation counts as proportions of the total.
# Count the unique values in 'violation'
print(ri.violation.value_counts())

# Express the counts as proportions
print(ri.violation.value_counts(normalize=True))
<script.py> output:
Speeding               48423
Moving violation       16224
Equipment              10921
Other                   4409
Registration/plates     3703
Seat belt               2856
Name: violation, dtype: int64
Speeding               0.559571
Moving violation       0.187483
Equipment              0.126202
Other                  0.050950
Registration/plates    0.042791
Seat belt              0.033004
Name: violation, dtype: float64
More than half of all violations are for speeding, followed by other moving violations and equipment violations.
Comparing violations by gender
The question we're trying to answer is whether male and female drivers tend to commit different types of traffic violations.
In this exercise, you'll first create a DataFrame for each gender, and then analyze the violations in each DataFrame separately.
Instructions
- Create a DataFrame, female, that only contains rows in which driver_gender is 'F'.
- Create a DataFrame, male, that only contains rows in which driver_gender is 'M'.
- Count the violations committed by female drivers and express them as proportions.
- Count the violations committed by male drivers and express them as proportions.
# Create a DataFrame of female drivers
female = ri[ri['driver_gender'] == 'F']

# Create a DataFrame of male drivers
male = ri[ri['driver_gender'] == 'M']

# Compute the violations by female drivers (as proportions)
print(female.violation.value_counts(normalize=True))

# Compute the violations by male drivers (as proportions)
print(male.violation.value_counts(normalize=True))
<script.py> output:
Speeding               0.658114
Moving violation       0.138218
Equipment              0.105199
Registration/plates    0.044418
Other                  0.029738
Seat belt              0.024312
Name: violation, dtype: float64
Speeding               0.522243
Moving violation       0.206144
Equipment              0.134158
Other                  0.058985
Registration/plates    0.042175
Seat belt              0.036296
Name: violation, dtype: float64
About two-thirds of female traffic stops are for speeding, whereas stops of males are more balanced among the six categories. This doesn't mean that females speed more often than males, however, since we didn't take into account the number of stops or drivers.
Comparing speeding outcomes by gender
When a driver is pulled over for speeding, many people believe that gender has an impact on whether the driver will receive a ticket or a warning. Can you find evidence of this in the dataset?
First, you'll create two DataFrames of drivers who were stopped for speeding: one containing females and the other containing males.
Then, for each gender, you'll use the stop_outcome column to calculate what percent of stops resulted in a "Citation" (meaning a ticket) versus a "Warning".
Instructions
- Create a DataFrame, female_and_speeding, that only includes female drivers who were stopped for speeding.
- Create a DataFrame, male_and_speeding, that only includes male drivers who were stopped for speeding.
- Count the stop outcomes for the female drivers and express them as proportions.
- Count the stop outcomes for the male drivers and express them as proportions.
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri.driver_gender == 'F') & (ri.violation == 'Speeding')]

# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri.driver_gender == 'M') & (ri.violation == 'Speeding')]

# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding.stop_outcome.value_counts(normalize=True))

# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding.stop_outcome.value_counts(normalize=True))
Citation            0.952192
Warning             0.040074
Arrest Driver       0.005752
N/D                 0.000959
Arrest Passenger    0.000639
No Action           0.000383
Name: stop_outcome, dtype: float64
Citation            0.944595
Warning             0.036184
Arrest Driver       0.015895
Arrest Passenger    0.001281
No Action           0.001068
N/D                 0.000976
Name: stop_outcome, dtype: float64
The numbers are similar for males and females: about 95% of stops for speeding result in a ticket. Thus, the data fails to show that gender has an impact on who gets a ticket for speeding.
Computing the search rate
During a traffic stop, the police officer sometimes conducts a search of the vehicle. In this exercise, you'll calculate the percentage of all stops in the ri DataFrame that result in a vehicle search, also known as the search rate.
Instructions
- Check the data type of search_conducted to confirm that it's a Boolean Series.
- Calculate the search rate by counting the Series values and expressing them as proportions.
- Calculate the search rate by taking the mean of the Series. (It should match the proportion of True values calculated above.)
# Check the data type of 'search_conducted'
print(ri.search_conducted.dtypes)

# Calculate the search rate by counting the values
print(ri.search_conducted.value_counts(normalize=True))

# Calculate the search rate by taking the mean
print(ri.search_conducted.mean())
bool
False    0.961785
True     0.038215
Name: search_conducted, dtype: float64
0.0382153092354627
It looks like the search rate is about 3.8%. Next, you'll examine whether the search rate varies by driver gender.
Comparing search rates by gender
In this exercise, you'll compare the rates at which female and male drivers are searched during a traffic stop. Recall that the vehicle search rate across all stops is about 3.8%.
First, you'll filter the DataFrame by gender and calculate the search rate for each group separately. Then, you'll perform the same calculation for both genders at once using a .groupby().
Instructions 1/3
- Filter the DataFrame to only include female drivers, then calculate the search rate by taking the mean of search_conducted.
# Calculate the search rate for female drivers
print(ri[ri.driver_gender == 'F'].search_conducted.mean())
Instructions 2/3
- Filter the DataFrame to only include male drivers, then repeat the search rate calculation.
# Calculate the search rate for male drivers
print(ri[ri['driver_gender'] == 'M'].search_conducted.mean())
Instructions 3/3
- Group by driver gender to calculate the search rate for both groups simultaneously. (It should match the previous results.)
# Calculate the search rate for both groups simultaneously
print(ri.groupby('driver_gender').search_conducted.mean())

<script.py> output:
driver_gender
F    0.019181
M    0.045426
Name: search_conducted, dtype: float64
Male drivers are searched more than twice as often as female drivers. Why might this be?
Adding a second factor to the analysis
Even though the search rate for males is much higher than for females, it's possible that the difference is mostly due to a second factor.
For example, you might hypothesize that the search rate varies by violation type, and the difference in search rate between males and females is because they tend to commit different violations.
You can test this hypothesis by examining the search rate for each combination of gender and violation. If the hypothesis was true, you would find that males and females are searched at about the same rate for each violation. Find out below if that's the case!
Instructions 1/2
- Use a .groupby() to calculate the search rate for each combination of gender and violation. Are males and females searched at about the same rate for each violation? (A sketch of this step follows below.)
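A minimal sketch of this step (the solution code for it is not shown on the page): group by both columns, with gender as the outer level.

# Calculate the search rate for each combination of gender and violation
print(ri.groupby(['driver_gender', 'violation']).search_conducted.mean())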
Instructions 2/2
- Reverse the ordering to group by violation before gender. The results may be easier to compare when presented this way.
# Reverse the ordering to group by violation before gender
print(ri.groupby(['violation', 'driver_gender']).search_conducted.mean())
violation            driver_gender
Equipment            F                0.039984
                     M                0.071496
Moving violation     F                0.039257
                     M                0.061524
Other                F                0.041018
                     M                0.046191
Registration/plates  F                0.054924
                     M                0.108802
Seat belt            F                0.017301
                     M                0.035119
Speeding             F                0.008309
                     M                0.027885
Name: search_conducted, dtype: float64
For all types of violations, the search rate is higher for males than for females, disproving our hypothesis.
Counting protective frisks
During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a "protective frisk."
In this exercise, you'll first check to see how many times "Protective Frisk" was the only search type. Then, you'll use a string method to locate all instances in which the driver was frisked.
Instructions
- Count the search_type values in the ri DataFrame to see how many times "Protective Frisk" was the only search type.
- Create a new column, frisk, that is True if search_type contains the string "Protective Frisk" and False otherwise.
- Check the data type of frisk to confirm that it's a Boolean Series.
- Take the sum of frisk to count the total number of frisks.
# Count the 'search_type' values
print(ri.search_type.value_counts())

# Check if 'search_type' contains the string 'Protective Frisk'
ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False)

# Check the data type of 'frisk'
print(ri['frisk'].dtypes)

# Take the sum of 'frisk'
print(ri['frisk'].sum())

Incident to Arrest                                          1290
Probable Cause                                               924
Inventory                                                    219
Reasonable Suspicion                                         214
Protective Frisk                                             164
Incident to Arrest,Inventory                                 123
Incident to Arrest,Probable Cause                            100
Probable Cause,Reasonable Suspicion                           54
Probable Cause,Protective Frisk                               35
Incident to Arrest,Inventory,Probable Cause                   35
Incident to Arrest,Protective Frisk                           33
Inventory,Probable Cause                                      25
Protective Frisk,Reasonable Suspicion                         19
Incident to Arrest,Inventory,Protective Frisk                 18
Incident to Arrest,Probable Cause,Protective Frisk            13
Inventory,Protective Frisk                                    12
Incident to Arrest,Reasonable Suspicion                        8
Incident to Arrest,Probable Cause,Reasonable Suspicion         5
Probable Cause,Protective Frisk,Reasonable Suspicion           5
Incident to Arrest,Inventory,Reasonable Suspicion              4
Inventory,Reasonable Suspicion                                 2
Incident to Arrest,Protective Frisk,Reasonable Suspicion       2
Inventory,Probable Cause,Reasonable Suspicion                  1
Inventory,Protective Frisk,Reasonable Suspicion                1
Inventory,Probable Cause,Protective Frisk                      1
Name: search_type, dtype: int64
bool
303
It looks like there were 303 drivers who were frisked. Next, you'll examine whether gender affects who is frisked.
Comparing frisk rates by gender
In this exercise, you'll compare the rates at which female and male drivers are frisked during a search. Are males frisked more often than females, perhaps because police officers consider them to be higher risk?
Before doing any calculations, it's important to filter the DataFrame to only include the relevant subset of data, namely stops in which a search was conducted.
Instructions
- Create a DataFrame, searched, that only contains rows in which search_conducted is True.
- Take the mean of the frisk column to find out what percent of searches included a frisk.
- Calculate the frisk rate for each gender using a .groupby().
# Create a DataFrame of stops in which a search was conducted
searched = ri[ri.search_conducted == True]

# Calculate the overall frisk rate by taking the mean of 'frisk'
print(searched.frisk.mean())

# Calculate the frisk rate for each gender
print(searched.groupby('driver_gender').frisk.mean())

0.09162382824312065
driver_gender
F    0.074561
M    0.094353
Name: frisk, dtype: float64
The frisk rate is higher for males than for females, though we can't conclude that this difference is caused by the driver's gender.
Calculating the hourly arrest rate
When a police officer stops a driver, a small percentage of those stops ends in an arrest. This is known as the arrest rate. In this exercise, you'll find out whether the arrest rate varies by time of day.
First, you'll calculate the arrest rate across all stops in the ri DataFrame. Then, you'll calculate the hourly arrest rate by using the hour attribute of the index. The hour ranges from 0 to 23, in which:
- 0 = midnight
- 12 = noon
- 23 = 11 PM
Instructions
- Take the mean of the is_arrested column to calculate the overall arrest rate.
- Group by the hour attribute of the DataFrame index to calculate the hourly arrest rate.
- Save the hourly arrest rate Series as a new object, hourly_arrest_rate.
# Calculate the overall arrest rate
print(ri.is_arrested.mean())

# Calculate the hourly arrest rate
print(ri.groupby(ri.index.hour).is_arrested.mean())

# Save the hourly arrest rate
hourly_arrest_rate = ri.groupby(ri.index.hour).is_arrested.mean()
<script.py> output:
0.0355690117407784
stop_datetime
0     0.051431
1     0.064932
2     0.060798
3     0.060549
4     0.048000
5     0.042781
6     0.013813
7     0.013032
8     0.021854
9     0.025206
10    0.028213
11    0.028897
12    0.037399
13    0.030776
14    0.030605
15    0.030679
16    0.035281
17    0.040619
18    0.038204
19    0.032245
20    0.038107
21    0.064541
22    0.048666
23    0.047592
Name: is_arrested, dtype: float64
Next you'll plot the data so that you can visually examine the arrest rate trends.
Plotting the hourly arrest rate
In this exercise, you'll create a line plot from the hourly_arrest_rate object. A line plot is appropriate in this case because you're showing how a quantity changes over time.
This plot should help you to spot some trends that may not have been obvious when examining the raw numbers!
Instructions
- Import matplotlib.pyplot using the alias plt.
- Create a line plot of hourly_arrest_rate using the .plot() method.
- Label the x-axis as 'Hour', label the y-axis as 'Arrest Rate', and title the plot 'Arrest Rate by Time of Day'.
- Display the plot using the .show() function.
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Create a line plot of 'hourly_arrest_rate'
hourly_arrest_rate.plot()

# Add the xlabel, ylabel, and title
plt.xlabel('Hour')
plt.ylabel('Arrest Rate')
plt.title('Arrest Rate by Time of Day')

# Display the plot
plt.show()
The arrest rate has a significant spike overnight, then dips in the early morning hours.
Plotting drug-related stops
In a small portion of traffic stops, drugs are found in the vehicle during a search. In this exercise, you'll assess whether these drug-related stops are becoming more common over time.
The Boolean column drugs_related_stop indicates whether drugs were found during a given stop. You'll calculate the annual drug rate by resampling this column, and then you'll use a line plot to visualize how the rate has changed over time.
Instructions
- Calculate the annual rate of drug-related stops by resampling the drugs_related_stop column (on the 'A' frequency) and taking the mean.
- Save the annual drug rate Series as a new object, annual_drug_rate.
- Create a line plot of annual_drug_rate using the .plot() method.
- Display the plot using the .show() function.
# Calculate the annual rate of drug-related stops
print(ri.drugs_related_stop.resample('A').mean())

# Save the annual rate of drug-related stops
annual_drug_rate = ri.drugs_related_stop.resample('A').mean()

# Create a line plot of 'annual_drug_rate'
annual_drug_rate.plot()

# Display the plot
plt.show()

<script.py> output:
stop_datetime
2005-12-31    0.006501
2006-12-31    0.007258
2007-12-31    0.007970
2008-12-31    0.007505
2009-12-31    0.009889
2010-12-31    0.010081
2011-12-31    0.009731
2012-12-31    0.009921
2013-12-31    0.013094
2014-12-31    0.013826
2015-12-31    0.012266
Freq: A-DEC, Name: drugs_related_stop, dtype: float64
The rate of drug-related stops nearly doubled over the course of 10 years. Why might that be the case?
Comparing drug and search rates
As you saw in the last exercise, the rate of drug-related stops increased significantly between 2005 and 2015. You might hypothesize that the rate of vehicle searches was also increasing, which would have led to an increase in drug-related stops even if more drivers were not carrying drugs.
You can test this hypothesis by calculating the annual search rate, and then plotting it against the annual drug rate. If the hypothesis is true, then you'll see both rates increasing over time.
Instructions
- Calculate the annual search rate by resampling the search_conducted column, and save the result as annual_search_rate.
- Concatenate annual_drug_rate and annual_search_rate along the columns axis, and save the result as annual.
- Create subplots of the drug and search rates from the annual DataFrame.
- Display the subplots.
# Calculate and save the annual search rate
annual_search_rate = ri.search_conducted.resample('A').mean()

# Concatenate 'annual_drug_rate' and 'annual_search_rate'
annual = pd.concat([annual_drug_rate, annual_search_rate], axis='columns')

# Create subplots from 'annual'
annual.plot(subplots=True)

# Display the subplots
plt.show()
The rate of drug-related stops increased even though the search rate decreased, disproving our hypothesis.
Crosstab: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
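For reference, a toy pd.crosstab() example (invented data, not the course dataset): it computes a frequency table of the co-occurrences of two Series.

import pandas as pd

# Toy example: pd.crosstab() tallies co-occurrences of two Series
df = pd.DataFrame({'zone': ['K1', 'K1', 'K2'],
                   'violation': ['Speeding', 'Equipment', 'Speeding']})
print(pd.crosstab(df.zone, df.violation))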
Tallying violations by district
The state of Rhode Island is broken into six police districts, also known as zones. How do the zones compare in terms of what violations are caught by police?
In this exercise, you'll create a frequency table to determine how many violations of each type took place in each of the six zones. Then, you'll filter the table to focus on the "K" zones, which you'll examine further in the next exercise.
Instructions
- Create a frequency table from the ri DataFrame's district and violation columns using the pd.crosstab() function.
- Save the frequency table as a new object, all_zones.
- Select rows 'Zone K1' through 'Zone K3' from all_zones using the .loc[] accessor.
- Save the smaller table as a new object, k_zones.
# Create a frequency table of districts and violations
print(pd.crosstab(ri.district, ri.violation))

# Save the frequency table as 'all_zones'
all_zones = pd.crosstab(ri.district, ri.violation)

# Select rows 'Zone K1' through 'Zone K3'
print(all_zones.loc['Zone K1':'Zone K3'])

# Save the smaller table as 'k_zones'
k_zones = all_zones.loc['Zone K1':'Zone K3']
violation  Equipment  Moving violation  Other  Registration/plates  Seat belt  \
district
Zone K1          672              1254    290                  120          0
Zone K2         2061              2962    942                  768        481
Zone K3         2302              2898    705                  695        638
Zone X1          296               671    143                   38         74
Zone X3         2049              3086    769                  671        820
Zone X4         3541              5353   1560                 1411        843

violation  Speeding
district
Zone K1        5960
Zone K2       10448
Zone K3       12322
Zone X1        1119
Zone X3        8779
Zone X4        9795

violation  Equipment  Moving violation  Other  Registration/plates  Seat belt  \
district
Zone K1          672              1254    290                  120          0
Zone K2         2061              2962    942                  768        481
Zone K3         2302              2898    705                  695        638

violation  Speeding
district
Zone K1        5960
Zone K2       10448
Zone K3       12322
Next you'll plot the violations so that you can compare these districts.
Plotting violations by district
Now that you've created a frequency table focused on the "K" zones, you'll visualize the data to help you compare what violations are being caught in each zone.
First you'll create a bar plot, which is an appropriate plot type since you're comparing categorical data. Then you'll create a stacked bar plot in order to get a slightly different look at the data. Which plot do you find to be more insightful?
Instructions 1/2
- Create a bar plot of k_zones.
- Display the plot and examine it. What do you notice about each of the zones?
# Create a bar plot of 'k_zones'
k_zones.plot(kind='bar')

# Display the plot
plt.show()
Instructions 2/2
- Create a stacked bar plot of k_zones.
- Display the plot and examine it. Do you notice anything different about the data than you did previously?
# Create a stacked bar plot of 'k_zones'
k_zones.plot(kind='bar', stacked=True)

# Display the plot
plt.show()
Interesting! The vast majority of traffic stops in Zone K1 are for speeding, and Zones K2 and K3 are remarkably similar to one another in terms of violations.
Converting stop durations to numbers
In the traffic stops dataset, the stop_duration column tells you approximately how long the driver was detained by the officer. Unfortunately, the durations are stored as strings, such as '0-15 Min'. How can you make this data easier to analyze?
In this exercise, you'll convert the stop durations to integers. Because the precise durations are not available, you'll have to estimate the numbers using reasonable values:
- Convert '0-15 Min' to 8
- Convert '16-30 Min' to 23
- Convert '30+ Min' to 45
Instructions
- Print the unique values in the stop_duration column. (This has been done for you.)
- Create a dictionary called mapping that maps the stop_duration strings to the integers specified above.
- Convert the stop_duration strings to integers using the mapping, and store the results in a new column called stop_minutes.
- Print the unique values in the stop_minutes column, to verify that the durations were properly converted to integers.
# Print the unique values in 'stop_duration'
print(ri.stop_duration.unique())

# Create a dictionary that maps strings to integers
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}

# Convert the 'stop_duration' strings to integers using the 'mapping'
ri['stop_minutes'] = ri.stop_duration.map(mapping)

# Print the unique values in 'stop_minutes'
print(ri.stop_minutes.unique())

<script.py> output:
['0-15 Min' '16-30 Min' '30+ Min']
[ 8 23 45]
Next you'll analyze the stop length for each type of violation.
Plotting stop length
If you were stopped for a particular violation, how long might you expect to be detained?
In this exercise, you'll visualize the average length of time drivers are stopped for each type of violation. Rather than using the violation column in this exercise, you'll use violation_raw since it contains more detailed descriptions of the violations.
Instructions
- For each value in the ri DataFrame's violation_raw column, calculate the mean number of stop_minutes that a driver is detained.
- Save the resulting Series as a new object, stop_length.
- Sort stop_length by its values, and then visualize it using a horizontal bar plot.
- Display the plot.
# Calculate the mean 'stop_minutes' for each value in 'violation_raw'
print(ri.groupby('violation_raw').stop_minutes.mean())

# Save the resulting Series as 'stop_length'
stop_length = ri.groupby('violation_raw').stop_minutes.mean()

# Sort 'stop_length' by its values and create a horizontal bar plot
stop_length.sort_values().plot(kind='barh')

# Display the plot
plt.show()

<script.py> output:
violation_raw
APB                                 17.967033
Call for Service                    22.124371
Equipment/Inspection Violation      11.445655
Motorist Assistance/Courtesy        17.741463
Other Traffic Violation             13.844490
Registration Violation              13.736970
Seatbelt Violation                   9.662815
Special Detail/Directed Patrol      15.123632
Speeding                            10.581562
Suspicious Person                   14.910714
Violation of City/Town Ordinance    13.254144
Warrant                             24.055556
Name: stop_minutes, dtype: float64
You've completed the chapter on visual exploratory data analysis!
Plotting the temperature
In this exercise, you'll examine the temperature columns from the weather dataset to assess whether the data seems trustworthy. First you'll print the summary statistics, and then you'll visualize the data using a box plot.
When deciding whether the values seem reasonable, keep in mind that the temperature is measured in degrees Fahrenheit, not Celsius!
Instructions
- Read weather.csv into a DataFrame named weather.
- Select the temperature columns (TMIN, TAVG, TMAX) and print their summary statistics using the .describe() method.
- Create a box plot to visualize the temperature columns.
- Display the plot.
# Read 'weather.csv' into a DataFrame named 'weather'
weather = pd.read_csv('weather.csv')

# Describe the temperature columns
print(weather[['TMIN', 'TAVG', 'TMAX']].describe())

# Create a box plot of the temperature columns
weather[['TMIN', 'TAVG', 'TMAX']].plot(kind='box')

# Display the plot
plt.show()

<script.py> output:
              TMIN         TAVG         TMAX
count  4017.000000  1217.000000  4017.000000
mean     43.484441    52.493016    61.268608
std      17.020298    17.830714    18.199517
min      -5.000000     6.000000    15.000000
25%      30.000000    39.000000    47.000000
50%      44.000000    54.000000    62.000000
75%      58.000000    68.000000    77.000000
max      77.000000    86.000000   102.000000
The temperature data looks good so far: the TAVG values are in between TMIN and TMAX, and the measurements and ranges seem reasonable.
Plotting the temperature difference
In this exercise, you'll continue to assess whether the dataset seems trustworthy by plotting the difference between the maximum and minimum temperatures.
What do you notice about the resulting histogram? Does it match your expectations, or do you see anything unusual?
Instructions
- Create a new column in the weather DataFrame named TDIFF that represents the difference between the maximum and minimum temperatures.
- Print the summary statistics for TDIFF using the .describe() method.
- Create a histogram with 20 bins to visualize TDIFF.
- Display the plot.
# Create a 'TDIFF' column that represents temperature difference
weather['TDIFF'] = weather.TMAX - weather.TMIN

# Describe the 'TDIFF' column
print(weather.TDIFF.describe())

# Create a histogram with 20 bins to visualize 'TDIFF'
weather.TDIFF.plot(kind='hist', bins=20)

# Display the plot
plt.show()
<script.py> output:
count    4017.000000
mean       17.784167
std         6.350720
min         2.000000
25%        14.000000
50%        18.000000
75%        22.000000
max        43.000000
Name: TDIFF, dtype: float64
The TDIFF column has no negative values and its distribution is approximately normal, both of which are signs that the data is trustworthy.
Counting bad weather conditions
The weather DataFrame contains 20 columns that start with 'WT', each of which represents a bad weather condition. For example:
- WT05 indicates "Hail"
- WT11 indicates "High or damaging winds"
- WT17 indicates "Freezing rain"
For every row in the dataset, each WT column contains either a 1 (meaning the condition was present that day) or NaN (meaning the condition was not present).
In this exercise, you'll quantify "how bad" the weather was each day by counting the number of 1 values in each row.
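A toy illustration (invented values, not the weather data) of why a row sum counts the 1 values: .sum() skips NaN by default.

import numpy as np
import pandas as pd

# Toy example: summing across a row counts the 1s because NaN is skipped
wt = pd.DataFrame({'WT01': [1, np.nan], 'WT02': [np.nan, np.nan], 'WT03': [1, 1]})
print(wt.sum(axis='columns'))  # 2.0 and 1.0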
Instructions
- Copy the columns WT01 through WT22 from weather to a new DataFrame named WT.
- Calculate the sum of each row in WT, and store the results in a new weather column named bad_conditions.
- Replace any missing values in bad_conditions with a 0. (This has been done for you.)
- Create a histogram to visualize bad_conditions, and then display the plot.
# Copy 'WT01' through 'WT22' to a new DataFrame
WT = weather.loc[:, 'WT01':'WT22']

# Calculate the sum of each row in 'WT'
weather['bad_conditions'] = WT.sum(axis='columns')

# Replace missing values in 'bad_conditions' with '0'
weather['bad_conditions'] = weather.bad_conditions.fillna(0).astype('int')

# Create a histogram to visualize 'bad_conditions'
weather.bad_conditions.plot(kind='hist')

# Display the plot
plt.show()
It looks like many days didn't have any bad weather conditions, and only a small portion of days had more than four bad weather conditions.
Rating the weather conditions
In the previous exercise, you counted the number of bad weather conditions each day. In this exercise, you'll use the counts to create a rating system for the weather.
The counts range from 0 to 9, and should be converted to ratings as follows (see the programmatic alternative after this list):
- Convert 0 to 'good'
- Convert 1 through 4 to 'bad'
- Convert 5 through 9 to 'worse'
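As an aside (not part of the exercise), the same mapping could be built programmatically rather than typed out:

# Hypothetical alternative: build the mapping with a dict comprehension
mapping = {i: ('good' if i == 0 else 'bad' if i <= 4 else 'worse')
           for i in range(10)}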
Instructions
- Count the unique values in the bad_conditions column and sort the index. (This has been done for you.)
- Create a dictionary called mapping that maps the bad_conditions integers to strings as specified above.
- Convert the bad_conditions integers to strings using the mapping and store the results in a new column called rating.
- Count the unique values in rating to verify that the integers were properly converted to strings.
# Count the unique values in 'bad_conditions' and sort the index
print(weather.bad_conditions.value_counts().sort_index())

# Create a dictionary that maps integers to strings
mapping = {0: 'good', 1: 'bad', 2: 'bad', 3: 'bad', 4: 'bad',
           5: 'worse', 6: 'worse', 7: 'worse', 8: 'worse', 9: 'worse'}

# Convert the 'bad_conditions' integers to strings using the 'mapping'
weather['rating'] = weather.bad_conditions.map(mapping)

# Count the unique values in 'rating'
print(weather.rating.value_counts())

<script.py> output:
0    1749
1     613
2     367
3     380
4     476
5     282
6     101
7      41
8       4
9       4
Name: bad_conditions, dtype: int64
bad      1836
good     1749
worse     432
Name: rating, dtype: int64
This rating system should make the weather data easier to understand.
Changing the data type to category
Since the rating column only has a few possible values, you'll change its data type to category in order to store the data more efficiently. You'll also specify a logical order for the categories, which will be useful for future exercises.
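A quick toy check of the efficiency claim (illustrative only): comparing the memory usage of an object Series against its category equivalent.

import pandas as pd

# Toy example: category dtype stores repeated strings more compactly
s = pd.Series(['good', 'bad', 'worse'] * 1000)
print(s.memory_usage(deep=True))                     # object dtype
print(s.astype('category').memory_usage(deep=True))  # much smaller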
Instructions
- Create a list object called cats that lists the weather ratings in a logical order: 'good', 'bad', 'worse'.
- Change the data type of the rating column from object to category. Make sure to use the cats list to define the category ordering.
- Examine the head of the rating column to confirm that the categories are logically ordered.
# Create a list of weather ratings in logical order
cats = ['good', 'bad', 'worse']

# Change the data type of 'rating' to an ordered category
# (recent pandas versions require a CategoricalDtype instead of
# passing 'categories' and 'ordered' directly to .astype())
cat_dtype = pd.CategoricalDtype(categories=cats, ordered=True)
weather['rating'] = weather.rating.astype(cat_dtype)

# Examine the head of 'rating'
print(weather.rating.head())

<script.py> output:
0    bad
1    bad
2    bad
3    bad
4    bad
Name: rating, dtype: category
Categories (3, object): [good < bad < worse]
You'll use the rating column in future exercises to analyze the effects of weather conditions on police behavior.
Preparing the DataFrames
In this exercise, you'll prepare the traffic stop and weather rating DataFrames so that they're ready to be merged:
- With the ri DataFrame, you'll move the stop_datetime index to a column since the index will be lost during the merge.
- With the weather DataFrame, you'll select the DATE and rating columns and put them in a new DataFrame.
Instructions
- Reset the index of the ri DataFrame.
- Examine the head of ri to verify that stop_datetime is now a DataFrame column, and the index is now the default integer index.
- Create a new DataFrame named weather_rating that contains only the DATE and rating columns from the weather DataFrame.
- Examine the head of weather_rating to verify that it contains the proper columns.
# Reset the index of 'ri'
ri.reset_index(inplace=True)

# Examine the head of 'ri'
print(ri.head())

# Create a DataFrame from the 'DATE' and 'rating' columns
weather_rating = weather[['DATE', 'rating']]

# Examine the head of 'weather_rating'
print(weather_rating.head())
<script.py> output:
        stop_datetime   stop_date stop_time driver_gender driver_race  \
0 2005-01-04 12:55:00  2005-01-04     12:55             M       White
1 2005-01-23 23:15:00  2005-01-23     23:15             M       White
2 2005-02-17 04:15:00  2005-02-17     04:15             M       White
3 2005-02-20 17:15:00  2005-02-20     17:15             M       White
4 2005-02-24 01:20:00  2005-02-24     01:20             F       White

                    violation_raw  violation search_conducted search_type  \
0  Equipment/Inspection Violation  Equipment            False         NaN
1                        Speeding   Speeding            False         NaN
2                        Speeding   Speeding            False         NaN
3                Call for Service      Other            False         NaN
4                        Speeding   Speeding            False         NaN

    stop_outcome is_arrested stop_duration drugs_related_stop district  \
0       Citation       False      0-15 Min              False  Zone X4
1       Citation       False      0-15 Min              False  Zone K3
2       Citation       False      0-15 Min              False  Zone X4
3  Arrest Driver        True     16-30 Min              False  Zone X1
4       Citation       False      0-15 Min              False  Zone X3

   frisk  stop_minutes
0  False             8
1  False             8
2  False             8
3  False            23
4  False             8

         DATE rating
0  2005-01-01    bad
1  2005-01-02    bad
2  2005-01-03    bad
3  2005-01-04    bad
4  2005-01-05    bad
The ri and weather_rating DataFrames are now ready to be merged.
Merging the DataFrames
In this exercise, you'll merge the ri and weather_rating DataFrames into a new DataFrame, ri_weather.
The DataFrames will be joined using the stop_date column from ri and the DATE column from weather_rating. Thankfully the date formatting matches exactly, which is not always the case!
Once the merge is complete, you'll set stop_datetime as the index, which is the column you saved in the previous exercise.
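If the formats did not match, one option (illustrative, not part of the exercise) is to parse both key columns to datetime before merging so they compare equal:

# Hypothetical fix for mismatched date strings: parse both merge keys
ri['stop_date'] = pd.to_datetime(ri['stop_date'])
weather_rating = weather_rating.assign(DATE=pd.to_datetime(weather_rating['DATE']))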
Instructions
- Examine the shape of the ri DataFrame.
- Merge the ri and weather_rating DataFrames using a left join.
- Examine the shape of ri_weather to confirm that it has two more columns but the same number of rows as ri.
- Replace the index of ri_weather with the stop_datetime column.
# Examine the shape of 'ri'
print(ri.shape)

# Merge 'ri' and 'weather_rating' using a left join
ri_weather = pd.merge(left=ri, right=weather_rating, left_on='stop_date', right_on='DATE', how='left')

# Examine the shape of 'ri_weather'
print(ri_weather.shape)

# Set 'stop_datetime' as the index of 'ri_weather'
ri_weather.set_index('stop_datetime', inplace=True)

<script.py> output:
(86536, 16)
(86536, 18)
In the next section, you'll use ri_weather to analyze the relationship between weather conditions and police behavior.
Comparing arrest rates by weather rating
Do police officers arrest drivers more often when the weather is bad? Find out below!
- First, you'll calculate the overall arrest rate.
- Then, you'll calculate the arrest rate for each of the weather ratings you previously assigned.
- Finally, you'll add violation type as a second factor in the analysis, to see if that accounts for any differences in the arrest rate.
Since you previously defined a logical order for the weather categories, good < bad < worse, they will be sorted that way in the results.
Instructions 1/3
- Calculate the overall arrest rate by taking the mean of the is_arrested Series.
# Calculate the overall arrest rate
print(ri_weather.is_arrested.mean())
<script.py> output:
0.0355690117407784
Instructions 2/3
- Calculate the arrest rate for each weather rating using a .groupby().
# Calculate the arrest rate for each 'rating'
print(ri_weather.groupby('rating').is_arrested.mean())

<script.py> output:
rating
good     0.033715
bad      0.036261
worse    0.041667
Name: is_arrested, dtype: float64
Instructions 3/3
- Calculate the arrest rate for each combination of violation and rating. How do the arrest rates differ by group?
# Calculate the arrest rate for each 'violation' and 'rating'
print(ri_weather.groupby(['violation', 'rating']).is_arrested.mean())
<script.py> output:
violation            rating
Equipment            good      0.059007
                     bad       0.066311
                     worse     0.097357
Moving violation     good      0.056227
                     bad       0.058050
                     worse     0.065860
Other                good      0.076966
                     bad       0.087443
                     worse     0.062893
Registration/plates  good      0.081574
                     bad       0.098160
                     worse     0.115625
Seat belt            good      0.028587
                     bad       0.022493
                     worse     0.000000
Speeding             good      0.013405
                     bad       0.013314
                     worse     0.016886
Name: is_arrested, dtype: float64
The arrest rate increases as the weather gets worse, and that trend persists across many of the violation types. This doesn't prove a causal link, but it's quite an interesting result!
Selecting from a multi-indexed Series
The output of a single .groupby() operation on multiple columns is a Series with a MultiIndex. Working with this type of object is similar to working with a DataFrame:
- The outer index level is like the DataFrame rows.
- The inner index level is like the DataFrame columns.
In this exercise, you'll practice accessing data from a multi-indexed Series using the .loc[] accessor.
Instructions
- Save the output of the .groupby() operation from the last exercise as a new object, arrest_rate. (This has been done for you.)
- Print the arrest_rate Series and examine it.
- Print the arrest rate for moving violations in bad weather.
- Print the arrest rates for speeding violations in all three weather conditions.
# Save the output of the groupby operation from the last exercise
arrest_rate = ri_weather.groupby(['violation', 'rating']).is_arrested.mean()

# Print the 'arrest_rate' Series
print(arrest_rate)

# Print the arrest rate for moving violations in bad weather
print(arrest_rate.loc['Moving violation', 'bad'])

# Print the arrest rates for speeding violations in all three weather conditions
print(arrest_rate.loc['Speeding'])
<script.py> output:
violation            rating
Equipment            good      0.059007
                     bad       0.066311
                     worse     0.097357
Moving violation     good      0.056227
                     bad       0.058050
                     worse     0.065860
Other                good      0.076966
                     bad       0.087443
                     worse     0.062893
Registration/plates  good      0.081574
                     bad       0.098160
                     worse     0.115625
Seat belt            good      0.028587
                     bad       0.022493
                     worse     0.000000
Speeding             good      0.013405
                     bad       0.013314
                     worse     0.016886
Name: is_arrested, dtype: float64
0.05804964058049641
rating
good     0.013405
bad      0.013314
worse    0.016886
Name: is_arrested, dtype: float64
The .loc[] accessor is a powerful and flexible tool for data selection.
Reshaping the arrest rate data
In this exercise, you'll start by reshaping the arrest_rate Series into a DataFrame. This is a useful step when working with any multi-indexed Series, since it enables you to access the full range of DataFrame methods.
Then, you'll create the exact same DataFrame using a pivot table. This is a great example of how pandas often gives you more than one way to reach the same result!
Instructions
- Unstack the arrest_rate Series to reshape it into a DataFrame.
- Create the exact same DataFrame using a pivot table! Each of the three .pivot_table() parameters should be specified as one of the ri_weather columns.
# Unstack the 'arrest_rate' Series into a DataFrame
print(arrest_rate.unstack())

# Create the same DataFrame using a pivot table
print(ri_weather.pivot_table(index='violation', columns='rating', values='is_arrested'))
<script.py> output:
rating                   good       bad     worse
violation
Equipment            0.059007  0.066311  0.097357
Moving violation     0.056227  0.058050  0.065860
Other                0.076966  0.087443  0.062893
Registration/plates  0.081574  0.098160  0.115625
Seat belt            0.028587  0.022493  0.000000
Speeding             0.013405  0.013314  0.016886
rating                   good       bad     worse
violation
Equipment            0.059007  0.066311  0.097357
Moving violation     0.056227  0.058050  0.065860
Other                0.076966  0.087443  0.062893
Registration/plates  0.081574  0.098160  0.115625
Seat belt            0.028587  0.022493  0.000000
Speeding             0.013405  0.013314  0.016886
In the future, when you need to create a DataFrame like this, you can choose whichever method makes the most sense to you.
Source: https://www.hylkerozema.nl/2021/10/21/examining-the-dataset/