Analyzing Police Activity with pandas: Traffic Stops in Rhode Island

Examining the dataset

Throughout this course, you'll be analyzing a dataset of traffic stops in Rhode Island that was collected by the Stanford Open Policing Project.

Before beginning your analysis, it's important that you familiarize yourself with the dataset. In this exercise, you'll read the dataset into pandas, examine the first few rows, and then count the number of missing values.

Instructions
  • Import pandas using the alias pd.
  • Read the file police.csv into a DataFrame named ri.
  • Examine the first 5 rows of the DataFrame (known as the "head").
  • Count the number of missing values in each column: use .isnull() to check which DataFrame elements are missing, and then take the .sum() to count the number of True values in each column.
# Import the pandas library as pd
import pandas as pd

# Read 'police.csv' into a DataFrame named ri
ri = pd.read_csv('police.csv')

# Examine the head of the DataFrame
print(ri.head())

# Count the number of missing values in each column
print(ri.isnull().sum())
<script.py> output:
      state   stop_date stop_time  county_name driver_gender driver_race  \
    0    RI  2005-01-04     12:55          NaN             M       White
    1    RI  2005-01-23     23:15          NaN             M       White
    2    RI  2005-02-17     04:15          NaN             M       White
    3    RI  2005-02-20     17:15          NaN             M       White
    4    RI  2005-02-24     01:20          NaN             F       White

                        violation_raw  violation  search_conducted search_type  \
    0  Equipment/Inspection Violation  Equipment             False         NaN
    1                        Speeding   Speeding             False         NaN
    2                        Speeding   Speeding             False         NaN
    3                Call for Service      Other             False         NaN
    4                        Speeding   Speeding             False         NaN

        stop_outcome is_arrested stop_duration  drugs_related_stop district
    0       Citation       False      0-15 Min               False  Zone X4
    1       Citation       False      0-15 Min               False  Zone K3
    2       Citation       False      0-15 Min               False  Zone X4
    3  Arrest Driver        True     16-30 Min               False  Zone X1
    4       Citation       False      0-15 Min               False  Zone X3

    state                     0
    stop_date                 0
    stop_time                 0
    county_name           91741
    driver_gender          5205
    driver_race            5202
    violation_raw          5202
    violation              5202
    search_conducted          0
    search_type           88434
    stop_outcome           5202
    is_arrested            5202
    stop_duration          5202
    drugs_related_stop        0
    district                  0
    dtype: int64

It looks like most of the columns have at least some missing values. We'll figure out how to handle these values in the next exercise!

Dropping columns

Often, a DataFrame will contain columns that are not useful to your analysis. Such columns should be dropped from the DataFrame, to make it easier for you to focus on the remaining columns.

In this exercise, you'll drop the county_name column because it only contains missing values, and you'll drop the state column because all of the traffic stops took place in one state (Rhode Island). Thus, these columns can be dropped because they contain no useful information. The number of missing values in each column has been printed to the console for you.

Instructions
  • Examine the DataFrame's .shape to find out the number of rows and columns.
  • Drop both the county_name and state columns by passing the column names to the .drop() method as a list of strings.
  • Examine the .shape again to verify that there are now two fewer columns.
# Examine the shape of the DataFrame
print(ri.shape)

# Drop the 'county_name' and 'state' columns
ri.drop(['county_name', 'state'], axis='columns', inplace=True)

# Examine the shape of the DataFrame (again)
print(ri.shape)
<script.py> output:
    (91741, 15)
    (91741, 13)

We'll continue to remove unnecessary information from the DataFrame in the next exercise.

Dropping rows

When you know that a specific column will be critical to your analysis, and only a small fraction of rows are missing a value in that column, it often makes sense to remove those rows from the dataset.

During this course, the driver_gender column will be critical to many of your analyses. Because only a small fraction of rows are missing driver_gender, we'll drop those rows from the dataset.

Instructions
  • Count the number of missing values in each column.
  • Drop all rows that are missing driver_gender by passing the column name to the subset parameter of .dropna().
  • Count the number of missing values in each column again, to verify that none of the remaining rows are missing driver_gender.
  • Examine the DataFrame's .shape to see how many rows and columns remain.
# Count the number of missing values in each column
print(ri.isnull().sum())

# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=['driver_gender'], inplace=True)

# Count the number of missing values in each column (again)
print(ri.isnull().sum())

# Examine the shape of the DataFrame
print(ri.shape)
<script.py> output:
    stop_date                 0
    stop_time                 0
    driver_gender          5205
    driver_race            5202
    violation_raw          5202
    violation              5202
    search_conducted          0
    search_type           88434
    stop_outcome           5202
    is_arrested            5202
    stop_duration          5202
    drugs_related_stop        0
    district                  0
    dtype: int64
    stop_date                 0
    stop_time                 0
    driver_gender             0
    driver_race               0
    violation_raw             0
    violation                 0
    search_conducted          0
    search_type           83229
    stop_outcome              0
    is_arrested               0
    stop_duration             0
    drugs_related_stop        0
    district                  0
    dtype: int64
    (86536, 13)

We dropped around 5,000 rows, which is a small fraction of the dataset, and now only one column remains with any missing values.

Fixing a data type

We saw in the previous exercise that the is_arrested column currently has the object data type. In this exercise, we'll change the data type to bool, which is the most suitable type for a column containing True and False values.

Fixing the data type will enable us to use mathematical operations on the is_arrested column that would not be possible otherwise.

Instructions
  • Examine the head of the is_arrested column to verify that it contains True and False values and to check the column's data type.
  • Use the .astype() method to convert is_arrested to a bool column.
  • Check the new data type of is_arrested to confirm that it is now a bool column.
# Examine the head of the 'is_arrested' column
print(ri.is_arrested.head())

# Change the data type of 'is_arrested' to 'bool'
ri['is_arrested'] = ri.is_arrested.astype('bool')

# Check the data type of 'is_arrested'
print(ri.is_arrested.dtypes)
0    False
1    False
2    False
3     True
4    False
Name: is_arrested, dtype: bool
bool

It's best to fix these data type problems early, before you begin your analysis.
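Once the column is a true bool, pandas treats True as 1 and False as 0 in numerical operations, which is exactly what makes the rate calculations later in this course possible. A minimal sketch using the same ri DataFrame as above:

# True counts as 1 and False as 0 in aggregations
print(ri.is_arrested.sum())   # total number of arrests
print(ri.is_arrested.mean())  # proportion of stops ending in arrest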

Combining object columns

Currently, the date and time of each traffic stop are stored in separate object columns: stop_date and stop_time.

In this exercise, you'll combine these two columns into a single column, and then convert it to datetime format. This will enable convenient date-based attributes that we'll use later in the course.

Instructions
  • Use a string method to concatenate stop_date and stop_time (separated by a space), and store the result in combined.
  • Convert combined to datetime format, and store the result in a new column named stop_datetime.
  • Examine the DataFrame .dtypes to confirm that stop_datetime is a datetime column.
# Concatenate 'stop_date' and 'stop_time' (separated by a space)
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')

# Convert 'combined' to datetime format
ri['stop_datetime'] = pd.to_datetime(combined)

# Examine the data types of the DataFrame
print(ri.dtypes)
<script.py> output:
    stop_date                     object
    stop_time                     object
    driver_gender                 object
    driver_race                   object
    violation_raw                 object
    violation                     object
    search_conducted                bool
    search_type                   object
    stop_outcome                  object
    is_arrested                     bool
    stop_duration                 object
    drugs_related_stop              bool
    district                      object
    stop_datetime         datetime64[ns]
    dtype: object

Now we're ready to set the stop_datetime column as the index.
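As a brief illustration of the date-based attributes a datetime column provides, here's a minimal sketch using the .dt accessor (the same attributes are available directly on a DatetimeIndex, which is what the next exercise sets up):

# Date-based attributes available through the .dt accessor
print(ri.stop_datetime.dt.hour.head())       # hour of the stop (0-23)
print(ri.stop_datetime.dt.month.head())      # month of the stop (1-12)
print(ri.stop_datetime.dt.dayofweek.head())  # day of the week (Monday=0)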

Setting the index

The last step that you'll take in this chapter is to set the stop_datetime column as the DataFrame's index. By replacing the default index with a DatetimeIndex, you'll make it easier to analyze the dataset by date and time, which will come in handy later in the course!

Instructions
  • Set stop_datetime as the DataFrame index.
  • Examine the index to verify that it is a DatetimeIndex.
  • Examine the DataFrame columns to confirm that stop_datetime is no longer one of the columns.
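The code block for this exercise is not present on the page; a minimal solution following the pattern of the other exercises:

# Set 'stop_datetime' as the index of 'ri'
ri.set_index('stop_datetime', inplace=True)

# Examine the index
print(ri.index)

# Examine the columns
print(ri.columns)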
<script.py> output:
    DatetimeIndex(['2005-01-04 12:55:00', '2005-01-23 23:15:00',
                   '2005-02-17 04:15:00', '2005-02-20 17:15:00',
                   '2005-02-24 01:20:00', '2005-03-14 10:00:00',
                   '2005-03-29 21:55:00', '2005-04-04 21:25:00',
                   '2005-07-14 11:20:00', '2005-07-14 19:55:00',
                   ...
                   '2015-12-31 13:23:00', '2015-12-31 18:59:00',
                   '2015-12-31 19:13:00', '2015-12-31 20:20:00',
                   '2015-12-31 20:50:00', '2015-12-31 21:21:00',
                   '2015-12-31 21:59:00', '2015-12-31 22:04:00',
                   '2015-12-31 22:09:00', '2015-12-31 22:47:00'],
                  dtype='datetime64[ns]', name='stop_datetime', length=86536, freq=None)
    Index(['stop_date', 'stop_time', 'driver_gender', 'driver_race',
           'violation_raw', 'violation', 'search_conducted', 'search_type',
           'stop_outcome', 'is_arrested', 'stop_duration', 'drugs_related_stop',
           'district'],
          dtype='object')

Now that you have cleaned the dataset, you can begin analyzing it in the next chapter.

Examining traffic violations

Before comparing the violations being committed by each gender, you should examine the violations committed by all drivers to get a baseline understanding of the data.

In this exercise, you'll count the unique values in the violation column, and then separately express those counts as proportions.

Instructions
  • Count the unique values in the violation column of the ri DataFrame, to see what violations are being committed by all drivers.
  • Express the violation counts as proportions of the total.
# Count the unique values in 'violation'
print(ri.violation.value_counts())

# Express the counts as proportions
print(ri.violation.value_counts(normalize=True))
<script.py> output:
    Speeding               48423
    Moving violation       16224
    Equipment              10921
    Other                   4409
    Registration/plates     3703
    Seat belt               2856
    Name: violation, dtype: int64
    Speeding               0.559571
    Moving violation       0.187483
    Equipment              0.126202
    Other                  0.050950
    Registration/plates    0.042791
    Seat belt              0.033004
    Name: violation, dtype: float64

More than half of all violations are for speeding, followed by other moving violations and equipment violations.

Comparing violations by gender

The question we're trying to answer is whether male and female drivers tend to commit different types of traffic violations.

In this exercise, you'll first create a DataFrame for each gender, and then analyze the violations in each DataFrame separately.

Instructions
  • Create a DataFrame, female, that only contains rows in which driver_gender is 'F'.
  • Create a DataFrame, male, that only contains rows in which driver_gender is 'M'.
  • Count the violations committed by female drivers and express them as proportions.
  • Count the violations committed by male drivers and express them as proportions.
# Create a DataFrame of female drivers
female = ri[ri['driver_gender'] == 'F']

# Create a DataFrame of male drivers
male = ri[ri['driver_gender'] == 'M']

# Compute the violations by female drivers (as proportions)
print(female.violation.value_counts(normalize=True))

# Compute the violations by male drivers (as proportions)
print(male.violation.value_counts(normalize=True))
<script.py> output:
    Speeding               0.658114
    Moving violation       0.138218
    Equipment              0.105199
    Registration/plates    0.044418
    Other                  0.029738
    Seat belt              0.024312
    Name: violation, dtype: float64
    Speeding               0.522243
    Moving violation       0.206144
    Equipment              0.134158
    Other                  0.058985
    Registration/plates    0.042175
    Seat belt              0.036296
    Name: violation, dtype: float64

About two-thirds of female traffic stops are for speeding, whereas stops of males are more balanced among the six categories. This doesn't mean that females speed more often than males, however, since we didn't take into account the number of stops or drivers.

Comparing speeding outcomes by gender

When a driver is pulled over for speeding, many people believe that gender has an impact on whether the driver will receive a ticket or a warning. Can you find evidence of this in the dataset?

First, you'll create two DataFrames of drivers who were stopped for speeding: one containing females and the other containing males.

Then, for each gender, you'll use the stop_outcome column to calculate what percentage of stops resulted in a "Citation" (meaning a ticket) versus a "Warning".

Instructions
  • Create a DataFrame, female_and_speeding, that only includes female drivers who were stopped for speeding.
  • Create a DataFrame, male_and_speeding, that only includes male drivers who were stopped for speeding.
  • Count the stop outcomes for the female drivers and express them as proportions.
  • Count the stop outcomes for the male drivers and express them as proportions.
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri.driver_gender == 'F') & (ri.violation == 'Speeding')]

# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri.driver_gender == 'M') & (ri.violation == 'Speeding')]

# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding.stop_outcome.value_counts(normalize=True))

# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding.stop_outcome.value_counts(normalize=True))
Citation            0.952192
Warning             0.040074
Arrest Driver       0.005752
N/D                 0.000959
Arrest Passenger    0.000639
No Action           0.000383
Name: stop_outcome, dtype: float64
Citation            0.944595
Warning             0.036184
Arrest Driver       0.015895
Arrest Passenger    0.001281
No Action           0.001068
N/D                 0.000976
Name: stop_outcome, dtype: float64

The numbers are similar for males and females: about 95% of stops for speeding result in a ticket. Thus, the data fails to show that gender has an impact on who gets a ticket for speeding.

Computing the search rate

During a traffic stop, the police officer sometimes conducts a search of the vehicle. In this exercise, you'll calculate the percentage of all stops in the ri DataFrame that result in a vehicle search, also known as the search rate.

Instructions
  • Check the data type of search_conducted to confirm that it's a Boolean Series.
  • Calculate the search rate by counting the Series values and expressing them as proportions.
  • Calculate the search rate by taking the mean of the Series. (It should match the proportion of True values calculated above.)
# Check the data type of 'search_conducted'
print(ri.search_conducted.dtypes)

# Calculate the search rate by counting the values
print(ri.search_conducted.value_counts(normalize=True))

# Calculate the search rate by taking the mean
print(ri.search_conducted.mean())
bool
False    0.961785
True     0.038215
Name: search_conducted, dtype: float64
0.0382153092354627

It looks like the search rate is about 3.8%. Next, you'll examine whether the search rate varies by driver gender.

Comparing search rates by gender

In this exercise, you'll compare the rates at which female and male drivers are searched during a traffic stop. Remember that the vehicle search rate across all stops is about 3.8%.

First, you'll filter the DataFrame by gender and calculate the search rate for each group separately. Then, you'll perform the same calculation for both genders at once using a .groupby().

Instructions 1/3
  • Filter the DataFrame to only include female drivers, then calculate the search rate by taking the mean of search_conducted.
# Calculate the search rate for female drivers
print(ri[ri.driver_gender == 'F'].search_conducted.mean())


Instructions 2/3

  • Filter the DataFrame to only include male drivers, then repeat the search rate calculation.
# Calculate the search rate for male drivers
print(ri[ri['driver_gender'] == 'M'].search_conducted.mean())
Instructions 3/3
  • Group by driver gender to calculate the search rate for both groups simultaneously. (It should match the previous results.)
# Calculate the search rate for both groups simultaneously
print(ri.groupby('driver_gender').search_conducted.mean())
<script.py> output:
    driver_gender
    F    0.019181
    M    0.045426
    Name: search_conducted, dtype: float64

Male drivers are searched more than twice as often as female drivers. Why might this be?

Adding a second factor to the analysis

Even though the search rate for males is much higher than for females, it's possible that the difference is mostly due to a second factor.

For example, you might hypothesize that the search rate varies by violation type, and the difference in search rate between males and females is because they tend to commit different violations.

You can test this hypothesis by examining the search rate for each combination of gender and violation. If the hypothesis was true, you would find that males and females are searched at about the same rate for each violation. Find out below if that's the case!

Instructions 1/2
  • Use a .groupby() to calculate the search rate for each combination of gender and violation. Are males and females searched at about the same rate for each violation?
# Calculate the search rate for each combination of gender and violation
print(ri.groupby(['driver_gender', 'violation']).search_conducted.mean())

Instructions 2/2
  • Reverse the ordering to group by violation before gender. The results may be easier to compare when presented this way.
# Reverse the ordering to group by violation before gender
print(ri.groupby(['violation', 'driver_gender']).search_conducted.mean())
violation            driver_gender
Equipment            F                0.039984
                     M                0.071496
Moving violation     F                0.039257
                     M                0.061524
Other                F                0.041018
                     M                0.046191
Registration/plates  F                0.054924
                     M                0.108802
Seat belt            F                0.017301
                     M                0.035119
Speeding             F                0.008309
                     M                0.027885
Name: search_conducted, dtype: float64

For all types of violations, the search rate is higher for males than for females, disproving our hypothesis.

Counting protective frisks

During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a "protective frisk."

In this exercise, you'll first check to see how many times "Protective Frisk" was the only search type. Then, you'll use a string method to locate all instances in which the driver was frisked.

Instructions
  • Count the search_type values in the ri DataFrame to see how many times "Protective Frisk" was the only search type.
  • Create a new column, frisk, that is True if search_type contains the string "Protective Frisk" and False otherwise.
  • Check the data type of frisk to confirm that it's a Boolean Series.
  • Take the sum of frisk to count the total number of frisks.
# Count the 'search_type' values
print(ri.search_type.value_counts())

# Check if 'search_type' contains the string 'Protective Frisk'
ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False)

# Check the data type of 'frisk'
print(ri['frisk'].dtypes)

# Take the sum of 'frisk'
print(ri['frisk'].sum())
Incident to Arrest                                          1290
Probable Cause                                               924
Inventory                                                    219
Reasonable Suspicion                                         214
Protective Frisk                                             164
Incident to Arrest,Inventory                                 123
Incident to Arrest,Probable Cause                            100
Probable Cause,Reasonable Suspicion                           54
Probable Cause,Protective Frisk                               35
Incident to Arrest,Inventory,Probable Cause                   35
Incident to Arrest,Protective Frisk                           33
Inventory,Probable Cause                                      25
Protective Frisk,Reasonable Suspicion                         19
Incident to Arrest,Inventory,Protective Frisk                 18
Incident to Arrest,Probable Cause,Protective Frisk            13
Inventory,Protective Frisk                                    12
Incident to Arrest,Reasonable Suspicion                        8
Incident to Arrest,Probable Cause,Reasonable Suspicion         5
Probable Cause,Protective Frisk,Reasonable Suspicion           5
Incident to Arrest,Inventory,Reasonable Suspicion              4
Inventory,Reasonable Suspicion                                 2
Incident to Arrest,Protective Frisk,Reasonable Suspicion       2
Inventory,Probable Cause,Reasonable Suspicion                  1
Inventory,Protective Frisk,Reasonable Suspicion                1
Inventory,Probable Cause,Protective Frisk                      1
Name: search_type, dtype: int64
bool
303

It looks like there were 303 drivers who were frisked. Next, you'll examine whether gender affects who is frisked.

Comparing frisk rates by gender

In this exercise, you'll compare the rates at which female and male drivers are frisked during a search. Are males frisked more often than females, perhaps because police officers consider them to be higher risk?

Before doing any calculations, it's important to filter the DataFrame to only include the relevant subset of data, namely stops in which a search was conducted.

Instructions
  • Create a DataFrame, searched, that only contains rows in which search_conducted is True.
  • Take the mean of the frisk column to find out what percentage of searches included a frisk.
  • Calculate the frisk rate for each gender using a .groupby().
# Create a DataFrame of stops in which a search was conducted
searched = ri[ri.search_conducted == True]

# Calculate the overall frisk rate by taking the mean of 'frisk'
print(searched.frisk.mean())

# Calculate the frisk rate for each gender
print(searched.groupby('driver_gender').frisk.mean())
0.09162382824312065
driver_gender
F    0.074561
M    0.094353
Name: frisk, dtype: float64

The frisk rate is higher for males than for females, though we can't conclude that this difference is caused by the driver's gender.

Calculating the hourly arrest rate

When a police officer stops a driver, a small percentage of those stops ends in an arrest. This is known as the arrest rate. In this exercise, you'll find out whether the arrest rate varies by time of day.

First, you'll calculate the arrest rate across all stops in the ri DataFrame. Then, you'll calculate the hourly arrest rate by using the hour attribute of the index. The hour ranges from 0 to 23, in which:

  • 0 = midnight
  • 12 = noon
  • 23 = 11 PM
Instructions
  • Take the mean of the is_arrested column to calculate the overall arrest rate.
  • Group by the hour attribute of the DataFrame index to calculate the hourly arrest rate.
  • Save the hourly arrest rate Series as a new object, hourly_arrest_rate.
# Calculate the overall arrest rate
print(ri.is_arrested.mean())

# Calculate the hourly arrest rate
print(ri.groupby(ri.index.hour).is_arrested.mean())

# Save the hourly arrest rate
hourly_arrest_rate = ri.groupby(ri.index.hour).is_arrested.mean()
<script.py> output:
    0.0355690117407784
    stop_datetime
    0     0.051431
    1     0.064932
    2     0.060798
    3     0.060549
    4     0.048000
    5     0.042781
    6     0.013813
    7     0.013032
    8     0.021854
    9     0.025206
    10    0.028213
    11    0.028897
    12    0.037399
    13    0.030776
    14    0.030605
    15    0.030679
    16    0.035281
    17    0.040619
    18    0.038204
    19    0.032245
    20    0.038107
    21    0.064541
    22    0.048666
    23    0.047592
    Name: is_arrested, dtype: float64

Next you'll plot the data so that you can visually examine the arrest rate trends.

Plotting the hourly arrest rate

In this exercise, you'll create a line plot from the hourly_arrest_rate object. A line plot is appropriate in this case because you're showing how a quantity changes over time.

This plot should help you to spot some trends that may not have been obvious when examining the raw numbers!

Instructions
  • Import matplotlib.pyplot using the alias plt.
  • Create a line plot of hourly_arrest_rate using the .plot() method.
  • Label the x-axis as 'Hour', label the y-axis as 'Arrest Rate', and title the plot 'Arrest Rate by Time of Day'.
  • Display the plot using the .show() function.
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Create a line plot of 'hourly_arrest_rate'
hourly_arrest_rate.plot()

# Add the xlabel, ylabel, and title
plt.xlabel('Hour')
plt.ylabel('Arrest Rate')
plt.title('Arrest Rate by Time of Day')

# Display the plot
plt.show()

The arrest rate has a significant spike overnight, and then dips in the early morning hours.

Plotting drug-related stops

In a small portion of traffic stops, drugs are found in the vehicle during a search. In this exercise, you'll assess whether these drug-related stops are becoming more common over time.

The Boolean column drugs_related_stop indicates whether drugs were found during a given stop. You'll calculate the annual drug rate by resampling this column, and then you'll use a line plot to visualize how the rate has changed over time.

Instructions
  • Calculate the annual rate of drug-related stops by resampling the drugs_related_stop column (on the 'A' frequency) and taking the mean.
  • Save the annual drug rate Series as a new object, annual_drug_rate.
  • Create a line plot of annual_drug_rate using the .plot() method.
  • Display the plot using the .show() function.
# Calculate the annual rate of drug-related stops
print(ri.drugs_related_stop.resample('A').mean())

# Save the annual rate of drug-related stops
annual_drug_rate = ri.drugs_related_stop.resample('A').mean()

# Create a line plot of 'annual_drug_rate'
annual_drug_rate.plot()

# Display the plot
plt.show()
<script.py> output:
    stop_datetime
    2005-12-31    0.006501
    2006-12-31    0.007258
    2007-12-31    0.007970
    2008-12-31    0.007505
    2009-12-31    0.009889
    2010-12-31    0.010081
    2011-12-31    0.009731
    2012-12-31    0.009921
    2013-12-31    0.013094
    2014-12-31    0.013826
    2015-12-31    0.012266
    Freq: A-DEC, Name: drugs_related_stop, dtype: float64

The rate of drug-related stops nearly doubled over the course of 10 years. Why might that be the case?

Comparing drug and search rates

As you saw in the last exercise, the rate of drug-related stops increased significantly between 2005 and 2015. You might hypothesize that the rate of vehicle searches was also increasing, which would have led to an increase in drug-related stops even if more drivers were not carrying drugs.

You can test this hypothesis by calculating the annual search rate, and then plotting it against the annual drug rate. If the hypothesis is true, then you'll see both rates increasing over time.

Instructions
  • Calculate the annual search rate by resampling the search_conducted column, and save the result as annual_search_rate.
  • Concatenate annual_drug_rate and annual_search_rate along the columns axis, and save the result as annual.
  • Create subplots of the drug and search rates from the annual DataFrame.
  • Display the subplots.
# Calculate and save the annual search rate
annual_search_rate = ri.search_conducted.resample('A').mean()

# Concatenate 'annual_drug_rate' and 'annual_search_rate'
annual = pd.concat([annual_drug_rate, annual_search_rate], axis='columns')

# Create subplots from 'annual'
annual.plot(subplots=True)

# Display the subplots
plt.show()

The rate of drug-related stops increased even though the search rate decreased, disproving our hypothesis.

Crosstab: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
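For a quick refresher on what pd.crosstab() produces, here's a minimal sketch on a toy DataFrame (the toy data is invented purely for illustration):

import pandas as pd

# Each cell counts how often a (zone, violation) pair occurs
toy = pd.DataFrame({'zone': ['K1', 'K1', 'K2', 'K2', 'K2'],
                    'violation': ['Speeding', 'Equipment', 'Speeding', 'Speeding', 'Equipment']})
print(pd.crosstab(toy.zone, toy.violation))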

Tallying violations by district

The state of Rhode Island is broken into six police districts, also known as zones. How do the zones compare in terms of what violations are caught by police?

In this exercise, you'll create a frequency table to determine how many violations of each type took place in each of the six zones. Then, you'll filter the table to focus on the "K" zones, which you'll examine further in the next exercise.

Instructions
  • Create a frequency table from the ri DataFrame's district and violation columns using the pd.crosstab() function.
  • Save the frequency table as a new object, all_zones.
  • Select rows 'Zone K1' through 'Zone K3' from all_zones using the .loc[] accessor.
  • Save the smaller table as a new object, k_zones.
# Create a frequency table of districts and violations
print(pd.crosstab(ri.district, ri.violation))

# Save the frequency table as 'all_zones'
all_zones = pd.crosstab(ri.district, ri.violation)

# Select rows 'Zone K1' through 'Zone K3'
print(all_zones.loc['Zone K1':'Zone K3'])

# Save the smaller table as 'k_zones'
k_zones = all_zones.loc['Zone K1':'Zone K3']
violation  Equipment  Moving violation  Other  Registration/plates  Seat belt  \
district
Zone K1          672              1254    290                  120          0
Zone K2         2061              2962    942                  768        481
Zone K3         2302              2898    705                  695        638
Zone X1          296               671    143                   38         74
Zone X3         2049              3086    769                  671        820
Zone X4         3541              5353   1560                 1411        843

violation  Speeding
district
Zone K1        5960
Zone K2       10448
Zone K3       12322
Zone X1        1119
Zone X3        8779
Zone X4        9795

violation  Equipment  Moving violation  Other  Registration/plates  Seat belt  \
district
Zone K1          672              1254    290                  120          0
Zone K2         2061              2962    942                  768        481
Zone K3         2302              2898    705                  695        638

violation  Speeding
district
Zone K1        5960
Zone K2       10448
Zone K3       12322

Next you'll plot the violations so that you can compare these districts.

Plotting violations by district

Now that you've created a frequency table focused on the "K" zones, you'll visualize the data to help you compare what violations are being caught in each zone.

First you'll create a bar plot, which is an appropriate plot type since you're comparing categorical data. Then you'll create a stacked bar plot in order to get a slightly different look at the data. Which plot do you find to be more insightful?

Instructions 1/2
  • Create a bar plot of k_zones.
  • Display the plot and examine it. What do you notice about each of the zones?
# Create a bar plot of 'k_zones'
k_zones.plot(kind='bar')

# Display the plot
plt.show()
Instructions 2/2
  • Create a stacked bar plot of k_zones.
  • Display the plot and examine it. Do you notice anything different about the data than you did previously?
# Create a stacked bar plot of 'k_zones'
k_zones.plot(kind='bar', stacked=True)

# Display the plot
plt.show()

Interesting! The vast majority of traffic stops in Zone K1 are for speeding, and Zones K2 and K3 are remarkably similar to one another in terms of violations.

Converting stop durations to numbers

In the traffic stops dataset, the stop_duration column tells you approximately how long the driver was detained by the officer. Unfortunately, the durations are stored as strings, such as '0-15 Min'. How can you make this data easier to analyze?

In this exercise, you'll convert the stop durations to integers. Because the precise durations are not available, you'll have to estimate the numbers using reasonable values:

  • Convert '0-15 Min' to 8
  • Convert '16-30 Min' to 23
  • Convert '30+ Min' to 45
Instructions
  • Print the unique values in the stop_duration column. (This has been done for you.)
  • Create a dictionary called mapping that maps the stop_duration strings to the integers specified above.
  • Convert the stop_duration strings to integers using the mapping, and store the results in a new column called stop_minutes.
  • Print the unique values in the stop_minutes column, to verify that the durations were properly converted to integers.
# Print the unique values in 'stop_duration'
print(ri.stop_duration.unique())

# Create a dictionary that maps strings to integers
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}

# Convert the 'stop_duration' strings to integers using the 'mapping'
ri['stop_minutes'] = ri.stop_duration.map(mapping)

# Print the unique values in 'stop_minutes'
print(ri.stop_minutes.unique())
<script.py> output:
    ['0-15 Min' '16-30 Min' '30+ Min']
    [ 8 23 45]

Next you'll analyze the stop length for each type of violation.
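One detail worth knowing about .map(): any value that doesn't appear in the dictionary silently becomes NaN rather than raising an error, which is why verifying the result with .unique() is a good habit. A minimal sketch of that behavior (toy Series for illustration only):

import pandas as pd

s = pd.Series(['0-15 Min', '16-30 Min', 'Unknown'])
# 'Unknown' has no entry in the dictionary, so it maps to NaN
print(s.map({'0-15 Min': 8, '16-30 Min': 23}))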

Plotting stop length

If you were stopped for a particular violation, how long might you expect to be detained?

In this exercise, you'll visualize the average length of time drivers are stopped for each type of violation. Rather than using the violation column in this exercise, you'll use violation_raw since it contains more detailed descriptions of the violations.

Instructions
  • For each value in the ri DataFrame's violation_raw column, calculate the mean number of stop_minutes that a driver is detained.
  • Save the resulting Series as a new object, stop_length.
  • Sort stop_length by its values, and then visualize it using a horizontal bar plot.
  • Display the plot.
# Calculate the mean 'stop_minutes' for each value in 'violation_raw'
print(ri.groupby('violation_raw').stop_minutes.mean())

# Save the resulting Series as 'stop_length'
stop_length = ri.groupby('violation_raw').stop_minutes.mean()

# Sort 'stop_length' by its values and create a horizontal bar plot
stop_length.sort_values().plot(kind='barh')

# Display the plot
plt.show()
<script.py> output:
    violation_raw
    APB                                 17.967033
    Call for Service                    22.124371
    Equipment/Inspection Violation      11.445655
    Motorist Assist/Courtesy            17.741463
    Other Traffic Violation             13.844490
    Registration Violation              13.736970
    Seatbelt Violation                   9.662815
    Special Detail/Directed Patrol      15.123632
    Speeding                            10.581562
    Suspicious Person                   14.910714
    Violation of City/Town Ordinance    13.254144
    Warrant                             24.055556
    Name: stop_minutes, dtype: float64

You've completed the chapter on visual exploratory data analysis!

Plotting the temperature

In this exercise, you'll examine the temperature columns from the weather dataset to assess whether the data seems trustworthy. First you'll print the summary statistics, and then you'll visualize the data using a box plot.

When deciding whether the values seem reasonable, keep in mind that the temperature is measured in degrees Fahrenheit, not Celsius!

Instructions
  • Read weather.csv into a DataFrame named weather.
  • Select the temperature columns (TMIN, TAVG, TMAX) and print their summary statistics using the .describe() method.
  • Create a box plot to visualize the temperature columns.
  • Display the plot.
# Read 'weather.csv' into a DataFrame named 'weather'
weather = pd.read_csv('weather.csv')

# Describe the temperature columns
print(weather[['TMIN', 'TAVG', 'TMAX']].describe())

# Create a box plot of the temperature columns
weather[['TMIN', 'TAVG', 'TMAX']].plot(kind='box')

# Display the plot
plt.show()
<script.py> output:
                  TMIN         TAVG         TMAX
    count  4017.000000  1217.000000  4017.000000
    mean     43.484441    52.493016    61.268608
    std      17.020298    17.830714    18.199517
    min      -5.000000     6.000000    15.000000
    25%      30.000000    39.000000    47.000000
    50%      44.000000    54.000000    62.000000
    75%      58.000000    68.000000    77.000000
    max      77.000000    86.000000   102.000000

The temperature data looks good so far: the TAVG values are in between TMIN and TMAX, and the measurements and ranges seem reasonable.

Plotting the temperature difference

In this exercise, you'll continue to assess whether the dataset seems trustworthy by plotting the difference between the maximum and minimum temperatures.

What do you notice about the resulting histogram? Does it match your expectations, or do you see anything unusual?

Instructions
  • Create a new column in the weather DataFrame named TDIFF that represents the difference between the maximum and minimum temperatures.
  • Print the summary statistics for TDIFF using the .describe() method.
  • Create a histogram with 20 bins to visualize TDIFF.
  • Display the plot.
# Create a 'TDIFF' column that represents temperature difference
weather['TDIFF'] = weather.TMAX - weather.TMIN

# Describe the 'TDIFF' column
print(weather.TDIFF.describe())

# Create a histogram with 20 bins to visualize 'TDIFF'
weather.TDIFF.plot(kind='hist', bins=20)

# Display the plot
plt.show()

<script.py> output:
    count    4017.000000
    mean       17.784167
    std         6.350720
    min         2.000000
    25%        14.000000
    50%        18.000000
    75%        22.000000
    max        43.000000
    Name: TDIFF, dtype: float64

The TDIFF column has no negative values and its distribution is approximately normal, both of which are signs that the data is trustworthy.

Counting bad weather conditions

The weather DataFrame contains 20 columns that start with 'WT', each of which represents a bad weather condition. For example:

  • WT05 indicates "Hail"
  • WT11 indicates "High or damaging winds"
  • WT17 indicates "Freezing rain"

For every row in the dataset, each WT column contains either a 1 (meaning the condition was present that day) or NaN (meaning the condition was not present).

In this exercise, you'll quantify "how bad" the weather was each day by counting the number of 1 values in each row.
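This row-counting trick works because .sum() skips NaN values by default, so summing across a row of 1s and NaNs simply counts the 1s. A minimal sketch with toy data:

import numpy as np
import pandas as pd

# 1 means the condition was present; NaN means it was absent
wt = pd.DataFrame({'WT01': [1, np.nan], 'WT02': [1, np.nan], 'WT03': [np.nan, 1]})
print(wt.sum(axis='columns'))  # row sums: 2.0 and 1.0 (NaN is treated as 0)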

Instructions
  • Copy the columns WT01 through WT22 from weather to a new DataFrame named WT.
  • Calculate the sum of each row in WT, and store the results in a new weather column named bad_conditions.
  • Replace any missing values in bad_conditions with a 0. (This has been done for you.)
  • Create a histogram to visualize bad_conditions, and then display the plot.
# Copy 'WT01' through 'WT22' to a new DataFrame
WT = weather.loc[:, 'WT01':'WT22']

# Calculate the sum of each row in 'WT'
weather['bad_conditions'] = WT.sum(axis='columns')

# Replace missing values in 'bad_conditions' with '0'
weather['bad_conditions'] = weather.bad_conditions.fillna(0).astype('int')

# Create a histogram to visualize 'bad_conditions'
weather.bad_conditions.plot(kind='hist')

# Display the plot
plt.show()

It looks like many days didn't have any bad weather conditions, and only a small portion of days had more than four bad weather conditions.

Rating the weather conditions

In the previous exercise, you counted the number of bad weather conditions each day. In this exercise, you'll use the counts to create a rating system for the weather.

The counts range from 0 to 9, and should be converted to ratings as follows:

  • Convert 0 to 'good'
  • Convert 1 through 4 to 'bad'
  • Convert 5 through 9 to 'worse'
Instructions
  • Count the unique values in the bad_conditions column and sort the index. (This has been done for you.)
  • Create a dictionary called mapping that maps the bad_conditions integers to strings as specified above.
  • Convert the bad_conditions integers to strings using the mapping and store the results in a new column called rating.
  • Count the unique values in rating to verify that the integers were properly converted to strings.
# Count the unique values in 'bad_conditions' and sort the index
print(weather.bad_conditions.value_counts().sort_index())

# Create a dictionary that maps integers to strings
mapping = {0: 'good', 1: 'bad', 2: 'bad', 3: 'bad', 4: 'bad',
           5: 'worse', 6: 'worse', 7: 'worse', 8: 'worse', 9: 'worse'}

# Convert the 'bad_conditions' integers to strings using the 'mapping'
weather['rating'] = weather.bad_conditions.map(mapping)

# Count the unique values in 'rating'
print(weather.rating.value_counts())
<script.py> output:
    0    1749
    1     613
    2     367
    3     380
    4     476
    5     282
    6     101
    7      41
    8       4
    9       4
    Name: bad_conditions, dtype: int64
    bad      1836
    good     1749
    worse     432
    Name: rating, dtype: int64

This rating system should make the weather data easier to understand.

Changing the data type to category

Since the rating column only has a few possible values, you'll change its data type to category in order to store the data more efficiently. You'll also specify a logical order for the categories, which will be useful for future exercises.

Instructions
  • Create a list object called cats that lists the weather ratings in a logical order: 'good', 'bad', 'worse'.
  • Change the data type of the rating column from object to category. Make sure to use the cats list to define the category ordering.
  • Examine the head of the rating column to confirm that the categories are logically ordered.
# Create a list of weather ratings in logical order
cats = ['good', 'bad', 'worse']

# Change the data type of 'rating' to an ordered category
# (recent pandas versions require a CategoricalDtype; the older
# .astype('category', ordered=True, categories=cats) call was removed)
from pandas.api.types import CategoricalDtype
weather['rating'] = weather.rating.astype(CategoricalDtype(categories=cats, ordered=True))

# Examine the head of 'rating'
print(weather.rating.head())
<script.py> output:
    0    bad
    1    bad
    2    bad
    3    bad
    4    bad
    Name: rating, dtype: category
    Categories (3, object): [good < bad < worse]

You'll use the rating column in future exercises to analyze the effects of weather conditions on police behavior.
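One payoff of the ordered category type is that comparisons respect the logical ordering defined above. A minimal sketch (this works because good < bad < worse was specified when the dtype was created):

# Select days rated worse than 'good' (i.e., 'bad' or 'worse')
print(weather[weather.rating > 'good'].shape)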

Preparing the DataFrames

In this exercise, you'll prepare the traffic stop and weather rating DataFrames so that they're ready to be merged:

  1. With the ri DataFrame, you'll move the stop_datetime index to a column since the index will be lost during the merge.
  2. With the weather DataFrame, you'll select the DATE and rating columns and put them in a new DataFrame.
Instructions
  • Reset the index of the ri DataFrame.
  • Examine the head of ri to verify that stop_datetime is now a DataFrame column, and the index is now the default integer index.
  • Create a new DataFrame named weather_rating that contains only the DATE and rating columns from the weather DataFrame.
  • Examine the head of weather_rating to verify that it contains the proper columns.
# Reset the index of 'ri'
ri.reset_index(inplace=True)

# Examine the head of 'ri'
print(ri.head())

# Create a DataFrame from the 'DATE' and 'rating' columns
weather_rating = weather[['DATE', 'rating']]

# Examine the head of 'weather_rating'
print(weather_rating.head())
<script.py> output:
            stop_datetime   stop_date stop_time driver_gender driver_race  \
    0 2005-01-04 12:55:00  2005-01-04     12:55             M       White
    1 2005-01-23 23:15:00  2005-01-23     23:15             M       White
    2 2005-02-17 04:15:00  2005-02-17     04:15             M       White
    3 2005-02-20 17:15:00  2005-02-20     17:15             M       White
    4 2005-02-24 01:20:00  2005-02-24     01:20             F       White

                        violation_raw  violation  search_conducted search_type  \
    0  Equipment/Inspection Violation  Equipment             False         NaN
    1                        Speeding   Speeding             False         NaN
    2                        Speeding   Speeding             False         NaN
    3                Call for Service      Other             False         NaN
    4                        Speeding   Speeding             False         NaN

        stop_outcome  is_arrested stop_duration  drugs_related_stop district  \
    0       Citation        False      0-15 Min               False  Zone X4
    1       Citation        False      0-15 Min               False  Zone K3
    2       Citation        False      0-15 Min               False  Zone X4
    3  Arrest Driver         True     16-30 Min               False  Zone X1
    4       Citation        False      0-15 Min               False  Zone X3

       frisk  stop_minutes
    0  False             8
    1  False             8
    2  False             8
    3  False            23
    4  False             8

             DATE rating
    0  2005-01-01    bad
    1  2005-01-02    bad
    2  2005-01-03    bad
    3  2005-01-04    bad
    4  2005-01-05    bad

The ri and weather_rating DataFrames are now ready to be merged.

Merging the DataFrames

In this exercise, you'll merge the ri and weather_rating DataFrames into a new DataFrame, ri_weather.

The DataFrames will be joined using the stop_date column from ri and the DATE column from weather_rating. Thankfully the date formatting matches exactly, which is not always the case!

Once the merge is complete, you'll set stop_datetime as the index, which is the column you saved in the previous exercise.
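Had the date formats not matched, a common fix (a hypothetical sketch, not part of this exercise) is to parse both columns with pd.to_datetime() and reformat them consistently before merging:

import pandas as pd

# Hypothetical: normalize MM/DD/YYYY strings to YYYY-MM-DD before a merge
s = pd.Series(['01/04/2005', '01/23/2005'])
print(pd.to_datetime(s, format='%m/%d/%Y').dt.strftime('%Y-%m-%d'))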

Instructions
  • Examine the shape of the ri DataFrame.
  • Merge the ri and weather_rating DataFrames using a left join.
  • Examine the shape of ri_weather to confirm that it has two more columns but the same number of rows as ri.
  • Replace the index of ri_weather with the stop_datetime column.
# Examine the shape of 'ri'
print(ri.shape)

# Merge 'ri' and 'weather_rating' using a left join
ri_weather = pd.merge(left=ri, right=weather_rating, left_on='stop_date', right_on='DATE', how='left')

# Examine the shape of 'ri_weather'
print(ri_weather.shape)

# Set 'stop_datetime' as the index of 'ri_weather'
ri_weather.set_index('stop_datetime', inplace=True)
<script.py> output:
    (86536, 16)
    (86536, 18)

In the next section, you'll use ri_weather to analyze the relationship between weather conditions and police behavior.

Comparing arrest rates by weather rating

Do police officers arrest drivers more often when the weather is bad? Find out below!

  • First, you'll calculate the overall arrest rate.
  • Then, you'll calculate the arrest rate for each of the weather ratings you previously assigned.
  • Finally, you'll add violation type as a second factor in the analysis, to see if that accounts for any differences in the arrest rate.

Since you previously defined a logical order for the weather categories, good < bad < worse, they will be sorted that way in the results.

Instructions 1/3
  • Calculate the overall arrest rate by taking the mean of the is_arrested Series.
# Calculate the overall arrest rate
print(ri_weather.is_arrested.mean())
<script.py> output:
    0.0355690117407784

2/3 Calculate the arrest rate for each weather rating using a .groupby().

# Calculate the arrest rate for each 'rating'
print(ri_weather.groupby('rating').is_arrested.mean())
<script.py> output:
    rating
    good     0.033715
    bad      0.036261
    worse    0.041667
    Name: is_arrested, dtype: float64

3/3 Calculate the arrest rate for each combination of violation and rating. How do the arrest rates differ by group?

# Calculate the arrest rate for each 'violation' and 'rating'
print(ri_weather.groupby(['violation', 'rating']).is_arrested.mean())
<script.py> output:
    violation            rating
    Equipment            good      0.059007
                         bad       0.066311
                         worse     0.097357
    Moving violation     good      0.056227
                         bad       0.058050
                         worse     0.065860
    Other                good      0.076966
                         bad       0.087443
                         worse     0.062893
    Registration/plates  good      0.081574
                         bad       0.098160
                         worse     0.115625
    Seat belt            good      0.028587
                         bad       0.022493
                         worse     0.000000
    Speeding             good      0.013405
                         bad       0.013314
                         worse     0.016886
    Name: is_arrested, dtype: float64

The arrest rate increases as the weather gets worse, and that trend persists across many of the violation types. This doesn't prove a causal link, but it's quite an interesting result!

Selecting from a multi-indexed Series

The output of a single .groupby() operation on multiple columns is a Series with a MultiIndex. Working with this type of object is similar to working with a DataFrame:

  • The outer index level is like the DataFrame rows.
  • The inner index level is like the DataFrame columns.

In this exercise, you'll practice accessing data from a multi-indexed Series using the .loc[] accessor.

Instructions
  • Save the output of the .groupby() operation from the last exercise as a new object, arrest_rate. (This has been done for you.)
  • Print the arrest_rate Series and examine it.
  • Print the arrest rate for moving violations in bad weather.
  • Print the arrest rates for speeding violations in all three weather conditions.
# Save the output of the groupby operation from the last exercise
arrest_rate = ri_weather.groupby(['violation', 'rating']).is_arrested.mean()

# Print the 'arrest_rate' Series
print(arrest_rate)

# Print the arrest rate for moving violations in bad weather
print(arrest_rate.loc['Moving violation', 'bad'])

# Print the arrest rates for speeding violations in all three weather conditions
print(arrest_rate.loc['Speeding'])
<script.py> output:
    violation            rating
    Equipment            good      0.059007
                         bad       0.066311
                         worse     0.097357
    Moving violation     good      0.056227
                         bad       0.058050
                         worse     0.065860
    Other                good      0.076966
                         bad       0.087443
                         worse     0.062893
    Registration/plates  good      0.081574
                         bad       0.098160
                         worse     0.115625
    Seat belt            good      0.028587
                         bad       0.022493
                         worse     0.000000
    Speeding             good      0.013405
                         bad       0.013314
                         worse     0.016886
    Name: is_arrested, dtype: float64
    0.05804964058049641
    rating
    good     0.013405
    bad      0.013314
    worse    0.016886
    Name: is_arrested, dtype: float64

The .loc[] accessor is a powerful and flexible tool for data selection.

Reshaping the arrest rate data

In this exercise, you'll start by reshaping the arrest_rate Series into a DataFrame. This is a useful step when working with any multi-indexed Series, since it enables you to access the full range of DataFrame methods.

Then, you'll create the exact same DataFrame using a pivot table. This is a great example of how pandas often gives you more than one way to reach the same result!

Instructions
  • Unstack the arrest_rate Series to reshape it into a DataFrame.
  • Create the exact same DataFrame using a pivot table! Each of the three .pivot_table() parameters should be specified as one of the ri_weather columns.
# Unstack the 'arrest_rate' Series into a DataFrame
print(arrest_rate.unstack())

# Create the same DataFrame using a pivot table
print(ri_weather.pivot_table(index='violation', columns='rating', values='is_arrested'))
<script.py> output:
    rating                   good       bad     worse
    violation
    Equipment            0.059007  0.066311  0.097357
    Moving violation     0.056227  0.058050  0.065860
    Other                0.076966  0.087443  0.062893
    Registration/plates  0.081574  0.098160  0.115625
    Seat belt            0.028587  0.022493  0.000000
    Speeding             0.013405  0.013314  0.016886
    rating                   good       bad     worse
    violation
    Equipment            0.059007  0.066311  0.097357
    Moving violation     0.056227  0.058050  0.065860
    Other                0.076966  0.087443  0.062893
    Registration/plates  0.081574  0.098160  0.115625
    Seat belt            0.028587  0.022493  0.000000
    Speeding             0.013405  0.013314  0.016886

In the future, when you need to create a DataFrame like this, you can choose whichever method makes the most sense to you.


Source: https://www.hylkerozema.nl/2021/10/21/examining-the-dataset/
