Parkside Parenting Partnership Project Introduction

9 minute read

Project Overview and Aim

For my capstone project I am working with the Parkside Parenting Partnership, a network of parents providing a number of peer-to-peer services, primarily advice and social opportunities in Brooklyn, NY.

The group’s founder has asked me to look into the group’s membership and answer the following questions:

  • When do people typically join PPP relative to their children’s birth dates?
  • When do people typically let membership lapse?
  • What is the average length of membership
  • Is there seasonality associated with membership?
  • How have membership levels changed over the years?
  • How has the group’s reach changed over the years?
  • How can we continue to grow?

In addition to answering these questions I hope to be able to provide understanding of why people join and how they participate in the discussions. For example, there may be a correlation between longevity of membership and frequency of starting / responding to posts.

Proposed methods and models

The purely quantitative questions of when do members join or lapse and what membership patterns are in terms of growth and seasonality should be answerable from an initial data set of the organization’s 17,682 members since 2002.

The more qualitative questions will be trickier, however, as these data do not contain any demographic data. I will definitely have to dive into text analysis to get a real read on why people join and what makes them stay or leave.

For these questions I will likely be using a combination of classification models and NLP to understand things like what makes a short-term versus long-term member and whether there have been any changes in activity or options on the website over the years that might be contributing to changes in membership.

PPP’s membership has been spreading for years and I would like to see whether any demographic shifts in surrounding neighborhoods have coincided with increased membership in PPP. This could allow me to suggest areas that could be marketing targets.

Data dictionary (initial data set)

Field Definition
mem_no Member Number
address Street Address
city City
state State (2 letter code)
zip 5-digit ZIP code
joined Date member joined Parenting Network
exp_date Date membership will expire
status [‘Active’, ‘Expired’, ‘Frozen’, ‘Pending’, ‘Prospective’]
mem_type Membership duration (e.g., 1 year, 2 year, lifetime)
last_renewal_date last renewal date (if none then this is join date)
gender male / female
club_email opt-in to get club email
dup prior member rejoining (duplicate membership)
parent_status [‘No’, ‘No, but we are pregnant/adopting’, ‘Yes’]
kid_count Number of children, includes pending children = 0.5 , over 4 children = 6
kid1_bday Date of first child’s birthday
kid2_bday Date of second child’s birthday (if no second child then this is first child’s birthday)
join_reason Open response
advice_grp Interest in receiving email from groups
classifieds Interest in receiving classifieds emails
classifieds_spouse Secondary member interest in receiving classifieds
tony_kids Interest in free membership to Time Out New York Kids for a year
discovered How did the member find out about PPPP
classifieds_spouse Secondary member interest in receiving classifieds
tony_kids Interest in free membership to Time Out New York Kids for a year
discovered How did the member find out about PPP


Data Cleaning

We were told early on that data cleaning was a significant part of Data Science and this project has definitely exposed the truth in that statement.

The initial data were given to me in two csv files.

The first concentrated on member information including: contact information (phone numbers, addresses, emails) gender whether they were joining with a spouse join date membership type last renewal date opt-in for email and printed newsletter.

The second had information on the members’ children including: number of children birthdays of first and second children whether they had been a member previously reason for joining how they found out about the group enrollment in a pregnancy / baby group opt-in for advice or classifieds emails opt-in for free Time Out New York Kids subscription

Both lists included membership numbers so that was the most appropriate level to merge the two. After excluding duplicated fields I was left with the following features and had to clean as indicated.

Street address:

I decided to exclude the address field that would normally contain an apartment number altogether. One thought I’d had regarding the address was to try to calculate distance from the organization’s headquarters to each member so I was hopeful that I could clean this field up enough to get to that detail. After some initial cleaning, removing white spaces and punctuation and standardizing words like “street” and “avenue”, it became clear that the time it would take to clean the addresses to the state necessary for this would be onerous between the shorthand for street names, the inconsistency of apartment numbers being in this field, and many addresses omitting the “street” or “avenue” in a city where there are similar names for both (e.g. “102 3rd” could mean 3rd Street, 3rd Avenue, or 3rd Place). Ultimately the same analysis could be done, if not to the same level of detail, by zip code.

I might return to clean street addresses in the future, especially if I am unsatisfied with the results from the zip code analysis.

City:

There were 38 ways that Brooklyn was misspelled in this dataset. At one point I actually had to look up the spelling to make sure I had it right.

Country:

As the focus of this study was on New York I excluded any international members.

Zip codes:

One of the reasons the discussion on foreign cities emerged was the format of overseas zip codes, which are a combination of letters and numbers separated by a space. Once international membership was excluded the only issues were with four digits–the +4 extensions for zip codes and a number of zip codes with only 4 digits.

To cut off the “+4” portion of the zip code I just stripped off the last 5 digits of any entry with over 5 digits. The four-digit zip codes turned out to just be five-digit zips that were missing a leading 0.

Dates (in general):

Dates were a huge challenge for me as I’d never had to access a date as a date. My previous experiences with dates just needed years and I was able to take those as ints. Here I would really need to be able to look at dates AS dates. This was especially important as there were some very strange outliers in some fields, such as children who were born in the 2nd year CE (the year read 0002 instead of 2002) and some that were predicted to be born in the future (in 2040 rather than 2004). In order to find them I needed to be able to sort them and if they weren’t formatted as dates they would be buried in the list by month.

Last Renewal Date:

There were over 14,000 records without a listed renewal date; this presented a problem. With other fields there might be a handful of nas that I would just drop since I had a really healthy starting data set. Here it would be a clear majority. My solution was to impute the join date as a renewal date if none existed, theorizing that to be “renewed” was equivalent to someone joining again. I will also create dummy columns for this showing whether or not a member has renewed. Having more renewal data would have been useful in order to classify what drove people to renew. This might be something to look into in a future iteration of this project.

Membership level:

This field identifies a member as being primary or secondary. Basically is it the person who signed up or that person’s parenting partner (e.g., spouse). As none of the project’s objectives really deal with secondaries directly I just dropped these members as they present both noise and duplication.

Number of children:

There were some inconsistencies here, including some responses having line breaks leading into the response, three “not quite yet” responses, and a “more than four” option. It was easy enough to take out the line breaks and for the “more than four” I just imputed 6 with no basis but gut. The “not quite yet” responses were more problematic as it would be important to identify these members but not say that they had zero children. I decided to take Solomon’s path and assign these members a child count of 0.5.

There were some inconsistencies here, including some responses having line breaks leading into the response, three “not quite yet” responses, and a “more than four” option. It was easy enough to take out the line breaks and for the “more than four” I just imputed 6 with no basis but gut. The “not quite yet” responses were more problematic as it would be important to identify these members but not say that they had zero children. I decided to take Solomon’s path and assign these members a child count of 0.5.

Second child’s birthday:

Similar to the renewal date, this was a situation where I couldn’t drop nas because there were too many. I opted to impute the first child’s birthdate here. Later on I can either use a dummy of “has second child” or identify cases where I did this by comparing the two birthdays (though this would eliminate twins–will need to check how many twins are in the group).

Reason for joining:

This is an open response field which I’ll be taking out when doing most of the rest of the analysis, keeping it for an NLP analysis.

This field also presented the biggest challenge of them all. I decided to isolate the data cleanup in its own notebook and to export a csv from it to use in my analyses in other notebooks. When I looked at the file after exporting I noticed that there were several instances where a row ended early, then there was a blank line, and then the remainder of the earlier row was printed at the head of a new line.

I finally found that because the open response allowed for carriage returns the csv writer was picking those up as a signal to start a new line. Once discovered it was easy enough to run a replace for “\r”.

Additional Data

As suggested earlier, I will work with the Parkside Parenting Partnership to get access to discussion group information in order to do NLP and behavior analysis, which I hope would be able to identify topics of interest to members that the organizers may not be aware of, as well as to see whether there is information to be gleaned there regarding membership renewal / longevity.


Additionally I will try to get information about the areas where membership has increased in the past and see if there are demographic trends that might suggest where the group’s next surge in membership might come from.

I might also try to find other similar groups and find out their membership size and geographic reach. It may be that there is a boundary past which the distance is too great to expect new members to join.


Additional posts on this project can be found at:

Parkside Parenting Partnership Data Analysis
Parkside Parenting Partnership Model and Clustering

Updated: