
June 14, 2016



Your results are an artifact. As the data dictionary for the taxi data set states, the tip_amount field is populated for credit card transactions but cash tips are not included.

Really good work processing these files. I wonder if what you've found is the likelihood of a driver declaring a tip, rather than a tip being left? i.e. if a driver gets a card payment it's hard to hide a tip, but a cash tip can go straight into their pocket and doesn't need to be declared. Not knowing taxi management structures or USA tax schemes, I have no idea if this is something they'd want to do!

Thanks for your comments!

Michael: we do see 0.0005490915 * 289816908 = 159136 cash tips in the dataset. It is quite possible that many cash tips were not recorded in this dataset.
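As a quick check of the arithmetic in the reply above, here is a minimal Python sketch. The variable names are illustrative assumptions; only the two numbers come from the reply.

```python
# Both figures are quoted in the reply above; the names are hypothetical.
cash_tip_rate = 0.0005490915   # fraction quoted in the reply
total_rows = 289816908         # row count quoted in the reply

# Multiply and round to the nearest whole record.
cash_tips = round(cash_tip_rate * total_rows)
print(cash_tips)  # 159136
```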

Mike: great observation about the declaration of cash tips. Unfortunately we don't have data to compute the likelihood of declaring a cash tip.

It's a nice demonstration of handling large files, but do you really need that amount of data to fit a logistic regression model? It would be interesting to see whether the results differ if you took a fraction of the data and rebuilt the model.
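The subsampling experiment suggested in the comment above could be sketched roughly as follows. This is not the post's method or data: it uses synthetic data with a single made-up predictor, a hand-written Newton-Raphson fit, and hypothetical names throughout (`fit_logistic`, `TRUE_B0`, `TRUE_B1`), purely to show the shape of a full-data versus 10%-sample comparison.

```python
import math
import random


def fit_logistic(xs, ys, iterations=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson; return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # information matrix (negative Hessian)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Synthetic data (an assumption for illustration, not the taxi data set).
TRUE_B0, TRUE_B1 = -1.0, 0.8
rng = random.Random(0)
n = 20000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [
    1 if rng.random() < 1.0 / (1.0 + math.exp(-(TRUE_B0 + TRUE_B1 * x))) else 0
    for x in xs
]

# Fit once on everything, once on a 10% random sample without replacement.
full_fit = fit_logistic(xs, ys)
idx = rng.sample(range(n), n // 10)
sample_fit = fit_logistic([xs[i] for i in idx], [ys[i] for i in idx])

print("full data :", full_fit)
print("10% sample:", sample_fit)
```

If the two sets of coefficients agree to within their sampling noise, the smaller fit may be good enough for estimation; whether that holds for the real data would have to be checked on the real data.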

The comments to this entry are closed.
