Fast Food Nutritional Information

Understand

This dataset lists fast food restaurants and the items they serve. For each item, the dataset reports:

  • Calories
  • Calories from fat
  • Total Fat
  • Saturated Fat
  • Sodium
  • Protein
  • Etc.

The questions I have about this data are:

  • Which fast food restaurants are the best and worst in terms of health?
  • Which items are the healthiest or unhealthiest (most calories)?
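Both questions can be explored with a grouped summary. The sketch below uses a small made-up table: the column names `restaurant`, `item`, and `calories` match the dataset, but the rows and values are invented for illustration. It also treats "most calories" as a stand-in for "unhealthiest", which is a simplifying assumption.

```python
import pandas as pd

# Toy rows for illustration only; these are not values from the real dataset.
toy = pd.DataFrame({
    "restaurant": ["A", "A", "B", "B", "C"],
    "item": ["burger", "wrap", "taco", "nachos", "sub"],
    "calories": [700, 400, 300, 900, 350],
})

# Average calories per restaurant, lowest (assumed "healthiest") first.
mean_calories = toy.groupby("restaurant")["calories"].mean().sort_values()

# Single items with the fewest and the most calories.
lowest_item = toy.loc[toy["calories"].idxmin(), "item"]
highest_item = toy.loc[toy["calories"].idxmax(), "item"]

print(mean_calories)
print(lowest_item, highest_item)
```

The same two lines should work on the real `fastfood` frame as long as the `restaurant` and `item` columns have not been deleted yet.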

I plan to use classification to predict whether a fast food item is deemed healthy or not (a binary 0-1 response).
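A binary response like this has to be built from the numeric columns first. Here is a minimal sketch of one way to do it, assuming a hypothetical cutoff of 500 calories. The dataset does not define "healthy", so both the cutoff and the toy rows are invented for illustration.

```python
import pandas as pd

# Hypothetical cutoff: the source does not define "healthy", so 500 calories
# is an assumption made only for this example.
CALORIE_CUTOFF = 500

toy = pd.DataFrame({
    "item": ["salad bowl", "double burger", "wrap"],
    "calories": [320, 850, 500],
})

# 1 = deemed healthy (at or below the cutoff), 0 = not healthy.
toy["healthy"] = (toy["calories"] <= CALORIE_CUTOFF).astype(int)
print(toy)
```

A rule based on calories alone is crude. A fuller definition would probably also use sodium or saturated fat.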

Prepare

Perform EDA: Exploratory Data Analysis that would help you address some of your questions above. This includes tables, numerical summaries and graphs. Finally, split your data into a training set and a test set and remove variables that are not informative or redundant.

Use Pandas, seaborn, plotly, matplotlib and Scikit learn to implement this step

import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
fastfood = pd.read_csv('/content/fast food.csv')
#delete the salad column as redundant
del fastfood['salad']
fastfood.head()
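#There are missing values coded as NA in the dataset.
#Therefore I will fill each missing value with the mean of its column.
#numeric_only=True skips the text columns (restaurant, item), which have no mean
fastfood = fastfood.fillna(value=fastfood.mean(numeric_only=True))
fastfood.tail()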
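The mean-imputation step above can be checked on a tiny made-up frame. The column names `item`, `fiber`, and `protein` echo the dataset, but the values are invented. The sketch passes `numeric_only=True` because pandas 2.0 and later raise a `TypeError` when `mean()` is called on a frame that contains text columns.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; the values are not from the real dataset.
toy = pd.DataFrame({
    "item": ["a", "b", "c"],          # text column, left untouched
    "fiber": [2.0, np.nan, 4.0],      # one missing value
    "protein": [10.0, 20.0, np.nan],  # one missing value
})

# Means are computed for numeric columns only.
column_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column using the value under its own label.
filled = toy.fillna(value=column_means)
print(filled)
```

Each gap is replaced by the mean of the observed values in the same column (3.0 for `fiber`, 15.0 for `protein`), and the text column is unchanged.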
fastfood.describe()
#Shows how many items are being served by each restaurant
fastfood.groupby('restaurant')['restaurant'].count()
restaurant
Arbys           55
Burger King     70
Chick Fil-A     27
Dairy Queen     42
Mcdonalds       57
Sonic           53
Subway          96
Taco Bell      115
Name: restaurant, dtype: int64
import plotly.express as px
df = fastfood
fig = px.scatter(df, x="calories", y="total_fat", color="restaurant", symbol="restaurant")
fig.show()
'''
groups = fastfood.groupby('restaurant')
for name, group in groups:
    plt.plot(group.calories, group.total_fat, marker='o', linestyle='', markersize=9, label=name)
plt.legend()
'''

Graph 1

Based on the graphs above and below this text box, I realized that a classification algorithm (as planned in the Understand phase) likely won't work for this data. The items from every restaurant are lumped closely together with minimal clustering, which makes classification difficult. Since the data shows an upward trend, I believe linear regression is the model best suited to this scenario.
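The strength of an upward trend like this can be put into a number with the Pearson correlation coefficient. The sketch below uses made-up calorie and fat values, not values from the dataset. On the real data the same call could be applied to `fastfood.calories` and `fastfood.total_fat`.

```python
import numpy as np

# Made-up values for illustration; not taken from the dataset.
calories = np.array([200, 350, 500, 650, 800], dtype=float)
total_fat = np.array([8, 15, 24, 30, 41], dtype=float)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
# is the Pearson correlation between the two variables.
r = np.corrcoef(calories, total_fat)[0, 1]
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship, which is the situation where a linear model is a reasonable first choice.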

df2 = fastfood
fig2 = px.scatter(df2, x="calories", y="total_fat", color="item", symbol="restaurant")
fig2.show()

Graph 2

#Delete the restaurant and item columns as they won't work when splitting the data for prediction
del fastfood['restaurant']
del fastfood['item']
#Split data into train and test set
y = fastfood.calories #Response variable (what we predict)
x = fastfood.drop('calories', axis=1) #Predictor variables
from sklearn.model_selection import train_test_split
#20 percent testing data, 80% training
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
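The split above is random, so the rows that land in the test set (and every metric computed from them) change from run to run. Here is a minimal sketch of a reproducible split, assuming a fixed `random_state`. That argument is not used in the original, and the toy arrays stand in for the real columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy rows with two features each; placeholders for the real columns.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# random_state is an addition not used in the original; any fixed integer works.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Repeating the call with the same seed gives the same partition.
_, _, _, y_test_again = train_test_split(x, y, test_size=0.2, random_state=42)

print(x_train.shape, x_test.shape)
print(np.array_equal(y_test, y_test_again))
```

With 10 rows and `test_size=0.2`, 8 rows go to training and 2 to testing.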

Analyze

Identify an algorithm appropriate for your analysis. Choices include:

Regression: Linear Regression, K-Nearest Neighbors Regression, Elastic Net (Lasso/Ridge Regression), Neural Networks

Classification: Logistic Regression, K-Nearest Neighbors Classification, Random Forest, Neural Networks

Use statsmodels and Scikit learn to implement this step

https://www.freecodecamp.org/news/how-to-build-and-train-linear-and-logistic-regression-ml-models-in-python/

#Building and training the model. We will be using linear regression
#as this is appropriate for predicting a continuous variable: calories
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
pd.DataFrame(model.coef_, x.columns, columns = ['Coeff'])
                Coeff
Unnamed: 0   0.000510
cal_fat      0.834085
total_fat    0.952933
sat_fat      0.604979
trans_fat    2.320083
cholesterol -0.010765
sodium       0.009695
total_carb   3.928657
fiber        0.030934
sugar        0.108872
protein      3.785856
vit_a       -0.063226
vit_c        0.005903
calcium     -0.037546
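One way to sanity-check these coefficients is against the general Atwater factors, the standard rule of thumb that fat supplies about 9 kcal per gram and carbohydrate and protein about 4 kcal per gram each. The sketch below copies four coefficients from the table above. It assumes that `cal_fat` is roughly 9 times `total_fat`, which is an assumption about how that column was computed and is not stated in the source. It also ignores the overlapping `sat_fat` and `trans_fat` columns.

```python
# Coefficients copied from the table above.
coef = {
    "cal_fat": 0.834085,
    "total_fat": 0.952933,
    "total_carb": 3.928657,
    "protein": 3.785856,
}

# General Atwater factors in kcal per gram (standard nutrition rule of thumb).
atwater = {"fat": 9.0, "carb": 4.0, "protein": 4.0}

# Assumption: cal_fat = 9 * total_fat, so one extra gram of fat raises
# cal_fat by 9 and total_fat by 1 at the same time.
fat_effect = coef["cal_fat"] * atwater["fat"] + coef["total_fat"]

print("fat:    ", round(fat_effect, 2), "vs", atwater["fat"])
print("carb:   ", round(coef["total_carb"], 2), "vs", atwater["carb"])
print("protein:", round(coef["protein"], 2), "vs", atwater["protein"])
```

Under that assumption the model credits roughly 8.46 calories per gram of fat, 3.93 per gram of carbohydrate, and 3.79 per gram of protein. All three are close to the 9/4/4 rule, which suggests the regression is largely recovering how calories are computed from macronutrients.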
#Making predictions with the model
predictions = model.predict(x_test)
plt.scatter(y_test, predictions)

Graph 3

plt.hist(y_test - predictions)
#Testing model performance using mean absolute error, mean squared error, root mean squared error
from sklearn import metrics
#8.60, the average absolute difference between predicted and actual values,
#used to assess the effectiveness of the regression model
metrics.mean_absolute_error(y_test, predictions)
#On average, the predicted values are off by about 8.6 calories from the true values, not too bad
#150.48, this looks high, but MSE is in squared units (calories squared),
#so it is expected to be much larger than MAE
metrics.mean_squared_error(y_test, predictions)
#12.27, the square root of the MSE. RMSE is a measure of the average
#deviation of predicted values from actual values, and a lower value indicates a better fit of the regression to the data
np.sqrt(metrics.mean_squared_error(y_test, predictions))
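The three metrics above are simple functions of the residuals, and computing them by hand shows how they relate. The sketch below uses toy values, not the model's actual predictions.

```python
import numpy as np

# Toy values for illustration; not the model's actual predictions.
y_true = np.array([300.0, 450.0, 520.0, 610.0])
y_pred = np.array([310.0, 440.0, 500.0, 640.0])

errors = y_true - y_pred        # residuals: -10, 10, 20, -30
mae = np.mean(np.abs(errors))   # mean absolute error, in calories
mse = np.mean(errors ** 2)      # mean squared error, in calories squared
rmse = np.sqrt(mse)             # root mean squared error, back in calories

print(mae, mse, rmse)
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAE. The reported values follow the same pattern: the square root of 150.48 is about 12.27, which is above the MAE of 8.60.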

Deploy

I think the model did fine with its predictions. I have never done regression in Python, so interpreting the output is a little different from doing so in R. But from the results as well as the graphs, I can say that there is a positive correlation between calories (our dependent variable) and total fat, and a positive relationship between calories and several of the other independent variables (saturated fat, sodium, etc.).