Data Narrative: Tennis Major Tournaments Dataset

This repo presents a Data Narrative prepared as part of a course project. The analysis applies clustering, classification, and regression using machine learning models from Python's scikit-learn library. The repo also includes a written report on the same work, compiled in LaTeX.

The Overview

The Narrative was part of ES-114: Probability, Statistics, and Data Visualization, a course offered to first-year B.Tech students for the first time in April '23 at the Indian Institute of Technology, Gandhinagar. The machine learning course ES-113: Data Centric Computing at IIT Gandhinagar was conducted in conjunction with this course. The Narrative analyses the Tennis Major Tournaments dataset using clustering, classification, and regression. Several probabilistic and data-analytic tools, such as covariance matrices, correlation coefficients, and Principal Component Analysis (PCA), were employed in the narrative.

Libraries used and Analysis Approach

The following is the list of Python Libraries used for the narrative:

Pandas
NumPy
Scikit-Learn

sklearn.naive_bayes.GaussianNB

sklearn.preprocessing.StandardScaler

sklearn.decomposition.PCA

sklearn.cluster.KMeans

sklearn.metrics.accuracy_score

sklearn.model_selection.train_test_split

sklearn.model_selection.cross_val_score

Plotly

plotly.express.scatter()

plotly.express.box()

plotly.express.scatter_3d()

plotly.express.bar()

The main aim of the analysis was to predict the target (win: 0/1) for new match results from the given features.
To prepare the data for training, the number of features had to be reduced.
Correlation among the features was analysed, and redundant features with high correlation were dropped.
With fewer features, the data was used to train a Gaussian Naive Bayes classifier.
The accuracy of this model was tested using 5-fold cross-validation. The model gave a high accuracy score of 0.9049230769230769.
KMeans clustering was also applied as an unsupervised approach.
Two clusters were formed, and the cluster assignments were compared with the original labels to measure accuracy.
To visualize the clustering, the features were reduced to 3 dimensions using Principal Component Analysis (PCA), so that the clusters could be plotted in 3D.
Finally, Pandas and NumPy were used to extract key statistics from the data, which were then plotted using Plotly graphs.
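
The supervised part of the approach above (dropping highly correlated features, then scoring a Gaussian Naive Bayes classifier with 5-fold cross-validation) can be sketched roughly as follows. This is only an illustration and not the notebook's actual code: the data is a synthetic stand-in generated with make_classification, and the helper name drop_correlated, the column names, and the 0.9 correlation threshold are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


def drop_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later feature of every pair whose |Pearson r| exceeds threshold.

    Hypothetical helper; the 0.9 default is an assumed cut-off.
    """
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


# Synthetic stand-in for the tennis features and the win (0/1) target.
X_raw, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_raw, columns=[f"f{i}" for i in range(8)])
X["f0_copy"] = X["f0"] * 2.0 + 0.01  # deliberately redundant feature

X_reduced = drop_correlated(X, threshold=0.9)

# 5-fold cross-validated accuracy of a Gaussian Naive Bayes classifier.
scores = cross_val_score(GaussianNB(), X_reduced, y, cv=5, scoring="accuracy")
print("Kept features:", list(X_reduced.columns))
print("Mean 5-fold accuracy:", scores.mean())
```

The score printed here describes the synthetic data only and is not comparable to the accuracy reported above for the tennis dataset.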

Please refer to the Colab link to view the interactive Plotly graphs.
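
For readers without access to the notebook, the unsupervised part of the approach (two KMeans clusters compared against the known labels, then a PCA projection to 3 components for a 3D plot) might look something like the sketch below. It is an assumption-laden illustration rather than the original code: the data comes from make_blobs instead of the tennis dataset, scaling with StandardScaler before clustering is an assumed placement, and plot_clusters_3d is a hypothetical helper.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 6 numeric features with a binary label.
X, y = make_blobs(n_samples=300, centers=2, n_features=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Form two clusters without looking at the labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X_scaled)

# Cluster ids are arbitrary, so score both possible labelings and keep the better one.
agreement = max(accuracy_score(y, clusters), accuracy_score(y, 1 - clusters))
print("Agreement with original labels:", agreement)

# Reduce to 3 principal components for plotting.
components = PCA(n_components=3).fit_transform(X_scaled)


def plot_clusters_3d(components, clusters):
    """Hypothetical helper: interactive 3D scatter of the PCA-reduced clusters."""
    import plotly.express as px  # imported lazily so the rest runs without Plotly

    frame = pd.DataFrame(components, columns=["PC1", "PC2", "PC3"])
    frame["cluster"] = clusters.astype(str)
    return px.scatter_3d(frame, x="PC1", y="PC2", z="PC3", color="cluster")
```

Calling plot_clusters_3d(components, clusters).show() opens the interactive figure when Plotly is installed.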
