Photo by Dean Bennett on Unsplash

A tutorial, with full code, on predicting the NBA’s next MVP with machine learning. It starts from scratch and covers every step: data collection, cleaning and merging, statistical modeling, and model training.

Full Code

Table of Contents

1. Data Collection

The data comes from the NBA’s official website, which has built a comprehensive database of tabular data such as players’ career stats, game box scores, and team performance. In this article, I use a popular open source Python package called nba-api to extract data from the database. The package has implemented the…
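As a rough illustration of the extraction step described above, here is a minimal sketch of pulling one player's career stats with nba-api. It is not the article's own code: the function names `fetch_career_stats` and `per_game` are hypothetical, the choice of the `PlayerCareerStats` endpoint is an assumption, and the call needs the `nba_api` package and network access.

```python
def per_game(total, games_played):
    """Convert a season total (e.g. points) into a per-game average."""
    if games_played <= 0:
        return 0.0
    return total / games_played


def fetch_career_stats(full_name):
    """Return a DataFrame of season-by-season regular-season stats for one player.

    Hypothetical helper: requires `pip install nba_api` and network access.
    """
    # Imported lazily so the helpers above work without nba_api installed.
    from nba_api.stats.static import players
    from nba_api.stats.endpoints import playercareerstats

    # Look up the player's numeric ID from the package's static player list.
    matches = players.find_players_by_full_name(full_name)
    if not matches:
        raise ValueError(f"no player found for {full_name!r}")
    player_id = matches[0]["id"]

    # Query the career stats endpoint; the first DataFrame holds
    # regular-season rows, one per season and team.
    career = playercareerstats.PlayerCareerStats(player_id=player_id)
    return career.get_data_frames()[0]
```

Calling `fetch_career_stats("Stephen Curry")` (the player name is only an example) should return one row per season, and `per_game` can then turn a totals column into a per-game feature for modeling.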


Photo by Loverna Journey on Unsplash

At Social Impact Analytics Institute, we do a lot of text extraction on PDF files. Problems arise when the PDF files are scanned documents, because that means general extraction libraries like Pdfminer, PyPDF2, or PyMuPDF are not able to extract text correctly. In order to read the files, make the text content of the files searchable, and be able to do further NLP data analysis, an OCR process must be used.

In this article, I’m going to demonstrate how to use an open source OCR (Optical Character Recognition) engine called Tesseract and its Python APIs to conduct text extraction and…
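To make the OCR step concrete, here is a minimal sketch of reading a scanned PDF with Tesseract from Python. It is an assumption-laden example rather than the article's method: it uses the `pytesseract` wrapper and the `pdf2image` package (which the text above does not name), both of which need the Tesseract and Poppler binaries installed, and the function names `ocr_scanned_pdf` and `normalize_whitespace` are hypothetical.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace (newlines, form feeds, tabs) into single spaces."""
    return " ".join(text.split())


def ocr_scanned_pdf(pdf_path, dpi=300, lang="eng"):
    """Return a list of OCR'd text strings, one per page of a scanned PDF.

    Hypothetical helper: requires pytesseract, pdf2image, and the
    Tesseract and Poppler binaries on the system.
    """
    # Imported lazily so normalize_whitespace works without these packages.
    import pytesseract
    from pdf2image import convert_from_path

    # Render each PDF page to a PIL image; a higher dpi usually helps OCR accuracy.
    pages = convert_from_path(pdf_path, dpi=dpi)

    # Run Tesseract on each page image and tidy the raw output.
    return [
        normalize_whitespace(pytesseract.image_to_string(page, lang=lang))
        for page in pages
    ]
```

Calling `ocr_scanned_pdf("scanned.pdf")` (a placeholder path) would give one searchable string per page, which can then feed into further NLP analysis.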
