I need to find the most predictive keywords and/or phrases to accurately classify posts from the dating advice and relationship advice subreddits, so that we can use them to decide which ads should populate on each page. Because this is a classification problem, we'll use Logistic Regression and Bayes models. Misclassification in this situation is fairly benign, so I will use the accuracy score, with a baseline of 63.3%, to rate success. Using TfidfVectorizer, I'll get the feature importance to determine which words have the highest predictive power for the target variable. If successful, this model could also be used to target other pages that show a similar frequency of the same words and phrases.
See the dating-advice-scrape and relationship-advice-scrape notebooks for this part.
After turning all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of this repo.
Data Cleaning and EDA
- dropped rows with a null selftext column because those rows are useless to me.
- combined the title and selftext columns into one new all_text column
- examined distributions of word counts for the title and selftext columns per post and compared the two subreddits.
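The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the sample rows are made up, and the `subreddit` column name and the two word-count column names are assumptions. Only `title`, `selftext`, and `all_text` come from the notes above.

```python
import pandas as pd

# Tiny made-up sample standing in for the scraped posts (hypothetical data).
posts = pd.DataFrame({
    "title": ["First date tips", "Need advice", "Texting etiquette"],
    "selftext": ["Where should we go?", None, "How long should I wait to reply?"],
    "subreddit": ["dating_advice", "relationship_advice", "dating_advice"],
})

# Drop rows whose selftext is null.
posts = posts.dropna(subset=["selftext"]).reset_index(drop=True)

# Combine title and selftext into a single all_text column.
posts["all_text"] = posts["title"] + " " + posts["selftext"]

# Word counts per post (split on whitespace) for title and selftext.
posts["title_word_count"] = posts["title"].str.split().str.len()
posts["selftext_word_count"] = posts["selftext"].str.split().str.len()

# Summary statistics of the word counts, one row per subreddit.
word_count_summary = posts.groupby("subreddit")[
    ["title_word_count", "selftext_word_count"]
].describe()
print(word_count_summary)
```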
Preprocessing and Modeling
Found a baseline accuracy score of 0.633, which means that if I always pick the value that occurs most often, I will be right 63.3% of the time.
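As a small illustration of how such a baseline is computed: it is just the share of the most frequent class. The labels below are made up (19 of 30 posts in one class, which is about 63.3%), and which subreddit was actually the majority class is an assumption.

```python
import pandas as pd

# Hypothetical target labels; the class proportions are chosen only to
# reproduce a 63.3% majority share.
y = pd.Series(["relationship_advice"] * 19 + ["dating_advice"] * 11)

# Baseline accuracy = proportion of the most frequent class.
baseline = y.value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline:.3f}")
```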
First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99% | test: 75% | cross val: 74%
Second attempt: tried CountVectorizer with stemmer preprocessing on the first set of scraped data; a pretty bad score with high variance. Train: 99% | test: 72%
- tried to decrease max features and the score got a lot worse
- tried lemmatizer preprocessing instead and the test score went up to 74%
Simply increasing the amount of data and stratifying y in my train/test split increased my cvec test score to 81% and cross val to 80%. Adding two parameters to my CountVectorizer helped a lot: a min_df of 3 and an ngram_range of (1,2) increased my test score to 83.2% and cross val to 82.3%. However, these scores did not last.
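A minimal sketch of that setup, assuming a scikit-learn pipeline: a stratified split, then CountVectorizer with `min_df=3` and `ngram_range=(1, 2)` feeding a logistic regression. The toy corpus, `random_state`, and `max_iter` are assumptions for illustration. The sentences are repeated so that `min_df=3` leaves a non-empty vocabulary, which also means the scores printed here say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy corpus (hypothetical); 0 = dating advice, 1 = relationship advice.
dating = [
    "first date ideas for this weekend",
    "how do i ask her out on a first date",
    "nervous about my first date tomorrow",
    "what to text after a first date",
] * 5
relationship = [
    "my boyfriend and i keep fighting",
    "my girlfriend wants to move in together",
    "my boyfriend forgot our anniversary",
    "should i forgive my girlfriend",
] * 5
X = dating + relationship
y = [0] * len(dating) + [1] * len(relationship)

# stratify=y keeps the class proportions the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# min_df=3 drops terms seen in fewer than 3 documents;
# ngram_range=(1, 2) keeps both single words and bigrams.
model = make_pipeline(
    CountVectorizer(min_df=3, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
print(f"train {train_score:.3f} | test {test_score:.3f} | cross val {cv_score:.3f}")
```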
I think Tfidf worked best at reducing my overfitting (a variance problem) because I customized the stop words to remove only the ones that were really too frequent to be predictive. This was a success; however, with more time I probably could have tweaked them a bit more to improve all scores. Looking at both single words and words in groups of two (bigrams) was the best parameter that the grid search suggested, although all of my top most predictive terms ended up being unigrams. My original list of features had plenty of gibberish words and typos. Raising the minimum number of times a word had to appear to 2 helped get rid of these. The grid search also suggested a max_df of 90%, which helped remove oversaturated words as well. Lastly, setting max features to 5000 cut my columns down to about a quarter of what they were, keeping only the most frequently used words of what was left.
Summary and Recommendations
Even though I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely several words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially lucrative pages. I found it interesting that taking out the overly used words helped with overfitting but brought the accuracy score down. I think there is probably still room to play around with the parameters of the Tfidf Vectorizer to see if different stop words make a difference.
Used Reddit's API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice and Relationship Advice, and trained a binary classification model to predict which subreddit a given post originated from.