Predicting Citation Counts in Python and Stata

This code package provides a pre-trained ridge regression model that predicts the citations of papers based on their titles. The model was trained on data from the Clarivate Web of Science for the years 1900 to 2010. The approach is based on the paper “Gender Gaps in Academia: Global Evidence Over the Twentieth Century” by Alessandro Iaria, Carlo Schwarz, and Fabian Waldinger (2022). See Appendix D of the paper for further details on the ridge regression model.

In contrast to the original paper, the model predicts the log number of citations and does not use separate models by subject and year. Nonetheless, the model performs very well at predicting citation counts and achieves an out-of-sample R2 score above 0.2.
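As background, the general idea of a ridge text regression can be sketched in a few lines of scikit-learn. The titles, citation numbers, vectorizer, and penalty below are illustrative assumptions for exposition, not the authors' actual training pipeline:

```python
# Minimal sketch of a ridge text regression: bag-of-words title
# features mapped to log citation counts (illustrative data only).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

titles = [
    "gender gaps in academia global evidence",
    "a theory of economic growth",
    "machine learning for text regression",
    "estimating citation counts from titles",
]
log_citations = np.log1p([120, 45, 10, 3])  # made-up citation numbers

cv = CountVectorizer()                # title -> word-count features
X = cv.fit_transform(titles)

model = Ridge(alpha=1.0)              # L2-penalized linear regression
model.fit(X, log_citations)

# prediction for a new title follows the same cv.transform / model.predict
# pattern used with the pre-trained model further below
pred = model.predict(cv.transform(["new evidence on gender gaps"]))
```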

“Estimating Text Regressions using txtreg_train” by Schwarz (2022) additionally provides a complete set of Stata commands to estimate text regressions, which also makes it possible to use the pre-trained model in Stata. Note that for the classifier to work in Stata, you require Stata 16 as well as a working Python installation (e.g., Anaconda). The model and code are provided for public use without any warranty. Please cite Iaria, Schwarz, and Waldinger (2022) and Schwarz (2022) if you use the code. The pre-trained model and the Stata packages can be downloaded here:

[Code Files][Predicted Citation Model]

To facilitate the use of the pre-trained model, I provide some code examples for both Python and Stata below:

Code Example Python:

import pandas as pd
from nltk.stem import SnowballStemmer
import pickle

# load the example data and keep the title column as a pandas Series
df = pd.read_stata("C:/Users/WoS_all_cite_prediction_example.dta")
X = df.title

model_path = "C:/Users/Models/predicted_citation_all.pkl"

print("Loading Model: {}".format(model_path))
with open(model_path, "rb") as f:
    cv = pickle.load(f)       # fitted vectorizer
    model = pickle.load(f)    # fitted ridge regression model

stemmer = SnowballStemmer("english")

def stem(string):
    "splits a string into words, stems each word, and rejoins them"
    stems = [stemmer.stem(word) for word in string.lower().split()]
    return " ".join(stems)

print("Stemming text")
X = X.apply(stem)

# predict (log) citations for the stemmed titles
prediction = model.predict(cv.transform(X))
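Since the model predicts the log number of citations, you may want predictions back on the count scale. The snippet below is a hypothetical illustration assuming the target was log(1 + citations); check how the log was taken when the model was trained before relying on this transformation:

```python
import numpy as np

# stand-in values for the output of model.predict above
log_pred = np.array([2.3, 0.7, 4.1])

# if the target was log(1 + citations), expm1 inverts the transformation
counts = np.expm1(log_pred)
```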

Code Example Stata:

version 16
clear all 

global path = "C:/Users/"

* load example data
use "$path/WoS_all_cite_prediction_example.dta"

* predict citations
textreg_predict title using "$path/Models/predicted_citation_all.pkl" , name_new_var("predicted_citation") stem stem_Lang("english")

Published Stata Commands

ldagibbs: A command for Topic Modeling in Stata using Latent Dirichlet Allocation
[Stata Journal 18(1), pp. 101–117, 2018.] [Code Files]

This paper introduces the ldagibbs command which implements Latent Dirichlet Allocation in Stata. Latent Dirichlet Allocation is the most popular machine learning topic model. Topic models automatically cluster text documents into a user chosen number of topics. Latent Dirichlet Allocation represents each document as a probability distribution over topics, and each topic as a probability distribution over words. Thereby, Latent Dirichlet Allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
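To illustrate what a topic model produces, the sketch below runs Latent Dirichlet Allocation in Python with scikit-learn on a few toy documents. This is not the ldagibbs command itself, and scikit-learn fits LDA by variational inference rather than the collapsed Gibbs sampler ldagibbs implements; the documents and topic number are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks bonds markets trading returns",
    "elections votes policy government law",
    "markets trading volatility returns risk",
    "government policy elections reform law",
]

cv = CountVectorizer()
dtm = cv.fit_transform(docs)          # document-term matrix

# cluster the documents into a user-chosen number of topics (here 2)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)   # rows: per-document topic probabilities
```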

lsemantica: A Stata Command for Text Similarity based on Latent Semantic Analysis 
[Stata Journal 19(1), pp. 129–142, 2019.] [Code Files]

The lsemantica command, presented in this paper, implements Latent Semantic Analysis in Stata. Latent Semantic Analysis is a machine learning algorithm for word and text similarity comparison. Latent Semantic Analysis uses Truncated Singular Value Decomposition to derive the hidden semantic relationships between words and texts. lsemantica provides a simple command for Latent Semantic Analysis as well as complementary commands for text similarity comparison.
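The core mechanics of Latent Semantic Analysis can likewise be sketched in Python: build a term-document matrix, reduce it with Truncated Singular Value Decomposition, and compare documents by cosine similarity in the reduced space. Again, this is an illustrative analogue with made-up documents, not the lsemantica command:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat rested on a mat",
    "stock prices rose sharply today",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Truncated SVD projects the sparse term-document matrix into a
# low-dimensional "semantic" space
svd = TruncatedSVD(n_components=2, random_state=0)
embedding = svd.fit_transform(tfidf)

# cosine similarity in the reduced space: the two cat sentences
# end up closer to each other than to the finance sentence
sims = cosine_similarity(embedding)
```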

Stata Utility Functions

The following Stata commands were written to avoid having to open Excel or CSV files just to merge or append them. The commands are provided without a help file, but they combine the syntax of import delimited/excel and merge. The only major difference is that the merge level (e.g., 1:1) has to be specified in the “how” option. The code below shows examples for the four commands.

Code Examples:

[Do Files]

merge_csv id using "file_path" , how("1:1") bindquote("strict") varnames(1) case("preserve")  encoding("utf8") keepusing(varlist) nogenerate

merge_excel id using "file_path", how("m:1")  sheet("sheetname") firstrow 

append_csv using "file_path", bindquote("strict") varnames(1) case("preserve")  encoding("utf8") force

append_excel using "file_path",  sheet("sheetname") firstrow
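For readers working in Python, the pattern that merge_csv wraps into one step can be sketched with pandas: read a delimited file and merge it on a key, with the merge level enforced. The file contents and column names below are hypothetical stand-ins:

```python
import io
import pandas as pd

# master data already in memory
master = pd.DataFrame({"id": [1, 2, 3], "x": [10, 20, 30]})

# stands in for the CSV at "file_path"
csv_file = io.StringIO("id,y\n1,a\n2,b\n3,c\n")
using = pd.read_csv(csv_file)

# validate="1:1" enforces the merge level, like the how("1:1") option;
# indicator=True adds a column playing the role of Stata's _merge variable
merged = master.merge(using, on="id", how="outer",
                      validate="1:1", indicator=True)
```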