There are many GitHub repositories and some are more popular than others. A Git repository tracks and saves the history of all changes made to the files in a Git project. It saves this data in a directory called . git, also known as the repository folder. Git uses a version control system to track all changes made to the project and save them in the repository. In this project, we will be scraping GitHub repositories and organizing the information in a Pandas data frame. After that, we will use linear regressions to gain meaningful insights into the data we collected. Our goal is to identify properties that make a repository popular which will give insights into future uses of repositories.
We will be scraping data from this repository list.
Before we can start to scrape any website we should go through the terms of service and policy documents of the website. Almost all websites post conditions to use their data. Here are the terms of https://github.com/. In our case, we are allowed to scrape the repository, but all use of GitHub data gathered through scraping must comply with the GitHub Privacy Statement.
We avoided any problems with GitHub blocking us from downloading the data by saving all the HTML files in the data folder. The path to the data folder is stored in the DATA_PATH
variable. Additionally, the data folder contains highly starred repositories saved as:
searchPage1.html
, searchPage2.html
, searchPage3.html
... searchPage10.html
Python Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open(DATA_PATH + "/searchPage1.html"), "html.parser")
for i in range(2, 10):
html = (DATA_PATH + "/searchPage" + str(i) + ".html")
soup.append(BeautifulSoup(open(html, encoding = "utf8"), "html.parser"))
We extracted the following data:
In our preliminary analysis, we will visualize the data with a correlation and scatterplot matrix.
stars | watches | forks | commits | branches | contributors | issues | readme | |
---|---|---|---|---|---|---|---|---|
stars | 1.000000 | 0.718511 | 0.467899 | 0.050483 | -0.030127 | 0.195937 | -0.022076 | -0.068224 |
watches | 0.718511 | 1.000000 | 0.713677 | 0.249784 | -0.034666 | 0.334194 | -0.038269 | -0.025445 |
forks | 0.467899 | 0.713677 | 1.000000 | 0.171526 | -0.014534 | 0.263581 | 0.049471 | -0.148933 |
commits | 0.050483 | 0.249784 | 0.171526 | 1.000000 | 0.077237 | 0.933173 | 0.061822 | -0.086813 |
branches | -0.030127 | -0.034666 | -0.014534 | 0.077237 | 1.000000 | 0.026588 | 0.438276 | -0.143878 |
contributors | 0.195937 | 0.334194 | 0.263581 | 0.933173 | 0.026588 | 1.000000 | 0.070621 | -0.052528 |
issues | -0.022076 | -0.038269 | 0.049471 | 0.061822 | 0.438276 | 0.070621 | 1.000000 | -0.139192 |
readme | -0.068224 | -0.025445 | -0.148933 | -0.086813 | -0.143878 | -0.052528 | -0.139192 | 1.000000 |
Python Code:
import seaborn as sns
rel = sns.pairplot(project_info, diag_kind="kde")
rel.fig.subplots_adjust(top=.95)
rel.fig.suptitle('Scatterplot Matrix', fontsize = 20)
From these charts, we can see that there is a positive correlation between commits and contributions with the correlation being 0.933. Additionally, there is a positive correlation between the number of forks and the number of watches with the correlation being 0.71. This number is approximately similar to the correlation between the number of watches and the number of stars. All other variables don't have much of a correlation with one another.
Now we will use linear regression to try to predict the number of Stars based on Forks, Contributors, Issues, and README Length.
Python Code:
uni_mod = sm.ols(formula="stars ~ forks + contributors + issues + readme", data = project_info)
uni_result = uni_mod.fit()
Output:
OLS Regression Results
==============================================================================
Dep. Variable: stars R-squared: 0.227
Model: OLS Adj. R-squared: 0.191
Method: Least Squares F-statistic: 6.244
Date: Fri, 15 Jul 2022 Prob (F-statistic): 0.000187
Time: 14:41:51 Log-Likelihood: -1071.6
No. Observations: 90 AIC: 2153.
Df Residuals: 85 BIC: 2166.
Df Model: 4
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 4.738e+04 7345.923 6.449 0.000 3.28e+04 6.2e+04
forks 1.4623 0.326 4.491 0.000 0.815 2.110
contributors 1.8914 2.310 0.819 0.415 -2.701 6.483
issues -1.6891 3.223 -0.524 0.602 -8.097 4.719
readme -0.0058 0.134 -0.043 0.965 -0.272 0.260
================================================================================
Omnibus: 104.465 Durbin-Watson: 0.442
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1509.964
Skew: 3.814 Prob(JB): 0.00
Kurtosis: 21.560 Cond. No. 6.49e+04
================================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.49e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
We will also use linear regression to try to predict the number of Stars based on Forks, Contributors, Watches, Commits, and README Length.
Python Code:
uni_mod = sm.ols(formula="stars ~ forks + contributors + watches + commits + readme", data = project_info)
uni_result = uni_mod.fit()
Output:
OLS Regression Results
==============================================================================
Dep. Variable: stars R-squared: 0.598
Model: OLS Adj. R-squared: 0.574
Method: Least Squares F-statistic: 24.94
Date: Fri, 15 Jul 2022 Prob (F-statistic): 2.54e-15
Time: 14:00:23 Log-Likelihood: -1042.3
No. Observations: 90 AIC: 2097.
Df Residuals: 84 BIC: 2112.
Df Model: 5
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 2.285e+04 5718.815 3.996 0.000 1.15e+04 3.42e+04
forks -0.5362 0.332 -1.616 0.110 -1.196 0.124
contributors 15.9343 4.764 3.345 0.001 6.461 25.407
watches 17.7592 2.275 7.805 0.000 13.234 22.284
commits -0.3285 0.085 -3.848 0.000 -0.498 -0.159
readme -0.1416 0.098 -1.448 0.151 -0.336 0.053
==============================================================================
Omnibus: 72.640 Durbin-Watson: 1.149
Prob(Omnibus): 0.000 Jarque-Bera (JB): 634.454
Skew: 2.397 Prob(JB): 1.70e-138
Kurtosis: 15.092 Cond. No. 1.98e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.98e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
From our regressions, the first model is not a good model because it has a low R-squared of 0.227, the p-values for the variables show that they are insignificant, and the prob f-statistic is small. A low R-squared indicates that the independent variable is not explaining much of the variation of the dependent variable. The variable forks are significant at the 5% level, but all other variables are insignificant. Thus overall, the first model is not a good model. The second model is better because the R-squared is higher than the first at 0.598 so around 60% of the variation is explained. Additionally, the prob f-statistic is smaller than the first model. Furthermore, commits, watches, and contributors are significant at the 5% level. Overall, the second model is better as a result of these differences.
The data from the GitHub repository is always changing and we did not program to actively take in data. Instead, we saved the HTML files to access the data at a certain point in time. Although this is a good enough sample for this analysis, it could potentially be more accurate if there was a program taking in live data from the GitHub repository. It was also difficult and time-consuming to scrape the data through the HTML files. We overcame this through for loops to salvage the code.
As stated above, our goal was to identify properties that make a repository popular which will give insights into future uses of repositories. Our findings conclude that commits, watches, and contributors in repositories are statistically significant. In other words, as the number of commits increases by 1, the number of stars decreases by 0.33 percentage points. Moreover, as the number of watches increases by 1, the number of stars increases by 17.76 percentage points. Finally, as the number of contributors increases by 1, the number of stars increases by 15.93 percentage points. Forks and README Length are both statistically insignificant.