Is there a relationship between popularity of a given technology on Stack Overflow (SO) and Hacker News (HN)? And a few words about causality
dgwozdz
3rd JUNE 2018
Last update: 8th JULY 2018
Stack Overflow and Hacker News are portals mainly (but not only) read and used by programmers and other people who occupy their (professional or free) time with writing code.
Stack Overflow (SO), a portal established in 2008 on which programmers help each other by asking and answering coding questions, lets its users easily find questions related to a certain programming language/framework/library etc. by tags. Questions and replies/comments are evaluated in the form of points, so it is usually instantly obvious which answer was rated the highest (and is therefore considered the best one by the community) or whether a described problem is reproducible, i.e. whether you can replicate it with a piece of code prepared by the person asking the question.
Hacker News (HN) is a portal established in 2007 on which users submit interesting links/stories. Those stories gather points just like questions on SO (however, users cannot downvote stories until they reach a certain karma threshold). Each post can be commented on.
Does one site influence the other? It could be so, but the factors behind it – the causes – would rather be external. Let’s look at a definition of causality:
Causality (also referred to as causation, or cause and effect) is what connects one process (the cause) with another process or state (the effect), where the first is partly responsible for the second, and the second is partly dependent on the first. In general, a process has many causes, which are said to be causal factors for it, and all lie in its past.1
Causality is a phenomenon which people intuitively understand but which is tough to measure with statistical methods. Let’s say you would like to build a model which explains defaults on mortgages. You may use the following data to explain a lack of repayments: clients’ incomes, GDP, clients’ heights, their genders etc. You employ a regression/decision tree/neural network and it seems that the best predictors are sex and height. Does that mean that those variables influence defaults? It may just be a spurious correlation. Those variables may carry other information not directly implied by themselves (e.g. that there is a gap in salaries between genders). You may not have considered other important variables (e.g. number of kids, length of employment, the history of previous loans if it’s available), so the variables you did put in your model appear statistically significant. Or, because such a situation is also possible, the significant variables do indeed influence defaults but you just don’t understand the dependence between those phenomena. Causality seems to be a tough thing to identify.
In the context of SO and HN, the factors influencing both sites could be, for example, the following:
This article examines data from SO and HN trying to answer the question whether an intuitive relationship between the two is reflected in available data. It is a combination of Exploratory Data Analysis (EDA) with some descriptive statistics. It does not try to further investigate or quantify the above mentioned factors influencing both Stack Overflow and Hacker News.
When dealing with the problem of a relationship, an operationalization of the variables to investigate is needed. Here, the phenomenon to be examined was described as “popularity of a given technology”. What is it and how can it be measured in the context of SO and HN? Popularity could be defined as liking of or attraction to a certain person, an idea or, in our case, a technology. In the context of this analysis (SO and HN), you can name at least a few metrics by which you could measure whether something is popular or not. Additionally, they should be measured in a certain time unit, e.g. daily. If the data we would like to analyse can be identified by an ordered time index (the unit is irrelevant), then we are dealing with a time series.
I suggest starting the analysis with available empirical variables such as the number of questions (SO) / posts (HN) for a given programming language and the points gathered by those questions (SO) / topics (HN). Those two are probably the most universal ones. Of course, you could come up with many more variables, for example:
However, there is a small problem with the available variable c): the number of views seems to be irreproducible. SO shows only the number of views a given question has gathered by today’s date, so obtaining this variable at different time points (e.g. on 1st June 2018 and 2nd June 2018) results in different values. Due to this fact, the variable, although interesting, might lead to irreproducible results.
When it comes to obtaining data from Stack Overflow, it was easy to identify which question is assigned to which technology: every question has its tags. The data preprocessing was a bit tougher in the case of Hacker News. A topic related to a certain technology was defined as one in which the name of that technology appears either in the title or in the text (comments were not taken into consideration).
The selected variables – number of questions/topics/points – were usually analysed in four pairs:
I have previously written that all variables should be measured in a certain time unit, which leads to the next issue: how to aggregate data from a certain period? I decided to use a sum as the aggregation function, e.g. the sum of questions which appeared on a certain day. You could come up with, for example, an average. However, the problem with such a metric could be the small samples on the basis of which it would be computed (for example, the mean number of points gathered by questions on SO on a given day when only one or two posts popped up during those 24 hours), which would be unrepresentative.
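As an illustration, the daily sum aggregation could be sketched in pandas as follows (the data and column names here are made up for illustration, not the actual SO/HN data):

```python
import pandas as pd

# Hypothetical question-level data: one row per question with its points.
questions = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2018-01-01 09:00", "2018-01-01 17:30", "2018-01-02 11:15"]),
    "points": [5, 3, 7],
})

# Aggregate with a sum per calendar day; an average over one or two
# posts would be unrepresentative, as noted above.
daily_points = questions.set_index("created_at")["points"].resample("D").sum()
daily_questions = questions.set_index("created_at")["points"].resample("D").count()

print(daily_points.tolist())     # [8, 7]
print(daily_questions.tolist())  # [2, 1]
```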
How to cope with the problem of validating a relationship between two phenomena? The first approach could be EDA – Exploratory Data Analysis, which basically means producing some plots and trying to infer something from them. The plus of this solution is the visual aspect: you can clearly see the trend (or lack thereof) of the popularity of a given programming language, and for most people it is easier to read plots than bare tables. The somewhat hindering side of this method for unravelling causality is its qualitative character – there is no statistic/test indicating whether your conclusions drawn from the plots are correct or not.
The second approach is a quantitative one: Granger causality. Wait a minute, you may ask, there’s a specific type of causality? Basically, yup. Granger causality, proposed in 1969, determines whether one time series is helpful in forecasting another time series. Therefore, this type of causality is called predictive causality.
Note that the question Is one phenomenon a cause of another? is different from what Granger causality measures: here you only take past values of a given variable and try to use them to forecast the future values of another phenomenon, just like when building a forecasting model. That means that Granger causality is not, and does not indicate, causality between two phenomena. Nonetheless, it may indicate a relationship – either one resulting from a third factor influencing the two observed ones, or one in which one variable really is an effect of the other. It is, however, impossible to tell which on the basis of the Granger causality test itself.
The data for this analysis come from two sources: (quite obviously) Stack Overflow and Kaggle (surprise!). Kaggle is a website which organizes competitions for data scientists/analysts whose goal is to build the best predictive model for a given phenomenon based on shared data set(s). It also provides some data sets which are not strictly for competitions but can be used in EDA (Exploratory Data Analysis). The Hacker News data utilised in this analysis come from the latter. The Stack Overflow variables come from queries run in the Stack Exchange Data Explorer, which allows anyone interested to write SQL queries against Stack Overflow as well as other Stack Exchange databases. The data cover the period 15th September 2008 – 31st December 2017.
The programming languages or technologies which were examined include: C, C++, C#, Cobol, CSS, D3.js, R, Delphi, Fortran, Hadoop, HTML, Java, Javascript, JQuery, Pascal, Perl, Python, PHP, Ruby, Rust, Scala, Shell, Spark, SQL, Swift, Tensorflow, VBA. The choice of technologies was arbitrary.
The data from portals were assigned specific colors: grey for Stack Overflow, orange for Hacker News. Those colours are consistent with the ones used later on so that it would be easier to identify the source of data (Stack Overflow or Hacker News).
One of the ideas with regard to examining causality included checking cumulative plots. Cumulative plots show aggregated value of a given measure to a given date. For example, if we have data in such a form:
Date | Value |
---|---|
2018-01-01 | 1 |
2018-01-02 | 2 |
2018-01-03 | 3 |
Then the cumulative value would be sum of all the values up to a given date:
Date | Value | Cumulative value |
---|---|---|
2018-01-01 | 1 | 1 |
2018-01-02 | 2 | 3 |
2018-01-03 | 3 | 6 |
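In pandas, such a cumulative value can be computed with `cumsum` (a sketch over the toy table above):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]),
    "Value": [1, 2, 3],
})

# Running total up to each date, matching the "Cumulative value" column.
df["Cumulative value"] = df["Value"].cumsum()

print(df["Cumulative value"].tolist())  # [1, 3, 6]
```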
As an example, the plot below shows the cumulative variables for C#:
It can be noticed that by the end of 2017 the cumulative number of questions on SO exceeded 50 thousand, while over the same period the number of points for topics with C# on HN reached about 30 thousand.
The plot for C# described above does not seem particularly interesting. It shows (rather obviously) upward trends for both variables; however, their dynamics differ. What is more, it would be nice to see standardized variables, for example transformed in such a way that they both start at 0 and end at 1. Thanks to such a data transformation it is possible to find time series which may be similar in terms of behaviour in time but different in the magnitudes by which they increase (or decrease). The plot for the same phenomena and technology as above but with standardized variables is presented below:
Now it can be noticed that the cumulative number of questions on SO and the cumulative number of points for topics on HN show strong resemblance. Similar resemblance can be seen when comparing standardized plots of cumulative number of questions vs. cumulative number of topics:
Let’s see some interesting similarities between statistics on SO and HN for different technologies on standardized plots. We will only look at the technologies for which I identified some sort of similarity between data from SO and HN or for which I discovered something interesting. Additionally, plots on the left will be the ones for standardised variables, while those on the right show the variables without transformation (standardisation).
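The standardization used for these plots is ordinary min–max scaling; for a nondecreasing cumulative series the minimum is the first value and the maximum is the last, hence the series ends up starting at 0 and ending at 1. A sketch on made-up numbers:

```python
import pandas as pd

cumulative = pd.Series([10, 25, 60, 100])  # a made-up cumulative series

# Min-max scaling: a nondecreasing cumulative series then starts at 0
# and ends at 1, which makes differently scaled series comparable.
standardized = (cumulative - cumulative.min()) / (cumulative.max() - cumulative.min())

print(standardized.iloc[0], standardized.iloc[-1])  # 0.0 1.0
```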
Similarly to C#, there is visible resemblance between cumulative number of questions on SO and cumulative number of points for topics on HN:
as well as in case of cumulative number of questions on SO and cumulative number of topics on HN:
Not surprisingly, in the case of C++ the similarities existing for C and C# are repeated.
Yes, Cobol. I wanted to say (write) good, old Cobol; however, I find such a statement to be a little exaggerated for this, I would say, antique programming language. If you don’t know what Cobol is, its name is an acronym for common business-oriented language, and its syntax bears a high similarity to the English language. The aim of such a design was to be readable both for programmers and for non-technical staff like managers. It was introduced in 1959(!) and was/is used in a variety of environments, including banking and insurance.
The plot below shows a high resemblance of cumulative number of questions on SO and cumulative number of topics with regard to this programming language on HN.
It is worth noticing that over a span of about 9 years fewer than 300 questions appeared on SO, which shows the low popularity, or even the lack of popularity, of this technology nowadays.
Similar resemblance is observed between the cumulative number of questions on SO and the cumulative number of points gathered by the topics on HN.
…and between the cumulative number of questions on SO and the cumulative number of topics on HN:
In the case of the Javascript visualization library D3.js, a resemblance is observed between the cumulative number of points obtained by questions on SO and the cumulative number of points gathered by topics on HN:
For Delphi the cumulative number of questions on SO seems to follow the same trend as the cumulative number of topics on Hacker News:
When it comes to Fortran, the similarity is observed between the cumulative number of questions on SO and the cumulative number of topics on HN.
Similarly to Cobol, this programming language gathered only about 700 questions in over 9 years, which indicates its unpopularity.
In the case of Hadoop, the cumulative number of questions on SO seems to be similar to the cumulative number of points on HN. What’s interesting here is the change of dynamics in 2013: since the middle of that year, the number of questions on SO has grown faster than the number of points on HN.
In the case of HTML there was no resemblance between the variables. Nevertheless, an interesting fact is that since 2014 the number of points for questions on SO stabilizes and later slightly decreases, by about 5%, which is shown on the plot below:
Such a situation occurred due to a greater number of downvotes than upvotes. This may be the result of a high number of duplicates since 2014, or of questions which were not formulated in a clear way or were not reproducible (and were therefore downvoted).
In the case of Java, the resemblance is visible between the cumulative number of questions on SO and the cumulative number of topics on HN:
As in the case of HTML, here the cumulative number of points also levels off in 2014, to fall slightly by the end of 2015. A significant drop is observed by the end of 2017.
In the case of Javascript, the similarity is observed between the cumulative number of questions on SO and the cumulative number of topics on HN…
…as well as between the cumulative number of questions on SO and the cumulative number of points for topics on HN:
In the case of Pascal, the cumulative number of questions on SO tends to resemble the cumulative number of points for topics on HN:
However, in the case of this programming language a situation analogous to HTML and Java is visible: the cumulative number of points reaches its peak between 2013 and 2014. In the later years it plummets, shrinking by half by the end of 2017.
The cumulative number of questions on SO seems to resemble the cumulative number of points on HN…
…and the cumulative number of topics on Hacker News related to Perl.
In the case of PHP, only the drop in the cumulative number of points of questions on SO is visible:
In the case of Python, a visible similarity appears between the cumulative number of questions on SO and the cumulative number of points for topics on HN…
…as well as for the cumulative number of questions on SO and the cumulative number of topics on HN:
The visible similarity appears between the cumulative number of questions on SO and the cumulative number of points for topics on HN.
…as well as between the cumulative number of questions on SO and the cumulative number of topics on HN:
The same situation: the similarity can be identified between the cumulative number of questions on SO and:
In the case of this technology, the variables characterized by similar trends are the cumulative number of questions on SO and the cumulative number of topics on HN:
In addition to that, this technology is also characterized by a decrease of about 20 percent in the cumulative number of points since 2014.
In the case of Swift, there also seems to be a similarity between the cumulative number of questions on SO and:
In the case of Tensorflow, the highest visible resemblance was identified between the cumulative number of questions on SO and the cumulative number of topics on HN:
There are a couple of problems with the above visual identifications:
Granger causality informs whether one variable is useful in forecasting another variable. The test takes the form of performing two linear regressions2: a restricted one, in which the examined variable is explained only by its own past values,

y_t = a_0 + a_1*y_(t-1) + … + a_p*y_(t-p) + e_t

and an unrestricted one, extended with past values of the potential cause:

y_t = a_0 + a_1*y_(t-1) + … + a_p*y_(t-p) + b_1*x_(t-1) + … + b_p*x_(t-p) + e_t

where:

- x_(t-i) – potential cause in period (t-i),
- b_i – coefficient for the i-th autoregressor (i-th past value) of the potential cause,
- e_t – error term.

The null hypothesis is that all the coefficients of the added past values of the potential cause are equal to 0:

H0: b_1 = b_2 = … = b_p = 0

The computed statistic takes the form of an F test:

F = ((RSS_R − RSS_U) / p) / (RSS_U / (n − 2p − 1))

where:

- n – number of observations (elements in the time series),
- p – number of considered past values (lags),
- RSS_R, RSS_U – residual sums of squares of the restricted and unrestricted regressions.
Let’s try to interpret it.
Now we know what we want to use; however, we don’t know whether we can use it. Let me just remind you that we’re dealing with time series. One of the properties of time series is that they can be nonstationary, i.e. the realizations of the examined process may not come from the same distribution over time. Using nonstationary time series in linear regression may lead to so-called spurious regression, in which the modelled dependencies between a regressand and regressors are really nonexistent. In order to avoid such a situation, the time series should first be transformed into stationary ones. This is achieved by differencing.
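Differencing simply replaces each value with its change from the previous period; a minimal sketch on a made-up series:

```python
import pandas as pd

series = pd.Series([3, 5, 8, 12, 17])  # a made-up nonstationary series

# First differences: value_t - value_(t-1); the first element has no
# predecessor and is dropped.
differenced = series.diff().dropna()

print(differenced.tolist())  # [2.0, 3.0, 4.0, 5.0]
```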
How to check whether a given time series is (non)stationary? One possible solution for this task is the Augmented Dickey-Fuller (ADF) test. Its null hypothesis is that the series is nonstationary. There are two versions of the test utilized in this article:
Detecting whether a time series includes a trend is performed by regressing it against time (a series from 1 to n, where n is the number of observations in the examined time series) and checking whether the estimated trend coefficient is statistically significant at the determined significance level (in this case α = 0.05). If the coefficient is statistically significant, the ADF test with constant and trend is used; otherwise, the one with only a constant.
After each time series is transformed (if necessary) into a stationary one, a Granger causality test is performed. Here an important question should be asked: how many autoregressors should be considered? It has been pointed out that this number should depend on the minimum value of an information criterion: in the case of this article, Akaike (AIC) and Bayes (BIC). It has been remarked that the former tends to choose an overfitted and the latter an underfitted model, so neither of them is perfect. Nevertheless, the analysis indicates that they often suggest the same lag. The maximum considered lag was 36 periods (3 years).
The ADF test, differencing and the Granger causality test were performed on data aggregated to a monthly frequency in order to reduce the daily noise in the number of questions/points for the examined variables. In order to check whether Hacker News influences the popularity of a technology on Stack Overflow, the opposite hypotheses were also examined. Statistically significant results, where H0 is rejected (for significance level α = 0.05), are presented in the table below:
No | Technology | Regressand (Y) | Regressor (X; potential cause) | AIC | BIC | max lag | min AIC | min BIC | p-value | Order of differencing of Y | Order of differencing of X |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | swift | Number of posts on HN | Number of points on SO | 188.29 | 226.13 | 13 | True | True | 0.01 | 0 | 0 |
2 | css | Number of posts on HN | Number of points on SO | 123.18 | 292.36 | 36 | True | True | 0 | 1 | 0 |
3 | html | Number of points on HN | Number of questions on SO | 822.30 | 991.48 | 36 | True | True | 0.02 | 1 | 1 |
4 | java | Number of posts on HN | Number of points on SO | 43.06 | 211.26 | 36 | True | True | 0.03 | 1 | 2 |
5 | jQuery | Number of points on SO | Number of points on HN | 269.49 | 438.67 | 36 | True | True | 0 | 1 | 1 |
6 | jQuery | Number of posts on HN | Number of points on SO | 95.03 | 264.20 | 36 | True | True | 0.01 | 1 | 1 |
7 | tensorflow | Number of questions on SO | Number of points on HN | 143.91 | 155.48 | 6 | True | True | 0.04 | 2 | 1 |
Explanations of consecutive columns:
The table with results of Granger causality test suggests a few things:
It’s high time to ask the most important question: do the data confirm that the popularity of a technology on Stack Overflow causes the popularity of that technology on Hacker News?
The question seems to be clear. However, answering it seems to be not an easy job. Nevertheless, let’s try.
As a side-effect of this side-project (sic!), I find it interesting that there are technologies (HTML, Java, SQL, Pascal, PHP) for which the number of negatively rated questions (or the number of negative points for some questions) was so high that the cumulative number of points for those technologies dropped (even by half from its maximum).
To sum up: can the popularity of a technology on Stack Overflow (SO) be related to the popularity of posts about this technology on Hacker News (HN)? The data seem to reflect some kind of relationship between those two portals, although it is worth mentioning that the Exploratory Data Analysis conducted here is (to some extent) subjective; therefore such a conclusion cannot be treated as set in stone. This article definitely does not exhaust the topic.
Further research in this topic may include:
I would like to thank Piotr Migdal for the suggestion of this topic, his mentoring, research ideas and the overall supervision of this analysis. Also, many thanks to Jan Bendarz for suggesting a method of trend detection, Andrzej Drzystek for thoughts on an initial draft and Maciej Pomykała for his valuable remarks. Thanks as well to all the people who improved the text by catching typos or a missing plot. Last but not least, a big Thank you! to all commenters in the thread about this analysis on Hacker News, and really to anyone who showed interest in this article.
1: source: https://en.wikipedia.org/wiki/Causality, access: 02JUN2018 ←
2: source: https://support.sas.com/rnd/app/ets/examples/granger/index.htm, access: 02JUN2018 ←