3.1. Data Collection and Analysis
The questionnaire was designed to include 12 factor-related questions, with all questions and associated references provided in
Appendix B. The survey was open for one month, in February 2023, with the survey link distributed via email and a Slack channel. Participants’ responses were automatically collected and stored in a spreadsheet for later analysis. Prior to analysis, data pre-processing was performed in a Jupyter Notebook to ensure that all data were complete and free of errors. The questionnaire was structured in a closed-answer format, requiring participants to provide numerical (integer) or text (string) responses to each question. The pre-processing phase prepared the data for analysis and included the following steps:
Data Cleaning: This step includes several tasks such as handling missing values, handling duplicates, and finding and correcting typos. For handling missing values, several methods can be used depending on the concepts and variables involved: (1) dropping missing values, which removes all rows or columns that contain missing values, using the dropna() method from the Pandas library; (2) replacing missing values with the mean, median, or mode of the variable, using the fillna() method from the Pandas library; (3) interpolation, which estimates missing values from neighbouring values, using the interpolate() method from the Pandas library; and (4) imputation, which predicts missing values using machine learning techniques such as Random Forest, Support Vector Machines (SVM), and K-Nearest Neighbors (KNN), all of which are available in the scikit-learn library for Python. In this study, based on the concept and needs, missing values were handled by replacing them using the fillna() method from the Pandas library, following the same methodology as our prior research (Imani & Arabnia, 2023); a sketch of this step is given below.
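As an illustration, a minimal sketch of this replacement step is shown below. The DataFrame, the column names, and the choice of the median as the replacement statistic are assumptions made purely for illustration; the study’s actual columns and replacement values may differ.

import pandas as pd

# Hypothetical example: replace missing values with the median of each column
df = pd.DataFrame({"age": [34, None, 51, 46],
                   "wfh_percent": [80.0, 60.0, None, 100.0]})
for column in ["age", "wfh_percent"]:
    df[column] = df[column].fillna(df[column].median())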
Handling outliers: For handling outliers, we wrote the following function, which uses the Interquartile Range (IQR) method to detect outliers in a dataset and replace them with the column median. The IQR is a measure of the spread of the data, defined as the difference between the third quartile (Q3) and the first quartile (Q1).
# Function for handling outliers: values outside the IQR bounds are replaced
# with the column median
def remove_outliers(train, labels):
    for label in labels:
        # Quartiles and interquartile range for the current column
        q1 = train[label].quantile(0.25)
        q3 = train[label].quantile(0.75)
        iqr = q3 - q1
        upper_bound = q3 + 1.5 * iqr
        lower_bound = q1 - 1.5 * iqr
        # Replace values below the lower bound or above the upper bound
        # with the median of the column
        train[label] = train[label].mask(train[label] < lower_bound, train[label].median(), axis=0)
        train[label] = train[label].mask(train[label] > upper_bound, train[label].median(), axis=0)
    return train
The function first calculates the IQR for each label (i.e., column) in the dataset by using the quantile() method to find the 25th and 75th percentiles. Next, the function identifies outliers as values that lie more than 1.5 times the IQR beyond the upper or lower quartile. These are considered extreme values that are unlikely to be representative of the underlying population and are typically removed or replaced. Here, rather than dropping rows, the function replaces values that are less than the lower bound, or greater than the upper bound, with the median value of the column. This is done using the mask() method, which replaces values satisfying the given condition with the column median. By using the IQR method to identify and handle outliers, this function provides a quick and simple way to clean a dataset and improve the reliability of subsequent statistical analyses.
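For illustration, the function could be applied as follows. The DataFrame, column names, and values are hypothetical (the value 230 is an artificial outlier); this is not the study’s data.

import pandas as pd

# Hypothetical usage of the remove_outliers() function defined above
df = pd.DataFrame({"age": [28, 35, 47, 64, 230],
                   "wfh_percent": [80, 60, 40, 100, 20]})
df = remove_outliers(df, ["age", "wfh_percent"])
print(df)  # the outlying age of 230 is replaced by the column median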
Data Integration: This step aims to form a unified dataset from different sources and involves tasks such as handling inconsistent data structures, data format differences, and naming conflicts. This step was not needed in this study, as all the data were collected from a single source using a web-based survey.
Data Transformation: This step aims to convert the data into a format suitable for analysis. As mentioned earlier, in this study the categorical string variables were converted to numeric variables using the Pandas get_dummies() method in Python (Cerda, Varoquaux and Kégl, 2018; Sahoo et al., 2019), which is more suitable for quantitative analysis; a sketch of this step is given below.
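A minimal sketch of this encoding step is shown below. The column name gender and the drop_first/dtype settings are assumptions for illustration; the study’s actual encoding choices may differ.

import pandas as pd

# Hypothetical example: convert a categorical string column into numeric dummy variables
df = pd.DataFrame({"gender": ["Female", "Male", "Male", "Female"]})
df = pd.get_dummies(df, columns=["gender"], drop_first=True, dtype=int)
# drop_first=True keeps a single column, gender_Male, coded 0/1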
Data Reduction: This step includes several tasks that reduce the amount of data by removing noisy, irrelevant, and redundant data. To achieve this, dimensionality reduction techniques such as feature selection and principal component analysis (PCA) can be used. As all the questions in the survey were designed for a specific purpose, there were no noisy or irrelevant data, and this study therefore did not need this step.
Data Discretization: This step converts continuous data into discrete categories using techniques such as binning (for example, equal-width or equal-frequency binning). This study used binning to form discrete categories for three variables: years of experience, age, and the number of households; a sketch of this step is given below.
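A minimal sketch of this binning step is shown below. The bin edges, labels, and column name are assumptions made for illustration and do not necessarily match the categories used in the study.

import pandas as pd

# Hypothetical example: discretize a continuous variable into labelled bins
df = pd.DataFrame({"age": [28, 35, 47, 64]})
df["age_group"] = pd.cut(df["age"], bins=[20, 40, 55, 70], labels=["20-40", "41-55", "56-70"])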
Linearity check: This step examines whether the relationship between the two variables being analysed is linear, that is, whether the points follow a straight line or exhibit a curved pattern. When conducting a Pearson correlation analysis, it is important to check for linearity to ensure that the relationship between the variables is accurately captured. A scatterplot can be used to visualize the relationship between two variables: if the points form a straight line, the relationship is linear.
Figure 1, Figure 2, and Figure 3 show the scatter plots between the variables.
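As an illustration of such a check, a scatter plot can be produced with matplotlib. The data and column names below are hypothetical and are not the data behind Figures 1, 2, and 3.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical example data for a visual linearity check between two variables
df = pd.DataFrame({"wfh_percent": [20, 40, 60, 80, 100],
                   "collaboration_problem": [1, 2, 2, 3, 4]})
plt.scatter(df["wfh_percent"], df["collaboration_problem"])
plt.xlabel("WFH%")
plt.ylabel("Collaboration problem")
plt.show()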
Normality check: This step refers to whether the data being analysed follows a normal distribution or bell curve. In the context of a Pearson correlation analysis, it is important to check for normality to ensure that the data being analysed is appropriate for the analysis. A histogram can be used to check for normality. In a histogram, the frequency distribution of the data is displayed as bars. If the bars form a bell-shaped curve, then the data follows a normal distribution.
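Similarly, a histogram-based normality check can be sketched as follows; the age values are hypothetical and are not the study’s data.

import matplotlib.pyplot as plt

# Hypothetical example data for a histogram-based normality check
ages = [28, 31, 35, 38, 42, 45, 47, 49, 52, 55, 58, 64]
plt.hist(ages, bins=6)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()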
Homoscedasticity check: This step refers to whether the variability of the data is consistent across all levels of the variables being analysed. When conducting a Pearson correlation analysis, it is important to check for homoscedasticity to ensure that the results accurately reflect the relationship between the variables being analysed. A scatterplot can be used to check for homoscedasticity: if the variability of the data is consistent across all levels of the variables, the points will be evenly scattered around the line of best fit. Alternatively, a residual plot can be used, in which the residuals are plotted against the predicted values; if the points are evenly scattered around the horizontal line at zero, the data are homoscedastic. As we do not aim to predict one variable from the others in this dataset, calculating residuals from a best-fit line was not applicable in our study.
In summary, to conduct a Pearson correlation analysis, it is important to check for linearity, normality, and homoscedasticity to ensure that the relationship between the variables is accurately captured and the results are reliable.
3.2. Findings
Figure 4 presents the descriptive statistics of the variables used in this study. This table contains statistics for a sample of 67 individuals who participated in the survey and answered several questions focusing on various aspects of working from home (WFH) and their experience with it. The variables included in the table are id (identification number), age, gender (0=Female, 1=Male), years of experience, household (number of people living in their household), WFH% (percentage of time spent working from home), hybrid (whether they work in a hybrid model or not), collaboration problem (how often they face problems collaborating with colleagues while WFH), reduce promotion chance (how much WFH reduces their chances of promotion), impact of WFH on employer (how much WFH impacts their employer), feeling left out (how often they feel left out while WFH), feeling better with WFH (how much they feel better while WFH), and feeling more active with WFH (how much they feel more active while WFH).
The “count” row shows that there were 67 individuals surveyed for each variable. The “mean” row shows the average value for each variable across the sample. The “std” row shows the standard deviation of the values for each variable, indicating how much variation there is within the sample. The “min” and “max” rows show the minimum and maximum values for each variable in the sample. Finally, the “25%”, “50%”, and “75%” rows show the quartiles for each variable, which can be used to determine the range and distribution of values for each variable.
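Summary statistics of this kind are typically produced with the Pandas describe() method; a minimal sketch with hypothetical columns and values is shown below.

import pandas as pd

# Hypothetical example: count, mean, std, min, quartiles, and max for each variable
df = pd.DataFrame({"age": [28, 35, 47, 51, 64],
                   "wfh_percent": [80, 60, 40, 100, 20]})
print(df.describe())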
Based on these statistics, the average age of participants is 47.9 years, ranging from 28 to 64, with a standard deviation of 9.32.
There are several useful and insightful diagrams that can be created based on the above table. Here are some examples:
Histogram of Age Distribution:
Figure 5 shows the distribution of ages of the participants. The x-axis represents age, and the y-axis represents the frequency of individuals at each age. This can help us understand the age range of the participants and identify any patterns or trends.
Histogram of Gender Distribution:
Figure 6 shows the distribution of genders of the participants. The x-axis represents gender, and the y-axis represents the number of participants of each gender. This can help us understand the gender balance of the sample and identify any patterns or trends.
Histogram of years of experience distribution:
Figure 7 shows the distribution of years of experience of the participants. The x-axis represents the number of years of experience, and the y-axis represents the number of participants. This can help us understand the experience range of the participants and identify any patterns or trends.
Figure 8 shows the distribution of the number of households of the participants. The x-axis represents the number of households, and the y-axis represents the number of participants. This can help us understand the number of households of the participants.
Figure 9 shows that the correlations between the variables in the study are generally weak. As the figure shows, a correlation coefficient of 0.14 suggests a weak positive correlation between working from home and collaboration problems, meaning that as the frequency of working from home increases, collaboration problems also slightly increase. Similarly, a correlation coefficient of -0.08 suggests a weak negative correlation between working from home and reducing the chance of promotion, meaning that as the frequency of working from home increases, the chance of promotion slightly decreases. Finally, a correlation coefficient of 0.11 suggests a weak positive correlation between the number of households and the happiness of employees while working from home, meaning that as the number of households increases, the happiness of employees while working from home also slightly increases. It is important to note that these correlations are weak and, as the hypothesis analysis below shows, not statistically significant, indicating that the relationships between the variables are not strong.
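Pairwise correlations of this kind can be computed with the Pandas corr() method. The columns and values below are hypothetical and do not reproduce the coefficients reported above.

import pandas as pd

# Hypothetical example: Pearson correlation matrix between survey variables
df = pd.DataFrame({"wfh_percent": [20, 40, 60, 80, 100],
                   "collaboration_problem": [1, 3, 2, 3, 4],
                   "household": [1, 2, 4, 3, 2]})
print(df.corr(method="pearson"))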
3.3. Hypotheses Analysis
Figure 10 shows the p-values of the different variables.
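For reference, a Pearson correlation coefficient and its two-sided p-value can be computed with scipy.stats.pearsonr, as sketched below; the data are purely illustrative and do not reproduce the reported p-values.

from scipy.stats import pearsonr

# Hypothetical example: correlation coefficient and p-value for two variables
wfh_percent = [20, 40, 60, 80, 100]
collaboration_problem = [1, 3, 2, 3, 4]
r, p_value = pearsonr(wfh_percent, collaboration_problem)
print(f"r = {r:.2f}, p = {p_value:.3f}")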
Hypothesis 1:
The null hypothesis (H0) states that there is no correlation between working from home and collaboration, while the alternative hypothesis (H1) states that there is a correlation. The p-value for this hypothesis is 0.27.
If the significance level (α) is set to 0.05, this means that we would reject the null hypothesis if the p-value is less than 0.05, and fail to reject it if the p-value is greater than or equal to 0.05.
Since the p-value for this hypothesis is greater than α (0.27 > 0.05), we fail to reject the null hypothesis. Therefore, we cannot conclude that there is a significant correlation between working from home and collaboration.
Hypothesis 2:
The null hypothesis (H0) states that there is no correlation between the number of households of employees and employees’ happiness while WFH, while the alternative hypothesis (H1) states that there is a correlation. The p-value for this hypothesis is 0.374.
Again, if we set α to 0.05, we would reject the null hypothesis if the p-value is less than 0.05, and fail to reject it if the p-value is greater than or equal to 0.05.
Since the p-value for this hypothesis is greater than α (0.374 > 0.05), we fail to reject the null hypothesis. Therefore, we cannot conclude that there is a significant correlation between the number of households of employees and employees’ happiness while WFH.
Hypothesis 3:
The null hypothesis (H0) states that there is no correlation between working from home and employees’ promotion chance, while the alternative hypothesis (H1) states that there is a correlation. The p-value for this hypothesis is 0.503.
Once again, if we set α to 0.05, we would reject the null hypothesis if the p-value is less than 0.05, and fail to reject it if the p-value is greater than or equal to 0.05.
Since the p-value for this hypothesis is greater than α (0.503 > 0.05), we fail to reject the null hypothesis. Therefore, we cannot conclude that there is a significant correlation between working from home and employees’ promotion chances while WFH.
In conclusion, none of the three hypothesis tests provided evidence to reject its null hypothesis. Therefore, we cannot conclude that there is a significant correlation between working from home and collaboration, between the number of households of employees and employees’ happiness while WFH, or between working from home and employees’ promotion chances.