Data Science/Data Analysis jobs are dead. Long Live Data Science jobs? How ChatGPT and Noteable enable the next gen of predictive analytics

26 min readAug 15, 2023


Photo by Andrea De Santis on Unsplash

I have been playing with LLMs for a while now, but today I decided to deviate from “what is currently on the market that I can use to build my own LLM” to “let’s see what they have come up with that I can use for (relatively) free.” Here is a setup where, for $20 per month (an OpenAI ChatGPT subscription), you can have yourself a personal data scientist/data analyst who will do the work for you.

Let’s dig in and see if the Data Scientist/Data Analyst job is dead.

Tools: chat.openai.com (a ChatGPT-4 subscription with the plugin feature is required) plus https://app.noteable.io/ (the free tier is good enough). A simple Google search for a CSV returned Business Employment data from New Zealand.

Rules: do not lead the robot; keep the prompts as vague as possible and see what it can do.

Results: questionable. As you can see below, the robot was mostly prompted with “yes, please” and, in a few instances, guided toward a more specific area, but overall it did everything (Python code, reasoning, graphs, etc.) by itself. If you have experience in predictive analytics, as I do, it is very easy to guide the robot through practically every known machine learning use case, unveiling the secrets of the data set step by step. If you do not have that knowledge, the robot will most likely “hallucinate.” Therefore, the data scientist’s job will most likely shift from writing and debugging code to overseeing the robot’s work and “navigating” the workflow.

Verdict: Impressive. I probably would not personally invest in a Data Analysis major at this point. Why? Because robots will almost certainly take over the domain in the near future. Does that mean humans have no chance? Not at all. Think about it like this: we used to travel between New York and SF by horse, and it took months; these days you can take a supersonic jet and cover the distance in a few hours. The result is still the same: getting from point A (New York or, in the data case, raw data without insights) to point B (SF or, in this case, valuable insights and an action plan derived from the data) in the minimum amount of time. Of course, some people will say it is about the journey and not the destination, but if you have never been stuck in traffic in the Nevada desert on I-80 (or waited for a query to finish after the estimate said it would take “a while”), you will not understand the excitement about the new tool (the supersonic jet).

SME comments: I did not debug or test the data. If you are a data scientist, I highly suggest implementing DQ (data quality) checks and QA (quality assurance) unit tests in the workflow, because at this point in time (mid-August 2023) you do not know whether the robot is BS-ing you (noteable.io provides the code, but that is the point: check the code via test cases), whether the data is wrong (if you are not an SME on the New Zealand employment market, the results might be mathematically correct but meaningless in the real world), or whether everything is perfectly correct (yay for New Zealand having growth in the administrative job sector!).
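
To make that concrete, here is a minimal sketch of the kind of DQ/QA checks I mean, written in pandas. The file name and column names (“filled jobs”, “filled jobs revised”, “quarter”, “industry_name”) are taken from the robot’s own summary further below, so treat them as assumptions, not gospel.

```python
import pandas as pd

# File and column names are assumptions based on the robot's summary of the CSV.
df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")

def run_dq_checks(frame: pd.DataFrame) -> None:
    # No missing values in the columns we model on.
    assert frame["filled jobs"].notna().all(), "filled jobs has missing values"
    # Job counts can never be negative.
    assert (frame["filled jobs"] >= 0).all(), "negative filled jobs found"
    # Revisions should stay within a sane band of the original figure (here +/-20%).
    pct_change = (frame["filled jobs revised"] - frame["filled jobs"]).abs() / frame["filled jobs"]
    assert (pct_change < 0.20).all(), "suspiciously large revision detected"
    # Every quarter/industry pair should appear exactly once.
    assert not frame.duplicated(subset=["quarter", "industry_name"]).any(), "duplicate rows"

run_dq_checks(df)
print("All DQ checks passed")
```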

Last comments: ChatGPT produced nice-looking graphs and such, but I was too lazy to copy them all here; I have included a few of the nicer ones. Noteable.io is VERY impressive.

Tanya: Please set my default project in Noteable to https://app.noteable.io/p/1a19e254-37d1-4c24-bcbf-cdd9071f3d32/Testing

ChatGPT: Used Noteable

Your default project in Noteable has been successfully set to Testing. You can now work within this project by default. Is there anything else I can assist you with?

Tanya: Please analyze the data in the business-employment-data-march-2023-quarter-industry-revisions.csv file and produce some data analysis and visualizations

ChatGPT : Used Noteable

It seems like I made a mistake in identifying the file by its name. Let’s first find the file in your project so that I can access its details.

I’ll list the files in your project, and you can confirm the correct one for analysis. Used Noteable

Great! I found the file “business-employment-data-march-2023-quarter-industry-revisions.csv” in your project.

Let’s start by loading the data and taking a quick look at the first few rows to understand its structure. Shall we?

Tanya: yes, please

ChatGPT: Used Noteable

I’ve created a new notebook for our analysis. You can follow along with the progress here.

Let’s start by loading the data and taking a look at the first few rows.

The data has been successfully loaded, and it appears to contain information about employment statistics for different industries. Here’s a brief overview of the columns:

  • index: Index column
  • quarter: Quarter of the year
  • SER_REF: Service Reference
  • industry_code: Industry Code
  • industry_name: Name of the Industry
  • filled jobs: Number of filled jobs
  • filled jobs revised: Revised number of filled jobs
  • filled jobs diff: Difference in filled jobs
  • filled jobs % diff: Percentage difference in filled jobs
  • total_earnings: Total earnings

The data seems to be truncated in the preview, but we can proceed with some exploratory data analysis (EDA) to understand the trends, distributions, and relationships within the data.

Would you like to proceed with specific analyses or should I provide a general overview of the data?

Tanya: general review first, please, EDA only. Also, include only the top 2 rows for sale data

ChatGPT : Certainly! Let’s start with a general exploratory data analysis (EDA) to understand the data’s characteristics, distribution, and potential relationships. We’ll also include only the top 2 rows for a quick preview.

Here’s what we’ll do:

  1. Summary Statistics: Provide summary statistics for numerical columns.
  2. Data Types and Missing Values: Check the data types and any missing values.
  3. Visualizations: Create visualizations to understand the distribution and relationships between variables.

Let’s start with the summary statistics and the first two rows of the data.
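
For reference, the cell behind this step is essentially two pandas calls. This is my sketch, not the robot’s actual code, and the file name is assumed:

```python
import pandas as pd

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")

print(df.describe())  # count, mean, std, min, quartiles, max for numeric columns
print(df.head(2))     # only the first two rows, as requested
```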

Here’s a summary of the general review:

Summary Statistics

  • Quarter: Ranges from 2020.09 to 2022.12.
  • Filled Jobs: The average number of filled jobs is approximately 215,672, with a minimum of 5,271 and a maximum of 2,233,837.
  • Filled Jobs Revised: Similar to filled jobs, with slight revisions.
  • Filled Jobs Diff: The difference in filled jobs ranges from -161 to 28,071, with an average of 2,063.
  • Filled Jobs % Diff: The percentage difference in filled jobs ranges from -0.2% to 4.4%, with an average of 1.03%.
  • Total Earnings: The average total earnings are 3,653, with a range from 140 to 40,646.
  • Total Earnings Revised: Similar to total earnings, with slight revisions.
  • Earnings Diff: The difference in earnings ranges from -2 to 512, with an average of 37.28.
  • Earnings % Diff: The percentage difference in earnings ranges from -0.7% to 18.1%, with an average of 1.16%.

First Two Rows

  • Agriculture, Forestry, and Fishing: 93,195 filled jobs, revised to 93,706, with a difference of 511 (0.5%) and total earnings of 1,205.
  • Mining: 5,340 filled jobs, revised to 5,355, with a difference of 15 (0.3%) and total earnings of 145.

Next, we can check the data types and any missing values, and then proceed with visualizations. Shall we continue?

Tanya: yes, please

ChatGPT : Great! Let’s continue with the exploratory data analysis (EDA) by checking the data types of each column and looking for any missing values. This will help us understand the structure of the data and ensure that it’s in the correct format for analysis.

After that, we’ll create some visualizations to explore the distribution and relationships between variables.

Let’s start by checking the data types and missing values.
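
Again, the underlying code is trivial; my sketch of such a cell (file name assumed):

```python
import pandas as pd

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")

print(df.dtypes)          # data type of every column
print(df.isnull().sum())  # missing-value count per column
```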

Data Types and Missing Values

Data Types

  • quarter: Float64
  • SER_REF: Object (String)
  • industry_code: Object (String)
  • industry_name: Object (String)
  • filled jobs: Integer
  • filled jobs revised: Integer
  • filled jobs diff: Integer
  • filled jobs % diff: Float64
  • total_earnings: Integer
  • total earnings revised: Integer
  • earnings diff: Integer
  • earnings % diff: Float64

Missing Values

There are no missing values in any of the columns, which is great as it means we don’t have to handle any missing data.

Now, let’s move on to visualizations to explore the distribution and relationships between variables. We’ll create the following visualizations:

  1. Histograms: To understand the distribution of numerical variables.
  2. Correlation Heatmap: To understand the relationships between numerical variables.
  3. Bar Plot: To visualize the average filled jobs by industry.

Shall we proceed with these visualizations?

Tanya: yes, please

ChatGPT : Excellent! Let’s start by creating histograms for the numerical variables to understand their distribution. This will give us insights into the range, central tendency, and spread of the data.

We’ll create histograms for the following columns:

  • filled jobs
  • filled jobs revised
  • filled jobs diff
  • filled jobs % diff
  • total_earnings
  • total earnings revised
  • earnings diff
  • earnings % diff

Let’s begin with the histograms.
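
A histogram cell like the one the robot generated typically boils down to something like this. My sketch, assuming matplotlib and the column names the robot reported:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")

# One histogram per numeric column (filled jobs, earnings, diffs, etc.).
df.select_dtypes("number").hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()
```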

Here are the histograms for the numerical variables:

  1. Filled Jobs
  2. Filled Jobs Revised
  3. Filled Jobs Difference
  4. Filled Jobs % Difference
  5. Total Earnings
  6. Total Earnings Revised
  7. Earnings Difference
  8. Earnings % Difference

These histograms provide insights into the distribution of each variable. Most of the variables seem to be right-skewed, indicating that the majority of the values are concentrated on the lower end.

Next, we’ll create a correlation heatmap to understand the relationships between numerical variables, followed by a bar plot to visualize the average filled jobs by industry.

Shall we continue?

Tanya: yes, please

ChatGPT : Great! Let’s continue with the exploratory data analysis (EDA) by creating a correlation heatmap. This will help us understand the relationships between numerical variables and identify any strong correlations that might exist.

After that, we’ll create a bar plot to visualize the average filled jobs by industry, which will provide insights into the employment trends across different industries.

Let’s start with the correlation heatmap.
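
The heatmap itself is a one-liner once you have the correlation matrix. My sketch of such a cell, assuming seaborn is available:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")

# Pairwise correlations between the numeric columns, drawn as a heatmap.
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()
```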

Here’s the correlation heatmap for the numerical variables:

The heatmap shows the correlation coefficients between pairs of numerical variables. A correlation close to 1 indicates a strong positive relationship, while a correlation close to -1 indicates a strong negative relationship.

Some observations:

  • filled jobs and filled jobs revised have a strong positive correlation, as expected.
  • total_earnings and total earnings revised also have a strong positive correlation.
  • Other variables show moderate to weak correlations with each other.

Next, let’s create a bar plot to visualize the average filled jobs by industry. This will provide insights into the employment trends across different industries.

Shall we proceed?

Tanya: yes please

ChatGPT : Great! Let’s create a bar plot to visualize the average filled jobs by industry. This will help us understand which industries have the highest and lowest average employment.

We’ll sort the industries by the average number of filled jobs to make the visualization more informative.

Let’s create the bar plot.
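
My sketch of the bar-plot cell (column names assumed from the robot’s summary, not copied from Noteable):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")

# Average filled jobs per industry, sorted so the biggest employers come first.
avg_jobs = (
    df.groupby("industry_name")["filled jobs"]
      .mean()
      .sort_values(ascending=False)
)
avg_jobs.plot(kind="barh", figsize=(10, 8))
plt.xlabel("Average filled jobs")
plt.gca().invert_yaxis()  # largest industry at the top
plt.tight_layout()
plt.show()
```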

Here’s the bar plot showing the average filled jobs by industry:

This visualization provides insights into the employment trends across different industries. Some industries have significantly higher average employment, while others have lower levels.

Summary of EDA

  • Summary Statistics: Provided insights into the range, mean, and other statistical measures of numerical variables.
  • Data Types and Missing Values: Checked the data types and confirmed no missing values.
  • Histograms: Visualized the distribution of numerical variables.
  • Correlation Heatmap: Analyzed the relationships between numerical variables.
  • Bar Plot: Visualized the average filled jobs by industry.

This general review gives us a good understanding of the data and can guide further analysis.

Is there anything specific you would like to explore further or any other questions you have about the data?

Tanya: yes, can you build any prediction models on this data please?

ChatGPT : Certainly! Building a predictive model requires defining a specific target variable (what we want to predict) and selecting appropriate features (what we’ll use to make the predictions).

Given the data, here are a few potential modeling scenarios:

  1. Time Series Forecasting: Predict future filled jobs or earnings for a specific industry using historical data.
  2. Regression Analysis: Predict a continuous variable (e.g., filled jobs or earnings) based on other features (e.g., industry, quarter).
  3. Classification Analysis: If we have a categorical target, we could predict categories based on features.

Could you please specify what you would like to predict and any particular features or industries you’d like to focus on? If you’re not sure, I can suggest a modeling approach based on the data.

Tanya: First, let’s predict future filled jobs or earnings for a specific industry using historical data.

ChatGPT : Certainly! Predicting future filled jobs or earnings for a specific industry using historical data is a time series forecasting task.

Could you please specify the industry you’d like to focus on? If you’re not sure, we can choose one of the industries with more data to ensure a robust analysis.

Once we have the industry, we’ll proceed with the following steps:

  1. Extract Data: Extract the time series data for the specified industry.
  2. Visualize the Time Series: Plot the time series to understand the trend and seasonality.
  3. Preprocess the Data: Handle any missing values, and possibly apply transformations.
  4. Build and Train the Model: Select a suitable time series model, such as ARIMA, and train it on the data.
  5. Evaluate the Model: Assess the model’s performance using appropriate metrics.
  6. Forecast Future Values: Use the model to forecast future filled jobs or earnings.

Please let me know the industry you’d like to focus on, or if you’d like me to choose one.

Tanya: Please select an industry at random

ChatGPT : Certainly! Let’s randomly select an industry from the available data and proceed with the time series forecasting for filled jobs.

We’ll start by listing the unique industries in the data and then randomly selecting one for analysis.

Let’s begin!

We have randomly selected the “Administrative and Support Services” industry for our time series forecasting.

Here’s the plan:

  1. Extract Data: Extract the time series data for the “Administrative and Support Services” industry.
  2. Visualize the Time Series: Plot the time series to understand the trend and seasonality.
  3. Preprocess the Data: Handle any missing values, and possibly apply transformations.
  4. Build and Train the Model: Select a suitable time series model, such as ARIMA, and train it on the data.
  5. Evaluate the Model: Assess the model’s performance using appropriate metrics.
  6. Forecast Future Values: Use the model to forecast future filled jobs.

Let’s start by extracting and visualizing the time series data for the “Administrative and Support Services” industry. Shall we?

Tanya: yes, please

ChatGPT : Great! Let’s start by extracting the time series data for the “Administrative and Support Services” industry. We’ll focus on the “filled jobs” column and visualize the trend over time.

We’ll plot the time series to understand the underlying pattern, trend, and seasonality in the data.

Let’s begin by extracting and visualizing the data.

Here’s the time series plot of filled jobs in the “Administrative and Support Services” industry:

The plot shows the trend of filled jobs over different quarters. We can observe some fluctuations, but there doesn’t appear to be a clear seasonal pattern.

Next, we’ll proceed with preprocessing the data, which may include handling any missing values and applying transformations if necessary. Then, we’ll build and train a time series model, such as ARIMA, to forecast future filled jobs.

Shall we continue with the preprocessing and model building?

Tanya: yes, please

ChatGPT : Great! Let’s continue with the preprocessing of the time series data.

Preprocessing is an essential step in time series analysis, and it may include:

  1. Checking for Missing Values: Ensuring that there are no missing values in the time series.
  2. Resampling: If needed, resampling the data to a consistent frequency (e.g., quarterly).
  3. Stationarity Check: Testing if the time series is stationary, as many models require this property.
  4. Transformations: Applying transformations if necessary to stabilize the variance or make the series stationary.

We’ll start by checking for missing values and testing the stationarity of the time series. If the series is not stationary, we may need to apply differencing or other transformations.

Let’s begin with these preprocessing steps.
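
The stationarity test it keeps referring to is the Augmented Dickey-Fuller test from statsmodels. My sketch of that cell, with the industry and column names as reported above:

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")

# Filled jobs for one industry, ordered by quarter.
series = (
    df[df["industry_name"] == "Administrative and Support Services"]
      .sort_values("quarter")["filled jobs"]
      .reset_index(drop=True)
)

print("Missing values:", series.isna().sum())

# Augmented Dickey-Fuller test: the null hypothesis is that the series has a unit root.
result = adfuller(series)
print("ADF statistic:", round(result[0], 4))
print("p-value:", round(result[1], 4))
print("Critical values:", result[4])
```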

Here are the preprocessing results:

Missing Values Check

  • Filled Jobs: No missing values.

Stationarity Check (Augmented Dickey-Fuller Test)

  • Test Statistic: -1.1283
  • p-value: 0.7037
  • Critical Values: The critical values at 1%, 5%, and 10% significance levels are -4.6652, -3.3672, and -2.8029, respectively.

The p-value is greater than 0.05, indicating that we fail to reject the null hypothesis that the time series has a unit root and is non-stationary.

Since the series is non-stationary, we may need to apply differencing or other transformations to make it stationary before modeling.

Would you like to proceed with differencing the series and building the ARIMA model, or do you have any specific preferences for the modeling approach?

Tanya: yes please

ChatGPT : Great! Since the time series is non-stationary, we’ll apply differencing to make it stationary. Differencing is a common preprocessing step in time series analysis, especially for ARIMA modeling.

We’ll proceed with the following steps:

  1. Apply Differencing: Apply first-order differencing to the series and check if it becomes stationary.
  2. Build ARIMA Model: Use the differenced series to build an ARIMA (AutoRegressive Integrated Moving Average) model.
  3. Train the Model: Fit the model to the data.
  4. Evaluate the Model: Assess the model’s performance using appropriate metrics.
  5. Forecast Future Values: Use the model to forecast future filled jobs.

Let’s start by applying differencing and checking if the series becomes stationary.
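
Differencing plus the re-test is only a couple of lines. My sketch, repeating the series extraction so it runs on its own:

```python
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.tsa.stattools import adfuller

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")
series = (df[df["industry_name"] == "Administrative and Support Services"]
            .sort_values("quarter")["filled jobs"].reset_index(drop=True))

# First-order differencing, then re-run the ADF test on the differenced series.
diff = series.diff().dropna()
result = adfuller(diff)
print("ADF statistic (differenced):", round(result[0], 4))
print("p-value (differenced):", result[1])

diff.plot(title="Differenced filled jobs")
plt.show()
```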

Here’s the plot of the differenced time series for filled jobs in the “Administrative and Support Services” industry:

Augmented Dickey-Fuller Test (Differenced Series)

  • Test Statistic: -5.9639
  • p-value: 2.0088e-07
  • Critical Values: The critical values at 1%, 5%, and 10% significance levels are -4.66519, -3.36719, and -2.80296, respectively.

The p-value is less than 0.05, indicating that the differenced series is stationary. This means we can proceed with building the ARIMA model using the differenced series.

Next, we’ll build and train an ARIMA model, evaluate its performance, and use it to forecast future filled jobs for the selected industry.

Shall we proceed with building the ARIMA model?

Tanya: yes, please

ChatGPT : Great! Let’s proceed with building the ARIMA (AutoRegressive Integrated Moving Average) model for the differenced time series of filled jobs in the “Administrative and Support Services” industry.

The ARIMA model has three main parameters:

  • p: Order of the autoregressive part (AR).
  • d: Order of differencing.
  • q: Order of the moving average part (MA).

We’ll use the differenced series, so the order of differencing d will be 1. For the other parameters, we can use the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots to determine appropriate values, or we can use an automated approach to find the best parameters.

Let’s start by plotting the ACF and PACF to guide our selection of p and q, and then we’ll build the ARIMA model.

Shall we begin?

Tanya: yes, please

ChatGPT : Great! Let’s start by plotting the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) for the differenced series. These plots will help us determine the appropriate values for the AR (p) and MA (q) components of the ARIMA model.

The ACF plot shows the correlation of the series with its lags, while the PACF plot shows the correlation of the series with its lags after removing the effect of previous lags.

Let’s plot the ACF and PACF.

It seems like we encountered an error while plotting the ACF and PACF. This could be related to the shape of the data or the specific lags used in the plot.

Let’s try a different approach by using an automated method to find the best ARIMA parameters (p, d, q) for our series. We can use a grid search to find the parameters that minimize the AIC (Akaike Information Criterion), a common criterion used to select models.

Shall we proceed with the automated parameter selection and model building?

Tanya: yes please

ChatGPT : Certainly! We’ll proceed with an automated approach to find the best ARIMA parameters (p, d, q) for our series. We’ll use a grid search to find the parameters that minimize the AIC (Akaike Information Criterion).

The steps will include:

  1. Grid Search: Iterate through different combinations of p, d, and q to find the best parameters.
  2. Build ARIMA Model: Build the ARIMA model using the best parameters.
  3. Train the Model: Fit the model to the data.
  4. Evaluate the Model: Assess the model’s performance using appropriate metrics.
  5. Forecast Future Values: Use the model to forecast future filled jobs.

Let’s start with the grid search to find the best parameters.
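
What it was going for here is a brute-force search over small (p, d, q) values, keeping whichever order gives the lowest AIC. My sketch of that kind of grid search with statsmodels (all names are my own assumptions):

```python
import itertools
import warnings

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")
series = (df[df["industry_name"] == "Administrative and Support Services"]
            .sort_values("quarter")["filled jobs"].reset_index(drop=True))

warnings.filterwarnings("ignore")  # convergence warnings are common on tiny series

best_aic, best_order = float("inf"), None
# Try every (p, d, q) in a small grid and keep the order with the lowest AIC.
for p, d, q in itertools.product(range(3), range(2), range(3)):
    try:
        fit = ARIMA(series, order=(p, d, q)).fit()
        if fit.aic < best_aic:
            best_aic, best_order = fit.aic, (p, d, q)
    except Exception:
        continue  # some orders simply fail to fit on short series

print("Best order:", best_order, "AIC:", round(best_aic, 2))
```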

It seems like there was an issue with the code snippet. Let’s correct it and rerun the grid search to find the best ARIMA parameters.

We’ll iterate through different combinations of p, d, and q to find the parameters that minimize the AIC (Akaike Information Criterion).

Let’s correct and rerun the code.

Used Noteable

It appears that the error persists, and it’s likely related to the specific code structure or library compatibility.

To move forward, we can simplify our approach by manually specifying the ARIMA parameters (p, d, q) based on general guidelines and the nature of the data. We can then fit the model, evaluate it, and make forecasts.

Here’s a common starting point for the parameters:

  • p (AR order): 1 or 2, based on the autocorrelation of the series.
  • d (differencing order): 1, as we’ve already differenced the series once.
  • q (MA order): 1 or 2, based on the moving average nature of the series.

Would you like to proceed with this approach, or do you have specific preferences for the parameters?

Tanya: yes, please

ChatGPT : Great! Let’s proceed with building the ARIMA model using the following parameters:

  • p (AR order): 1
  • d (differencing order): 1
  • q (MA order): 1

These parameters are a common starting point for ARIMA modeling, and they should provide a reasonable fit for our time series data.

We’ll build the ARIMA model, fit it to the data, and then plot the fitted values along with the original series to visualize how well the model fits the data.

Let’s start building the ARIMA model.

It seems like we encountered an error because the ARIMA class is not defined. This could be due to a missing import statement or a compatibility issue with the library.

Let’s correct the import statement and rerun the code to build and fit the ARIMA model.
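
The missing import it mentions is most likely the standard statsmodels one; my sketch of what the corrected cell probably looks like:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA  # the import that was missing

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")
series = (df[df["industry_name"] == "Administrative and Support Services"]
            .sort_values("quarter")["filled jobs"].reset_index(drop=True))

# ARIMA(p=1, d=1, q=1) on the raw series; d=1 lets the model handle the differencing.
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.summary())          # AR/MA coefficients, sigma2, p-values
print(model.forecast(steps=4))  # forecast the next four quarters
```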

The ARIMA model with parameters (p=1, d=1, q=1) has been successfully fitted to the time series data for filled jobs in the “Administrative and Support Services” industry.

Here’s the plot showing the original series and the fitted values:

Model Summary

  • AR Coefficient (ar.L1): -0.9054
  • MA Coefficient (ma.L1): 1.0000
  • Variance (sigma2): 8.894e+06

The AR coefficient is significant, as indicated by the p-value close to 0. The MA coefficient is not as significant, with a p-value of 0.139.

The model seems to capture the general trend and seasonality of the data, but further tuning and validation might be needed for more accurate forecasting.

Would you like to proceed with forecasting future filled jobs using this model, or do you have any specific adjustments or requests?

Tanya: let’s do regression analysis on Predict a continuous variable (e.g., filled jobs or earnings) based on other features (e.g., industry, quarter).

ChatGPT : Certainly! Regression analysis is a powerful method for predicting a continuous variable based on other features. In this case, we can predict filled jobs or earnings based on features such as industry and quarter.

Since we have selected the “Administrative and Support Services” industry for our analysis, we can use the quarter as a feature to predict filled jobs or earnings.

Here’s the plan:

  1. Feature Engineering: Create features such as the quarter number.
  2. Data Splitting: Split the data into training and testing sets.
  3. Model Building: Build a regression model (e.g., linear regression).
  4. Model Training: Train the model on the training data.
  5. Model Evaluation: Evaluate the model on the testing data.
  6. Prediction: Make predictions for future quarters.

Let’s start with feature engineering and data splitting. Shall we?

Tanya: yes, please

ChatGPT : Great! Let’s start with feature engineering and data preparation. We’ll create features such as the quarter number and split the data into training and testing sets.

For the regression analysis, we’ll predict the “filled jobs” in the “Administrative and Support Services” industry based on the quarter.

Here’s what we’ll do:

  1. Extract Relevant Data: Focus on the “filled jobs” for the selected industry.
  2. Create Features: Convert the quarter into numerical features.
  3. Split Data: Divide the data into training and testing sets.

Let’s begin with these steps.
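
My sketch of the feature-engineering and split step, assuming scikit-learn and the column names reported earlier:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")
ind = (df[df["industry_name"] == "Administrative and Support Services"]
         .sort_values("quarter").reset_index(drop=True))

# "quarter_number" is simply the sequential position of each quarter (1, 2, 3, ...).
ind["quarter_number"] = ind.index + 1

X = ind[["quarter_number"]]
y = ind["filled jobs"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train, y_train, sep="\n")
```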

We have successfully prepared the data for regression analysis. Here’s a preview of the training data:

Features (X_train)

  • quarter_number: 6, 1, 8, 3, 10

Target (y_train)

  • filled jobs: 98,098; 93,653; 102,981; 90,671; 99,334

The feature “quarter_number” represents the sequential number of the quarter, and the target is the number of filled jobs in the “Administrative and Support Services” industry.

Next, we’ll build a linear regression model, train it on this data, and evaluate its performance.

Shall we proceed?

Tanya: yes, please

ChatGPT : Great! Let’s proceed with building and training a linear regression model.

Linear regression is a simple and widely used method for predicting a continuous variable based on one or more features. In this case, we’ll use the quarter number as the feature to predict filled jobs in the “Administrative and Support Services” industry.

Here’s what we’ll do:

  1. Build the Model: Create a linear regression model.
  2. Train the Model: Fit the model to the training data.
  3. Evaluate the Model: Assess the model’s performance on the testing data.
  4. Visualize the Fit: Plot the original data and the regression line.

Let’s start by building and training the linear regression model.
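
The regression cell itself is a handful of scikit-learn lines. My sketch, repeating the split so it runs on its own:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")
ind = (df[df["industry_name"] == "Administrative and Support Services"]
         .sort_values("quarter").reset_index(drop=True))
ind["quarter_number"] = ind.index + 1

X_train, X_test, y_train, y_test = train_test_split(
    ind[["quarter_number"]], ind["filled jobs"], test_size=0.2, random_state=42)

# Fit a straight line through filled jobs vs. quarter number and score it on the test set.
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))  # negative R^2 means worse than predicting the mean
```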

The linear regression model has been successfully built and trained to predict filled jobs in the “Administrative and Support Services” industry based on the quarter number.

Here’s the plot showing the original data and the regression line:

Model Evaluation

  • Mean Squared Error (MSE): 5,557,663.32
  • R-squared (R²): -0.1641

The R-squared value is negative, indicating that the linear model is not fitting the data well. This could be due to the simplicity of the linear model and the complex underlying pattern in the data.

We may consider using more sophisticated models or additional features to capture the underlying pattern. Alternatively, we could explore other types of regression models, such as polynomial regression, to better fit the data.

Would you like to explore other modeling approaches, or do you have any specific requests or questions?

Tanya: Please suggest which variable we should use for Classification Analysis: If we have a categorical target, we could predict categories based on features.

ChatGPT : Certainly! Classification analysis is used to predict a categorical target variable based on features. In the dataset we have, we can consider the following options for classification analysis:

  1. Industry Classification: If we want to predict the industry category based on features like filled jobs, earnings, etc., we can use the “industry” variable as the target. This would involve predicting the industry type based on employment-related metrics.
  2. Earnings Category: We could create a categorical target based on earnings, such as “High Earnings” and “Low Earnings,” by defining thresholds. Features could include filled jobs, industry type, and quarter.
  3. Employment Growth Category: We could create a categorical target based on the growth of filled jobs, such as “Growing” and “Declining.” Features could include industry type, quarter, and earnings.
  4. Seasonal Employment Category: If there are seasonal patterns in employment, we could create categories like “Peak Season” and “Off Season” and predict them based on industry, filled jobs, and earnings.

Please let me know which option interests you, or if you have a specific idea in mind for classification analysis!

Tanya: Please create “Growing” and “Declining” target variables and do a classification

ChatGPT: Certainly! We can create a “Growing” and “Declining” target variable based on the growth of filled jobs in the selected industry (“Administrative and Support Services”). We’ll define the target categories as follows:

  • “Growing”: If the filled jobs in the current quarter are greater than the previous quarter.
  • “Declining”: If the filled jobs in the current quarter are less than or equal to the previous quarter.

We’ll then perform classification analysis using this target variable. The features can include the quarter number and other relevant variables.

Here’s the plan:

  1. Create Target Variable: Define the “Growing” and “Declining” categories.
  2. Feature Engineering: Create relevant features.
  3. Data Splitting: Split the data into training and testing sets.
  4. Model Building: Build a classification model (e.g., logistic regression).
  5. Model Training: Train the model on the training data.
  6. Model Evaluation: Evaluate the model on the testing data.

Let’s start with creating the target variable and preparing the data. Shall we?

Tanya: yes, please

ChatGPT : Certainly! Let’s begin by creating the target variable “Growth Category,” which will have two classes: “Growing” and “Declining.” We’ll define these categories based on the growth of filled jobs in the selected industry.

We’ll also create relevant features and split the data into training and testing sets.

Let’s start with these steps.
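
Creating the target is essentially a diff against the previous quarter. My sketch (not the robot’s actual cell):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")
ind = (df[df["industry_name"] == "Administrative and Support Services"]
         .sort_values("quarter").reset_index(drop=True))
ind["quarter_number"] = ind.index + 1

# "Growing" if filled jobs rose versus the previous quarter, otherwise "Declining".
ind["Growth Category"] = np.where(ind["filled jobs"].diff() > 0, "Growing", "Declining")
ind = ind.iloc[1:]  # the first quarter has no previous value to compare against

print(ind[["quarter", "filled jobs", "Growth Category", "quarter_number"]].head())
```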

We have successfully created the target variable “Growth Category” with two classes: “Growing” and “Declining.” Here’s a preview of the classification data:

index  quarter   filled jobs  Growth Category  quarter_number
1      2020.12   92,742       Declining        1
2      2021.03   90,671       Declining        2
3      2021.06   98,884       Growing          3
4      2021.09   94,843       Declining        4
5      2021.12   98,098       Growing          5

The feature “quarter_number” represents the sequential number of the quarter, and the target is the growth category based on filled jobs in the “Administrative and Support Services” industry.

Next, we’ll build a classification model (e.g., logistic regression), train it on this data, and evaluate its performance.

Shall we proceed?

Tanya: yes please

ChatGPT: Great! Let’s proceed with building and training a logistic regression model for classification.

Logistic regression is a widely used method for binary classification. In this case, we’ll use it to classify the growth category (“Growing” or “Declining”) based on the quarter number.

Here’s what we’ll do:

  1. Build the Model: Create a logistic regression model.
  2. Train the Model: Fit the model to the training data.
  3. Evaluate the Model: Assess the model’s performance on the testing data using metrics like accuracy.
  4. Visualize the Results: Plot the original data and the decision boundary.

Let’s start by building and training the logistic regression model.

Code built by the robot
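
The robot’s cell was a screenshot I did not copy over; a minimal sketch of the equivalent scikit-learn code (my naming, not necessarily what Noteable produced) would be:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")
ind = (df[df["industry_name"] == "Administrative and Support Services"]
         .sort_values("quarter").reset_index(drop=True))
ind["quarter_number"] = ind.index + 1
ind["Growth Category"] = np.where(ind["filled jobs"].diff() > 0, "Growing", "Declining")
ind = ind.iloc[1:]  # drop the first quarter, which has nothing to compare against

X_train, X_test, y_train, y_test = train_test_split(
    ind[["quarter_number"]], ind["Growth Category"], test_size=0.2, random_state=42)

# Logistic regression on a single feature: the quarter number.
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```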

The logistic regression model has been successfully built and trained to classify the growth category (“Growing” or “Declining”) based on the quarter number for the “Administrative and Support Services” industry.

Here’s the plot showing the original data and the decision boundary:

Model Evaluation

  • Accuracy: 50%
  • Confusion Matrix:
  • True Positive (Growing): 0
  • False Negative (Growing): 0
  • False Positive (Declining): 1
  • True Negative (Declining): 1

The model’s accuracy is 50%, indicating that it correctly classified half of the testing data. The confusion matrix shows that the model has difficulty distinguishing between the two classes.

This result may be due to the simplicity of the model and the limited features used. The quarter number alone may not be sufficient to capture the complex underlying pattern in the data.

We may consider using more sophisticated models or additional features to improve classification performance. Alternatively, we could explore other types of classification models or techniques.

Would you like to explore other modeling approaches, or do you have any specific requests or questions?

Tanya: explain the results like I am five

ChatGPT : Sure! Let’s break it down into a simple explanation:

Imagine you have a toy that can tell if a plant is growing or not growing. You show it pictures of a plant taken every three months (a quarter), and it tries to guess if the plant is growing or not growing.

  1. The Toy (Model): We made a toy (called a logistic regression model) that looks at the pictures (quarters) and tries to guess if the plant (jobs in a business area) is growing or not growing.
  2. The Pictures (Data): We showed the toy some pictures where we already knew if the plant was growing or not. We also kept some pictures hidden to test the toy later.
  3. Teaching the Toy (Training): We let the toy look at the pictures and learn from them. It tried to find a pattern to tell if the plant was growing or not.
  4. Testing the Toy (Evaluation): We then showed the toy the hidden pictures to see how well it could guess. It got half of them right and half of them wrong.
  5. The Red Line (Decision Boundary): The red line in the picture is like a seesaw. If the toy thinks the plant is more on the “growing” side, it guesses “growing.” If it thinks the plant is more on the “not growing” side, it guesses “not growing.”
  6. The Result: The toy didn’t do a great job. It’s like flipping a coin; sometimes it’s right, sometimes it’s wrong. It needs to learn more or maybe look at other things (like how much water or sunlight the plant gets) to guess better.

So, our toy is a fun start, but it needs more practice and maybe some new tricks to be really good at telling if the plant is growing or not! 🌱😊

Tanya: what other predictions can you make out of this data?

ChatGPT : Certainly! The dataset we have contains information about different industries, filled jobs, and earnings over various quarters. Here are some other predictions and analyses we can perform:

  1. Industry Comparison: Compare different industries to identify trends, growth patterns, or seasonal variations.
  2. Earnings Prediction: Build models to predict future earnings for a specific industry or across all industries.
  3. Job Growth Classification for Other Industries: Apply the same classification analysis to other industries to identify growing or declining sectors.
  4. Time Series Forecasting for Multiple Variables: Use advanced time series models to forecast both filled jobs and earnings simultaneously.
  5. Cluster Analysis: Group industries into clusters based on similar trends or patterns in employment and earnings.
  6. Anomaly Detection: Identify unusual patterns or anomalies in the data, such as unexpected spikes or drops in employment.
  7. Seasonal Analysis: Analyze seasonal patterns in employment and earnings, and predict peak and off-peak periods.
  8. Impact of External Factors: If additional data is available (e.g., economic indicators, government policies), we can analyze the impact of these factors on employment and earnings.
  9. Multi-Class Classification: Create more categories for growth (e.g., “Rapidly Growing,” “Slowly Growing,” “Stable,” “Declining”) and classify industries accordingly.
  10. Regional Analysis: If data is available for different regions or locations, we can perform regional comparisons and predictions.
  11. Correlation Analysis: Analyze the relationship between filled jobs and earnings to understand how they influence each other.
  12. Sentiment Analysis: If additional data like news articles or social media posts is available, we can analyze public sentiment towards different industries.

These are just a few examples of the analyses and predictions that can be made with this data. Depending on the specific goals and available additional data, we can tailor the analysis to provide valuable insights.

Please let me know if any of these options interest you or if you have a specific question or request in mind!

Tanya: please do anomaly detection

ChatGPT : Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior. It’s like finding the odd one out in a series of data points. In the context of our dataset, anomalies could represent unexpected spikes or drops in filled jobs or earnings that might indicate an error, an unusual event, or an important trend.

For the “Administrative and Support Services” industry, we can apply anomaly detection to the filled jobs data to identify any unusual quarters.

Here’s the plan:

  1. Choose a Method: Select an appropriate anomaly detection method. Common methods include statistical tests, machine learning models, and time series analysis.
  2. Preprocess the Data: Prepare the data for analysis, including any necessary transformations or scaling.
  3. Detect Anomalies: Apply the chosen method to identify anomalies in the data.
  4. Visualize the Results: Plot the data and highlight the anomalies to visualize the unusual patterns.
  5. Interpret the Results: Provide insights into what the anomalies might represent.

We’ll start by choosing a method and preprocessing the data. One common approach for time series data is to use the Z-score, which measures how many standard deviations a data point is from the mean. A high absolute Z-score indicates an anomaly.

Shall we proceed with this approach?

Tanya: yes please

ChatGPT : Great! Let’s proceed with anomaly detection using the Z-score method. We’ll apply this approach to the filled jobs data for the “Administrative and Support Services” industry.

Here’s what we’ll do:

  1. Calculate the Z-score: For each data point, we’ll calculate how many standard deviations it is from the mean.
  2. Identify Anomalies: We’ll define a threshold (e.g., a Z-score greater than 2) to identify anomalies. Data points with a Z-score beyond this threshold will be considered anomalies.
  3. Visualize the Results: We’ll plot the data and highlight the anomalies to visualize the unusual patterns.

Let’s start with these steps.
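
The Z-score approach is a few lines of pandas; my sketch of such a cell, under the same naming assumptions as before:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("business-employment-data-march-2023-quarter-industry-revisions.csv")
ind = (df[df["industry_name"] == "Administrative and Support Services"]
         .sort_values("quarter").reset_index(drop=True))

# Z-score: how many standard deviations each quarter sits from the mean.
jobs = ind["filled jobs"]
z = (jobs - jobs.mean()) / jobs.std()
anomalies = ind[z.abs() > 2]  # flag anything beyond 2 standard deviations

plt.plot(ind["quarter"], jobs, label="filled jobs")
plt.scatter(anomalies["quarter"], anomalies["filled jobs"], color="red", label="anomaly")
plt.legend()
plt.show()
print(anomalies[["quarter", "filled jobs"]])  # an empty frame means no anomalies at this threshold
```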

It seems like we encountered an error related to the “quarter” column not being present in the index. This happened because we were trying to access the “quarter” column, which was previously used as the index for the time series data.

I’ll fix this by resetting the index and including the “quarter” column in the data frame, and then we’ll re-run the anomaly detection process.

Let’s proceed with the corrected code.

Here’s the result of the anomaly detection for the “Administrative and Support Services” industry’s filled jobs data:

In this plot:

  • The blue line represents the number of filled jobs over different quarters.
  • The red dots are the anomalies, which are the unusual points that stand out from the rest.

It appears that there are no specific anomalies detected based on the Z-score method with a threshold of 2. This means that the filled jobs data for this industry does not have any extreme spikes or drops that would be considered unusual.

Anomalies can be very context-specific, and the threshold or method used might need to be adjusted based on domain knowledge or specific criteria. Additionally, other anomaly detection methods could be explored.

Would you like to explore further, adjust the parameters, or have any specific questions or requests?

Tanya: No.

Photo by Andy Kelly on Unsplash

The brave new world is here. Humans will no doubt adapt. And remember, robots can’t really do anything without humans. At least until cyber-mechanics are so advanced that our AI overlords decide humans are not needed anymore…

#tanyatalks


Written by Tanya Silva

Check out www.tanyatalks.com to learn about me! All opinions are my own.
