
Get All Stock Code Information from TWSE

Zephyr · Engineer · 3 min read


Cover Image: Automatically generated by GPT-4 after reading this article.


The stock codes in the Taiwanese market change periodically, making manual tracking impractical.

It's time to automate!

Setup Environment

Let's tackle the challenge by writing a program. First, install the necessary packages:

pip install requests beautifulsoup4
tip

This assumes you already have a working Python environment. The json module used below ships with Python's standard library, so it does not need to be installed separately.
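
If you want to confirm the packages installed correctly, a quick import check such as the following (purely a sanity check, not part of the scraper) is enough:

import requests
import bs4

print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)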

Target Webpages

Stock-related data resides on the website of the Taiwan Stock Exchange (TWSE). Let's identify the target pages:

[Screenshot: the TWSE website]

Note down these three URLs:

urls = [
    "https://isin.twse.com.tw/isin/C_public.jsp?strMode=2",  # Listed securities
    "https://isin.twse.com.tw/isin/C_public.jsp?strMode=4",  # OTC securities
    "https://isin.twse.com.tw/isin/C_public.jsp?strMode=5",  # Emerging stocks
]

Parsing the Webpage

[Screenshot: the securities table on the TWSE page]

Upon inspecting the webpage, we can see that the main table is a <table> element with class="h4".
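
To double-check that selector before writing the full script, a quick probe like this prints the first row of the table (a minimal sketch; the exact header text is whatever the page currently returns):

import requests
from bs4 import BeautifulSoup

url = "https://isin.twse.com.tw/isin/C_public.jsp?strMode=2"
response = requests.get(url)
response.encoding = 'big5'

soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', {'class': 'h4'})

if table is None:
    print("No table with class='h4' found; the page layout may have changed")
else:
    first_row = table.find('tr')
    print([cell.text.strip() for cell in first_row.find_all(['td', 'th'])])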

Now that we've located our target, let's start coding:

update_stocks_code.py
import json
import requests
from bs4 import BeautifulSoup

# Retrieve content from Taiwan Stock Exchange announcements
urls = [
    "https://isin.twse.com.tw/isin/C_public.jsp?strMode=2",  # Listed securities
    "https://isin.twse.com.tw/isin/C_public.jsp?strMode=4",  # OTC securities
    "https://isin.twse.com.tw/isin/C_public.jsp?strMode=5",  # Emerging stocks
]

# Collected stock information, keyed by stock code
data = {}

total_urls = len(urls)
for index, url in enumerate(urls, start=1):
    print(f"Processing URL {index}/{total_urls}: {url}")

    response = requests.get(url)
    response.encoding = 'big5'  # Set the correct encoding

    # Parse HTML using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', {'class': 'h4'})

    if not table:
        print(f"Table not found for URL: {url}")
        continue

    for row in table.find_all('tr')[1:]:  # Skip header row
        cells = row.find_all('td')
        if len(cells) != 7:  # Skip section headers and malformed rows
            continue

        # The first cell holds "code<full-width space>name"
        code, name = cells[0].text.split("\u3000")
        internationality = cells[1].text  # ISIN code
        list_date = cells[2].text
        market_type = cells[3].text
        industry_type = cells[4].text

        data[code] = {
            "Name": name,
            "Code": code,
            "Market Type": market_type,
            "Industry Type": industry_type,
            "Listing Date": list_date,
            "International Code": internationality,
        }

with open("stock_infos.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

print("All data has been processed and saved to stock_infos.json")

Output Results

stock_infos.json
{
  "1101": {
    "Name": "Taiwan Cement",
    "Code": "1101",
    "Market Type": "Listed",
    "Industry Type": "Cement Industry",
    "Listing Date": "1962/02/09",
    "International Code": "TW0001101004"
  },
  "1102": {
    "Name": "Asia Cement",
    "Code": "1102",
    "Market Type": "Listed",
    "Industry Type": "Cement Industry",
    "Listing Date": "1962/06/08",
    "International Code": "TW0001102002"
  },
  ...omitting the rest...
}

We output the result as a JSON file for convenient integration with other programs.
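
For example, any other script can read the file back with the standard library alone (a minimal sketch; "1101" is the Taiwan Cement entry shown above):

import json

with open("stock_infos.json", encoding="utf-8") as f:
    stocks = json.load(f)

print(stocks["1101"]["Name"])        # Taiwan Cement
print(f"{len(stocks)} securities loaded")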

FAQs

I only want ordinary stocks.

I assume you mean stocks with "four-digit" codes, excluding ETFs, warrants, etc. To achieve this, add a filtering condition inside the row loop, right after the code is parsed:

if len(code) != 4:
    continue

I only want specific industries.

This requirement can be extended to specific market types, industry types, listing dates, or even the previous "ordinary stocks" question. We just need to load the output JSON file into Pandas and filter with conditions:

import pandas as pd

df = pd.read_json("stock_infos.json", orient="index")
target = df[df["Industry Type"] == "Cement Industry"]
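
The same pattern covers the other fields mentioned above, for example filtering by market type or by listing date (a sketch; the column names are the JSON keys produced by the script):

import pandas as pd

df = pd.read_json("stock_infos.json", orient="index")

# Listed companies only
listed = df[df["Market Type"] == "Listed"]

# Securities listed before the year 2000
dates = pd.to_datetime(df["Listing Date"], format="%Y/%m/%d", errors="coerce")
veterans = df[dates < "2000-01-01"]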

The program is broken.

That is most likely due to a change in the TWSE website's layout, which alters the HTML structure the script expects. The parsing code will need to be adjusted accordingly.

Conclusion

By periodically running this program, we can obtain the latest stock information.
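
On Linux or macOS, for instance, a crontab entry along these lines (the path is a placeholder for wherever you keep the script) would refresh the list every Monday at 08:00:

0 8 * * 1  cd /path/to/project && python update_stocks_code.py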

If you have any questions or suggestions, feel free to leave a comment below!