Statistics and Linear Algebra Curriculum for the Humanities and Sociology (Diary)

Last updated at 2025-05-07, posted at 2025-05-06

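Part 1: Data Literacy Basics for the Humanities and Sociology

1.1 Why Do We Need Quantitative Methods? (Extended Version)

Background: the shift toward "datafication" in the humanities and social sciences

Qualitative approaches (e.g., interviews, historical sources, close reading of texts) remain central.
In recent years, however,
- collecting quantitative data through questionnaires and surveys,
- natural language processing and quantitative analysis that turn text and images into vectors, and
- presenting evidence through statistical hypothesis testing
  have drawn attention as ways to strengthen the reliability of research.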

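Representative applications

Field	Example	Quantitative approach
Sociology of education	Relationship between family education/income and enrollment rates	Cross tabulation, regression analysis
Literary studies	Analysis of relationships among characters	Network analysis, matrix operations
History	Changes in the regional distribution of languages and religions	GIS and histograms
Cultural anthropology	Comparing the frequency and structure of rituals	Principal component analysis, cluster analysis
Linguistics	Word frequency and social attributes	TF-IDF, correlation analysis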

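Three skills required for data literacy

Reading: understanding what tables, graphs, and numbers mean
Making: organizing data, then visualizing and transforming it
Doubting: spotting the biases and assumptions behind the numbers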

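Recommended tools (for beginners)

Tool	Features	Learning curve
Excel	Intuitive operation, statistical functions, visualization	Usable within a few hours
Google Colab (Python)	Free, no installation, highly reproducible exercises	Somewhat intermediate
R (RStudio)	Strong in statistical analysis, polished graphs with ggplot2	Widely used in academia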

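Part 2: Descriptive Statistics in Practice

Goals

Understand and visualize central tendency, dispersion, and proportions
Build the statistical skills to handle real-world data (vocabulary size, educational attainment, book reviews, etc.)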

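2.1 Mean, Median, and Mode (Measures of Central Tendency)

Item	Examples	Python
Social examples	Household income, vocabulary size, text length in characters	`df.describe()`, `df.mode()`
Hands-on exercise	Word counts of literary texts, character counts of book reviews, etc.	Aggregating `len(text.split())` (words) or `len(text)` (characters)
Visualization	Histograms with mean/median lines, bar charts, labeled summary tables	`sns.histplot`, `plt.axvline()`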
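As a minimal sketch of the exercise in the table above, the following computes the three measures of central tendency for word counts. The review snippets are made up for illustration, and whitespace splitting is assumed to be an acceptable tokenizer (it would not be for unsegmented Japanese text).

```python
import pandas as pd

# Hypothetical mini-corpus of review snippets (invented for illustration).
reviews = [
    "a quiet and moving novel",
    "too long",
    "a quiet book",
    "sharp funny and surprisingly tender in its final chapters",
    "not for me",
]

# Word count per text, using whitespace tokenization.
word_counts = pd.Series([len(text.split()) for text in reviews], name="word_count")

mean_wc = word_counts.mean()
median_wc = word_counts.median()
mode_wc = word_counts.mode().tolist()  # mode() can return several values on ties

print(word_counts.tolist())  # [5, 2, 3, 9, 3]
print(f"mean={mean_wc:.1f}, median={median_wc}, mode={mode_wc}")
```

The single long review pulls the mean (4.4) above the median and mode (both 3), which is the usual reason to report more than one of these measures for skewed data such as income or text length.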

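2.2 Variance, Standard Deviation, and Deviation Score (Dispersion and Relative Standing)

Item	Examples	Python
Real-data examples	Grade distribution (GPA), spread of star ratings in book reviews	`df.std()`, `df.var()`
Introducing the deviation score (hensachi)	Standardization → evaluating relative position	Custom function or `StandardScaler`
Visualization	Histograms, box plots, z-score display	`sns.boxplot`, `plt.hist()`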
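The deviation score rescales a z-score so that the mean maps to 50 and one standard deviation to 10 points. The sketch below uses hypothetical test scores and a helper named `deviation_score` (not from the original) to show that the result depends on which standard deviation is used: the population form (`ddof=0`, which is also what `StandardScaler` uses) or the sample form (`ddof=1`, the pandas default).

```python
import numpy as np

# Hypothetical test scores (invented for illustration).
scores = np.array([55.0, 60.0, 65.0, 70.0, 100.0])

def deviation_score(x, ddof=0):
    """Return 50 + 10 * z, where z is the standardized value of x."""
    z = (x - x.mean()) / x.std(ddof=ddof)
    return 50 + 10 * z

dev_pop = deviation_score(scores, ddof=0)     # population SD
dev_sample = deviation_score(scores, ddof=1)  # sample SD (pandas default)

print(np.round(dev_pop, 1))
print(np.round(dev_sample, 1))
```

With only a handful of observations the two versions differ visibly, so it may be worth stating which one a report uses.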

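2.3 Cross Tabulation and Proportions (Organizing Categorical Variables)

Item	Examples	Python
Social survey examples	Gender × university enrollment, newspaper subscription × voting intention	`pd.crosstab()`, `groupby().mean()`
Visualizing proportions	Heatmaps, stacked bar charts, etc.	`sns.heatmap`, `df.plot(kind='bar', stacked=True)`
Sociological discussion	Structural bias and a sense of balance across categories	Supplement the analysis with a written discussion in the report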
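A small sketch of the gender × enrollment example from the table, using invented survey responses. It shows raw counts with totals and then row proportions, which are usually what one compares across groups.

```python
import pandas as pd

# Hypothetical survey responses (invented for illustration).
survey = pd.DataFrame({
    "gender":   ["F", "F", "F", "F", "M", "M", "M", "M", "M", "F"],
    "enrolled": ["yes", "yes", "no", "yes", "no", "yes", "no", "no", "yes", "no"],
})

# Raw counts, with row and column totals in an "All" margin.
counts = pd.crosstab(survey["gender"], survey["enrolled"], margins=True)

# Row proportions: within each gender, the share that enrolled or not.
row_share = pd.crosstab(survey["gender"], survey["enrolled"], normalize="index")

print(counts)
print(row_share)
```

Reading counts and proportions side by side helps avoid over-interpreting a percentage that rests on very few respondents.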

Part 3: Inference and Causal Thinking

Goals

Learn to reason from a part to the whole
Statistically evaluate sociologically meaningful "differences" and "relationships"

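3.1 Sampling and Populations

Item	Examples	Python
Sampling and estimation	Inferring the whole from opinion polls or a subset of library users	`np.random.choice()`, a bootstrap function (e.g., `scipy.stats.bootstrap`)
Exercise	Create a pseudo-sample for a readership survey of literary works	Plot the distribution of bootstrap means
Visualization	Distribution of sample means, relationship between sample size and precision	`plt.hist()`, `plt.errorbar()`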
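The relationship between sample size and precision can be checked by simulation. The sketch below assumes a hypothetical normally distributed population and a helper named `sd_of_sample_means` (both invented here); it is an illustration, not a survey design recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 100,000 scores (mean 50, SD 10), invented for illustration.
population = rng.normal(loc=50, scale=10, size=100_000)

def sd_of_sample_means(sample_size, n_repeats=2000):
    """Draw many samples of a given size and return the SD of their means."""
    # The population is large, so sampling with replacement is a close
    # stand-in for survey sampling without replacement.
    samples = rng.choice(population, size=(n_repeats, sample_size), replace=True)
    return samples.mean(axis=1).std()

spread = {size: sd_of_sample_means(size) for size in (25, 100, 400)}
for size, sd in spread.items():
    # 10 / sqrt(n) is the theoretical standard error for a population SD of 10.
    print(f"n={size:4d}  SD of sample means={sd:.2f}  theory~{10 / np.sqrt(size):.2f}")
```

Quadrupling the sample size roughly halves the spread of the sample means, which is why precision gains become expensive as samples grow.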

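3.2 Confidence Intervals and a Feel for Error

Item	Examples	Python
Survey example	What does "a margin of error of ±5%" mean in a yes/no poll?	`scipy.stats.t.interval()`
Visualizing error	Show error bars on a bar chart	`plt.bar()` with the `yerr` argument
Educational value	A feel for "estimates that carry error" and for drawing cautious conclusions	Reinforce with both numbers and visuals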
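One way to make "±5%" concrete is the normal-approximation margin of error for a proportion, $z\sqrt{\hat{p}(1-\hat{p})/n}$ with $z = 1.96$ for 95% confidence. The sketch below assumes simple random sampling, and the helper name `margin_of_error` is introduced here only for illustration.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# A 50/50 split is the worst case: it gives the widest margin for a given n.
moe_400 = margin_of_error(0.5, 400)
moe_1600 = margin_of_error(0.5, 1600)

print(f"n=400:  +/-{moe_400:.3f}")   # about 0.049, i.e. roughly 5 points
print(f"n=1600: +/-{moe_1600:.4f}")  # about 0.0245
```

A poll of about 400 respondents therefore carries roughly a ±5-point margin, and four times as many respondents are needed to halve it.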

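3.3 Hypothesis Testing and p-values

Item	Examples	Python
t-test	Differences in SNS usage time and vocabulary size by gender	`scipy.stats.ttest_ind()`
χ² test	Gender × voting tendency, media use × agreement of opinion	`scipy.stats.chi2_contingency()`
Interpreting p-values	Clearing up the misconception that a small p ("significant") means "meaningful"	`print(f"p = {p_val:.3f}")`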
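The gap between "significant" and "meaningful" can be shown with simulated data: with a very large sample, a tiny difference yields a small p-value. The groups below are invented, and Cohen's d is used here as one common effect-size measure (an addition, not something the original specifies).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50_000  # very large hypothetical samples

# Two groups whose true means differ by only 0.05 standard deviations.
group_a = rng.normal(loc=3.00, scale=1.0, size=n)
group_b = rng.normal(loc=3.05, scale=1.0, size=n)

# Welch's t-test (does not assume equal variances).
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=False)

# Cohen's d: mean difference in units of the pooled standard deviation.
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"p = {p_val:.2e}, Cohen's d = {cohens_d:.3f}")
```

The p-value says the difference is unlikely to be pure sampling noise; the effect size says the difference is very small. Reporting both keeps the two questions separate.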

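3.4 Correlation and Regression

Item	Examples	Python
Correlation	SNS usage time and happiness, study time and GPA	`df.corr(numeric_only=True)`, `sns.heatmap()`
Simple regression	Predicting happiness from SNS usage time with `statsmodels.formula.api.ols()`	`ols('happiness ~ sns_use', data=df).fit()`
Visualization	Scatter plot with regression line	`sns.lmplot()`
Extensions	Multiple regression, interaction terms, standardized coefficients	`ols('happiness ~ sns_use + study_time', data=df).fit()`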
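Correlation and simple regression are two views of the same linear relationship: the slope equals $r \cdot s_y / s_x$, and after standardizing both variables the slope equals $r$ itself. The sketch below checks this on simulated data using NumPy only; the variable names mirror the table, but the numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data (invented): happiness falls slightly as SNS use rises.
sns_use = rng.normal(3, 1, 300)
happiness = 6.0 - 0.3 * sns_use + rng.normal(0, 0.8, 300)

# Pearson correlation and least-squares line.
r = np.corrcoef(sns_use, happiness)[0, 1]
slope, intercept = np.polyfit(sns_use, happiness, deg=1)

# The slope can be recovered from r and the two standard deviations.
slope_from_r = r * happiness.std(ddof=1) / sns_use.std(ddof=1)

# Standardized coefficient: regress z-scores on z-scores.
z_x = (sns_use - sns_use.mean()) / sns_use.std(ddof=1)
z_y = (happiness - happiness.mean()) / happiness.std(ddof=1)
beta_std = np.polyfit(z_x, z_y, deg=1)[0]

print(f"r = {r:.3f}, slope = {slope:.3f}, standardized slope = {beta_std:.3f}")
```

This identity holds only for a single predictor; in multiple regression, standardized coefficients generally differ from the pairwise correlations.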

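Part 4: Applied Linear Algebra

Section	Topic	Learning goals and Python examples
4.1	Vector basics and applications	Represent documents, people, and regions as numeric feature vectors (e.g., TF-IDF, topic counts) 🔧 `np.array`, `.reshape()`
4.2	Representing data as matrices	Build attendance matrices and user × item matrices for recommender systems 🔧 `np.dot`, `df.values.T`
4.3	Eigenvalues, eigenvectors, and their meaning	Where importance comes from in graph structure and PageRank 🔧 `np.linalg.eig`, `networkx.eigenvector_centrality()`
4.4	Principal component analysis (PCA) and cultural types	Compress high-dimensional data (e.g., cultural values or attitude surveys) to two dimensions for visualization and classification 🔧 `sklearn.decomposition.PCA`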
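To illustrate section 4.3, eigenvector centrality can be read off the leading eigenvector of a graph's adjacency matrix. The sketch below uses a small hypothetical five-person network and `np.linalg.eigh` (chosen here instead of `np.linalg.eig` because the adjacency matrix of an undirected graph is symmetric).

```python
import numpy as np

# Hypothetical undirected network (invented for illustration).
names = ["Alice", "Bob", "Carol", "Dave", "Eve"]
edges = [("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Dave"),
         ("Carol", "Dave"), ("Dave", "Eve")]

# Build the symmetric adjacency matrix.
index = {name: i for i, name in enumerate(names)}
A = np.zeros((len(names), len(names)))
for u, v_name in edges:
    A[index[u], index[v_name]] = 1.0
    A[index[v_name], index[u]] = 1.0

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so the last column is the eigenvector of the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(A)
lam = eigenvalues[-1]
v = eigenvectors[:, -1]
v = v if v.sum() > 0 else -v  # eigenvectors are defined up to sign

centrality = dict(zip(names, v))
most_central = max(centrality, key=centrality.get)

for name, score in centrality.items():
    print(f"{name}: {score:.3f}")
print("Most central:", most_central)
```

Each node's score is proportional to the sum of its neighbors' scores (that is what $A v = \lambda v$ says), so a node is important when it is connected to other important nodes. PageRank builds on the same idea for directed graphs.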

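Part 5: Applications (Implementation in the Humanities and Social Surveys)

Section	Topic	Practice tasks and Python
5.1	Mathematical representation of text	Bag-of-Words representation / TF-IDF transformation of literary works and newspaper articles 🔧 `TfidfVectorizer`, `CountVectorizer`
5.2	Analyzing social survey data	Process and analyze CSVs from e-Stat (Ministry of Internal Affairs and Communications), the World Bank, etc. 🔧 `pandas`, `groupby`, `merge`, `seaborn`
5.3	Networks and relational analysis	Visualize and analyze character networks in literature and relationships among interview respondents 🔧 `networkx`, `nx.degree_centrality`, `spring_layout`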
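Section 5.2 lists `merge`, which is typically needed when indicators come from different files. The sketch below joins two small tables with invented figures (not taken from e-Stat or the World Bank) and uses `indicator=True` to flag regions that failed to match.

```python
import pandas as pd

# Two hypothetical tables, as if loaded from separate CSV files.
population = pd.DataFrame({
    "region": ["A", "B", "C"],
    "population": [120000, 85000, 95000],
})
libraries = pd.DataFrame({
    "region": ["A", "B", "D"],
    "libraries": [12, 5, 7],
})

# Left join keeps every region in `population`; `_merge` records the match status.
merged = population.merge(libraries, on="region", how="left", indicator=True)
merged["libraries_per_100k"] = merged["libraries"] / merged["population"] * 100_000

# Regions with no library data show up as "left_only".
unmatched = merged.loc[merged["_merge"] == "left_only", "region"].tolist()

print(merged)
print("No library data for:", unmatched)
```

Checking for unmatched rows before computing rates is a cheap guard against silently dropping or mislabeling regions whose names differ between sources.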

# -*- coding: utf-8 -*-
# Program: sociology_literature_stats_practice.py
# Descriptive statistics exercises using sociology and literature data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# --- Fictional example dataset for sociology & literature ---
data = {
    'gender': ['M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'F'],
    'read_time_min': [35, 42, 30, 50, 60, 28, 40, 55, 38],  # Reading time in minutes
    'literary_vocab_size': [1200, 1500, 1100, 1600, 1700, 1300, 1400, 1800, 1250],  # Literary vocabulary size
    'fav_genre': ['poetry', 'novel', 'novel', 'criticism', 'poetry', 'poetry', 'criticism', 'novel', 'poetry'],  # Favorite genre
    'vote_behavior': ['conservative', 'liberal', 'liberal', 'conservative', 'neutral', 'liberal', 'conservative', 'conservative', 'neutral']  # Voting tendency
}
df = pd.DataFrame(data)

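# 1. Mean, Median, Mode
print("[Measures of central tendency]")
print("Mean (reading time, vocabulary size):\n", df[['read_time_min', 'literary_vocab_size']].mean())
print("Median:\n", df[['read_time_min', 'literary_vocab_size']].median())
# mode() returns every tied value; in this dataset all numeric values are unique,
# so each value counts as a mode and iloc[0] simply shows the smallest one.
print("Mode:\n", df.mode(numeric_only=True).iloc[0])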

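# 2. Standard deviation & deviation score
print("\n[Dispersion]")
print("Standard deviation:\n", df[['read_time_min', 'literary_vocab_size']].std())

def calc_deviation(x):
    """Deviation score (hensachi): 50 + 10 * z-score."""
    # Series.std() defaults to the sample standard deviation (ddof=1)
    return 50 + 10 * (x - x.mean()) / x.std()

df['vocab_dev'] = calc_deviation(df['literary_vocab_size'])
print("Deviation score of vocabulary size:\n", df['vocab_dev'])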

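# 3. Visualization
# The two variables are on very different scales (minutes vs. word counts),
# so draw them on separate axes instead of sharing one.
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
sns.boxplot(y=df['read_time_min'], ax=axes[0])
axes[0].set_title('Reading Time (min)')
sns.boxplot(y=df['literary_vocab_size'], ax=axes[1])
axes[1].set_title('Literary Vocabulary Size')
fig.suptitle('Boxplots of Reading Time and Literary Vocabulary')
fig.tight_layout()
plt.show()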

plt.figure()
sns.histplot(df['read_time_min'], kde=True, bins=5)
plt.title('Histogram of Reading Time')
plt.xlabel('Minutes')
plt.ylabel('Count')
plt.show()

# 4. Cross tabulation
print("\n[Cross tabulation]")
genre_by_gender = pd.crosstab(df['gender'], df['fav_genre'], normalize='index')
print("Gender vs. genre (row proportions):\n", genre_by_gender)

vote_by_genre = pd.crosstab(df['fav_genre'], df['vote_behavior'], normalize='index')
print("\nGenre vs. voting tendency (row proportions):\n", vote_by_genre)

# 5. Group means
print("\nMean reading time by genre:")
print(df.groupby('fav_genre')['read_time_min'].mean())

print("\nMean literary vocabulary by voting tendency:")
print(df.groupby('vote_behavior')['literary_vocab_size'].mean())


# -*- coding: utf-8 -*-
# Program: inference_causality_colab_sociology.py
# Integrated Google Colab code for inferential statistics and causal analysis in a sociological context

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.formula.api as smf
from sklearn.utils import resample

# --- Data generation (with a sociological backdrop) ---
np.random.seed(42)
n = 200

# Gender (Male/Female), political participation (Support/Oppose), age group
gender = np.random.choice(['Male', 'Female'], n)
vote = np.random.choice(['Support', 'Oppose'], n)
age_group = np.random.choice(['Youth', 'Adult', 'Senior'], n, p=[0.4, 0.4, 0.2])

# SNS usage time: higher for youth and women
sns_use = np.random.normal(3, 1.0, n) + (gender == 'Female') * 0.3 + (age_group == 'Youth') * 0.5

# Study time: higher for adults and women
study_time = np.random.normal(2, 0.5, n) + (age_group == 'Adult') * 0.3 + (gender == 'Female') * 0.2

# Happiness: lowered by SNS use, raised by study time and by a "Support" vote, plus random noise
happiness = 5.5 - 0.2 * sns_use + 0.4 * study_time + (vote == 'Support') * 0.5 + np.random.normal(0, 0.5, n)

df = pd.DataFrame({
    'gender': gender,
    'vote': vote,
    'age_group': age_group,
    'sns_use': sns_use,
    'study_time': study_time,
    'happiness': happiness
})

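# --- 3.1 Bootstrap (mean happiness) ---
# Resample with replacement at the original sample size n, so the spread of the
# bootstrap means reflects the uncertainty of the observed sample mean.
boot_means = [np.mean(np.random.choice(df['happiness'], n, replace=True)) for _ in range(1000)]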
plt.hist(boot_means, bins=30, color='lightgreen', edgecolor='black')
plt.title("Bootstrapped Happiness Mean Distribution")
plt.xlabel("Mean Happiness")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()

# --- 3.2 Confidence interval ---
mean_h = df['happiness'].mean()
sem_h = stats.sem(df['happiness'])
ci_low, ci_high = stats.t.interval(0.95, df=n-1, loc=mean_h, scale=sem_h)
print(f"95% confidence interval (happiness): {ci_low:.3f} to {ci_high:.3f}")

# --- 3.3 Hypothesis tests ---
# t-test (SNS use by gender)
t_stat, p_val = stats.ttest_ind(df[df['gender'] == 'Male']['sns_use'],
                                df[df['gender'] == 'Female']['sns_use'])
print(f"\nt-test (SNS use: male vs. female) p = {p_val:.4f}")

# χ² test (gender and voting tendency)
cont_table = pd.crosstab(df['gender'], df['vote'])
chi2, chi_p, _, _ = stats.chi2_contingency(cont_table)
print(f"χ² test (gender × voting tendency) p = {chi_p:.4f}")

# --- 3.4 Multiple regression with bootstrap confidence intervals ---
model_multi = smf.ols('happiness ~ sns_use + study_time + C(vote) + C(gender)', data=df).fit()
boot_coeffs = []
for _ in range(1000):
    boot_sample = resample(df)
    boot_model = smf.ols('happiness ~ sns_use + study_time + C(vote) + C(gender)', data=boot_sample).fit()
    boot_coeffs.append(boot_model.params)

boot_df = pd.DataFrame(boot_coeffs)
ci_sns = np.percentile(boot_df['sns_use'], [2.5, 97.5])
ci_study = np.percentile(boot_df['study_time'], [2.5, 97.5])

# --- Show results ---
print("\nMultiple regression results with 95% bootstrap confidence intervals:")
print(f"SNS use → happiness: {model_multi.params['sns_use']:.3f} (CI: {ci_sns[0]:.3f} to {ci_sns[1]:.3f})")
print(f"Study time → happiness: {model_multi.params['study_time']:.3f} (CI: {ci_study[0]:.3f} to {ci_study[1]:.3f})")

# --- Scatter plot with regression lines ---
sns.lmplot(x='sns_use', y='happiness', data=df, hue='gender')
plt.title("Regression: SNS Use vs Happiness (by Gender)")
plt.grid(True)
plt.show()

# -*- coding: utf-8 -*-
# Program: linear_algebra_application_sociology_pca.py
# Contents: applying linear algebra to sociological data (principal component analysis and grouping into cultural types)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# --- Data generation: simulated cultural values dataset ---
np.random.seed(0)
n = 100
df = pd.DataFrame({
    'tradition': np.random.normal(3, 0.7, n),      # Tradition
    'individualism': np.random.normal(4, 0.6, n),  # Individualism
    'authority': np.random.normal(2.5, 0.8, n),    # Acceptance of authority
    'rationalism': np.random.normal(3.5, 0.5, n),  # Rationalism
    'hedonism': np.random.normal(2.8, 0.6, n)      # Hedonism
})

# --- Standardization ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# --- Principal Component Analysis (PCA) ---
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# --- Save results ---
df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])

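# --- Split PC1 into three equal-width bins as illustrative "cultural types" ---
# Note: this is binning, not clustering, and the sign of PC1 is arbitrary,
# so the labels below are placeholders rather than substantive findings.
df_pca['cluster_label'] = pd.cut(df_pca['PC1'], bins=3, labels=['Tradition-Oriented', 'Mixed', 'Modern-Oriented'])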

# --- Show results ---
print("\nExplained Variance Ratio:", pca.explained_variance_ratio_)

# --- Visualization ---
plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue='cluster_label', data=df_pca, palette='Set2', s=80)
plt.title('Cultural Types by Principal Components')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.legend(title='Cultural Cluster')
plt.show()


# -*- coding: utf-8 -*-
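# Program: applied_humanities_sociology_full.py
# Part 5: an integrated Python program applying these methods to the humanities and social surveys
# Contents: TF-IDF vectorization, regional statistics, character network analysis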

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer

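# --- 5.1 Mathematical representation of text (example: newspaper articles) ---
# The sample sentences are in English so that the default tokenizer can split them
# into words; unsegmented Japanese text would first need word segmentation
# (e.g., with a morphological analyzer).
documents = [
    "Debate is advancing on the relationship between economic growth and employment.",
    "Measures to address environmental problems are needed.",
    "Advances in AI technology are affecting society.",
    "Education reform is under way, and online learning is drawing attention.",
    "The connection between politics and the economy is being debated."
]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(documents)
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=vectorizer.get_feature_names_out())
print("[TF-IDF document vectors]")
print(tfidf_df)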

# --- 5.2 Analyzing social survey data (regional statistics) ---
demo_data = {
    'Region': ['City A', 'City B', 'City C', 'City D', 'City E'],
    'Population': [120000, 85000, 95000, 130000, 70000],
    'University_Graduates': [36000, 21000, 25000, 40000, 15000],
    'Employed': [95000, 70000, 72000, 100000, 60000]
}
demo_df = pd.DataFrame(demo_data)
demo_df['Graduate_Rate'] = demo_df['University_Graduates'] / demo_df['Population']
demo_df['Employment_Rate'] = demo_df['Employed'] / demo_df['Population']

print("\n[Regional statistics and ratios]")
print(demo_df[['Region', 'Graduate_Rate', 'Employment_Rate']])

# --- 5.3 Network analysis (character relationships) ---
G = nx.Graph()
edges = [('Alice', 'Bob'), ('Alice', 'Carol'), ('Bob', 'Dave'), ('Carol', 'Dave'), ('Dave', 'Eve')]
G.add_edges_from(edges)

centrality = nx.degree_centrality(G)
nx.set_node_attributes(G, centrality, 'centrality')

pos = nx.spring_layout(G, seed=42)
plt.figure(figsize=(8, 6))
# Scale node size by degree centrality so the plot matches its title
node_sizes = [3000 * centrality[node] for node in G.nodes()]
nx.draw(G, pos, with_labels=True, node_color='lightblue', edge_color='gray', node_size=node_sizes)
plt.title("Character Network (Degree Centrality)")
plt.show()

print("\n[Degree centrality of each character]")
for node, c in centrality.items():
    print(f"{node}: {c:.2f}")
